At Thu, 3 Nov 2011 02:56:04 +0100, Joerg Sonnenberger <joerg%britannica.bec.de@localhost> wrote:
Subject: Re: getrusage() problems with user vs. system time reporting
>
> On Wed, Nov 02, 2011 at 06:23:40PM -0700, Greg A. Woods wrote:
> >
> > Unfortunately getbinuptime() isn't immediately looking a whole lot
> > better than the statistical sampling in statclock(), though perhaps,
> > with enough benchmark runtime, it is, as expected, being _much_ more
> > fair at splitting between user and system time.
>
> Unlikely, given the nature of what get*time is.

Indeed, I'm beginning to see this.

So what I'm really looking for, I think, is a simple short-span monotonic
and relatively stable clock timer counter that ticks at a rate no more than
an order of magnitude or so faster than the rate of hardclock_ticks (fast
enough that its precision is "high enough").

I tried using hardclock_ticks in sys/syscall_stats.h, but even with HZ=1000
the resolution was too coarse compared to the time taken by some system
calls, and even compared to the average time slice spent in user mode.  It
is _way_ better than statclock ticks though, and "almost" free.  It might be
the best default method to use, with something like binuptime() or
cpu_counter32() available as more accurate but more costly alternatives.

Going back to the cpu_counter32() variant (e.g. using RDTSC on Intel), with
some other tweaks to the time allotment algorithm, and now that I've cleaned
up several other little oddities in my changes, the results are more
interesting, but still somehow wonky in ways I can't explain.

I've attached my new syscall_stats.h and my modified kern_resource.c for
reference.  There are a few other minor changes necessary in other files --
if anyone actually wants to try this themselves I can try to pick them out
and post them too.

I've taken the most significant 24 bits of information from the
cpu_counter32() sums and used them to divide out the p_rtime total between
user and system time, and I'm still seeing anomalies.  (Note that the test
program, the tcountbits.c I posted earlier and have attached here again, was
compiled with -O0 with the native netbsd-5 compiler -- I should look at the
generated assembler to be sure it matches my expectations, but I haven't
done that yet.)

First off, the user time for bare time() calls seems excessively high --
time() does almost nothing in user mode.  The user time for nulltime()
should be that of time() plus the overhead of wrapping the time() calls in
yet another function call, yet here it appears to be faster than time()
itself.  Interestingly countbits_dense() shows a similar anomaly, and when
we look at the system time for those two tests we see each is somewhat
higher than expected -- indeed the system time should be _identical_ for
every test!  Somehow the distribution of p_uticks and p_sticks seems off in
these two test cases.  I cannot explain this.

Keep in mind this is on a VirtualBox VM (with 1 CPU configured) running on
my desktop dual-core iMac.

As an aside, note that compared to runs without SYSCALL_PROCTIMES, this use
of cpu_counter32() seems to be almost as expensive as using binuptime() to
directly measure the time spent in each mode.  Maybe cpu_counter32() is
faster on real hardware?  What if cpu_counter32() had a CPUID instruction
inserted, as per the Intel recommendations -- or is that even necessary when
it's already inside a function call?
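To spell out the tick-normalization and apportionment arithmetic described
above, here is a self-contained userland sketch.  The sample inputs are the
raw cpu_counter32()/p_iticks sums from the second getrusage() record shown
below; everything else about it (the standalone main(), the exact helper
shape) is illustrative only, not the kernel code itself:

#include <stdint.h>
#include <stdio.h>

/* position of the most significant set bit, i.e. floor(log2(v)) */
static unsigned int
msb_uint64(uint64_t v)
{
	unsigned int mb = 0;

	while (v >>= 1)
		mb++;
	return mb;
}

int
main(void)
{
	/* sample inputs taken from the second getrusage() record below */
	uint64_t ut = 474670110703ULL;	/* cpu_counter32() delta sum, user */
	uint64_t st = 691096516145ULL;	/* cpu_counter32() delta sum, system */
	uint64_t it = 11;		/* statclock ticks, interrupt */
	uint64_t rtus = 156295933ULL;	/* p_rtime total, in microseconds */
	uint64_t utn = ut, stn = st, itn = it, tot;
	unsigned int hcMSBit;

	/*
	 * Normalize: keep only the top 24 bits of the two big counters
	 * so that the (rtus * n) products below cannot overflow 64 bits.
	 */
	hcMSBit = msb_uint64(ut) < msb_uint64(st) ?
	    msb_uint64(ut) : msb_uint64(st);
	if (hcMSBit > 24) {
		utn = ut >> (hcMSBit - 24);
		stn = st >> (hcMSBit - 24);
	}
	tot = stn + utn + itn;

	/* prints: u=63639664 us, s=92656244 us, i=24 us */
	printf("u=%llu us, s=%llu us, i=%llu us\n",
	    (unsigned long long)(rtus * utn / tot),
	    (unsigned long long)(rtus * stn / tot),
	    (unsigned long long)(rtus * itn / tot));
	return 0;
}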
$ /usr/bin/time -l /root/tmp/tcountbits -t -i 10
tcountbits: using CLOCK_MONOTONIC timer with resolution: 0 s, 279 ns
tcountbits: now running each algorithm for 10000000 iterations....
            time() = 6.3639 us/c user,  9.2655 us/c sys, 0.0073 us/c wait, 15.6367 us/c wall
        nulltime() = 5.8032 us/c user, 10.3264 us/c sys, 0.0070 us/c wait, 16.1366 us/c wall
countbits_sparse() = 6.1739 us/c user,  9.9215 us/c sys, 0.0333 us/c wait, 16.1287 us/c wall
 countbits_dense() = 5.6236 us/c user, 10.2241 us/c sys, 0.0076 us/c wait, 15.8553 us/c wall
      COUNT_BITS() = 6.4009 us/c user,  9.6854 us/c sys, 0.0060 us/c wait, 16.0922 us/c wall
      count_bits() = 6.4157 us/c user,  9.5008 us/c sys, 0.0072 us/c wait, 15.9237 us/c wall
   count_ul_bits() = 6.5759 us/c user,  9.1104 us/c sys, 0.0381 us/c wait, 15.7243 us/c wall
     1115.00 real       433.64 user       680.60 sys
   20268  maximum resident set size
       7  average shared memory size
     624  average unshared data size
      11  average unshared stack size
      59  page reclaims
       0  page faults
       0  swaps
       0  block input operations
       0  block output operations
       0  messages sent
       0  messages received
       0  signals received
      -6  voluntary context switches
    1999  involuntary context switches

This is getting rather hard to read, but below you see the calcru() results
for each getrusage() call in my test, as well as the final one from the exit
of the test program.  Here "msb" is the highest common significant bit set
in the p_{u,s}ticks values (which are the sums of cpu_counter32() deltas
during user and system execution time respectively).  The first "us"
(microseconds) value is the p_rtime sum.  The numbers in square brackets are
the respective user and system cpu_counter32() values, first the raw
uint64_t value, then the result of right-shifting by (msb-24) IFF msb is
greater than 24.  The adjusted numbers are then used to calculate the
distribution of user and system time from p_rtime with:

	tot = stn + utn + itn;
	stus = (rtus * stn) / tot;
	utus = (rtus * utn) / tot;

(the p_iticks values obviously turn into noise for now)

getrusage: tcountbits[368]: 2002 us {msb=21} (u=1127 us [3361834:3361834], s=874 us [2609217:2609217], i=0 us [0:0])
getrusage: tcountbits[368]: 156295933 us {msb=38} (u=63639664 us [474670110703:28971564], s=92656244 us [691096516145:42181183], i=24 us [11:11])
getrusage: tcountbits[368]: 156351493 us {msb=38} (u=63653160 us [474671312094:28971637], s=92698308 us [691265389092:42191491], i=24 us [11:11])
getrusage: tcountbits[368]: 317647009 us {msb=39} (u=121684914 us [900743552766:27488511], s=195962010 us [1450562013490:44267639], i=84 us [19:19])
getrusage: tcountbits[368]: 317694508 us {msb=39} (u=121695649 us [900744011979:27488525], s=195998774 us [1450706927538:44272061], i=84 us [19:19])
getrusage: tcountbits[368]: 478648621 us {msb=40} (u=183434476 us [1353383060021:20650986], s=295213886 us [2178093657574:33235071], i=257 us [29:29])
getrusage: tcountbits[368]: 478687514 us {msb=40} (u=183443222 us [1353383247687:20650989], s=295244033 us [2178212513372:33236885], i=257 us [29:29])
getrusage: tcountbits[368]: 637164825 us {msb=40} (u=239679681 us [1748498991972:26679977], s=397484811 us [2899710953519:44246077], i=332 us [37:37])
getrusage: tcountbits[368]: 637218711 us {msb=40} (u=239691463 us [1748499138716:26679979], s=397526915 us [2899875756167:44248592], i=332 us [37:37])
getrusage: tcountbits[368]: 798081612 us {msb=41} (u=303700144 us [2212920241381:16883241], s=494380657 us [3602319462730:27483516], i=809 us [45:45])
getrusage: tcountbits[368]: 798116956 us {msb=41} (u=303708162 us [2212921841727:16883253], s=494407983 us [3602426032870:27484329], i=809 us [45:45])
getrusage: tcountbits[368]: 957282508 us {msb=41} (u=367865411 us [2689388073356:20518402], s=589416146 us [4309099709708:32875821], i=950 us [53:53])
getrusage: tcountbits[368]: 957334249 us {msb=41} (u=367876996 us [2689388226866:20518403], s=589456301 us [4309257773083:32877027], i=950 us [53:53])
getrusage: tcountbits[368]: 1114197060 us {msb=41} (u=433636115 us [3178199198852:24247735], s=680559835 us [4987948728046:38055028], i=1108 us [62:62])
exit|tty: tcountbits[368]: 1114256566 us {msb=41} (u=433649747 us [3178199411255:24247737], s=680605709 us [4988128542691:38056400], i=1108 us [62:62])

Finally, as a representative sample of some sort, here are the calcru()
results from the cron jobs that ran while my test program was running:

exit|tty: atrun[372]: 2386 us {msb=21} (u=1172 us [2906384:2906384], s=1213 us [3010251:3010251], i=0 us [0:0])
exit|tty: sh[369]: 4183 us {msb=21} (u=2659 us [7089797:7089797], s=1523 us [4063198:4063198], i=0 us [0:0])
exit|tty: cron[366]: 2322 us {msb=20} (u=2055 us [10443497:10443497], s=266 us [1352571:1352571], i=0 us [0:0])
exit|tty: newsyslog[361]: 5204 us {msb=22} (u=2581 us [7578918:7578918], s=2622 us [7700252:7700252], i=0 us [0:0])
exit|tty: sh[370]: 5829 us {msb=22} (u=2665 us [7345858:7345858], s=3163 us [8719226:8719226], i=0 us [0:0])
exit|tty: cron[356]: 2144 us {msb=20} (u=1736 us [5669431:5669431], s=407 us [1329078:1329078], i=0 us [0:0])
exit|tty: newsyslog[373]: 4218 us {msb=22} (u=1877 us [5575281:5575281], s=2340 us [6948941:6948941], i=0 us [0:0])
exit|tty: sh[394]: 4469 us {msb=22} (u=2746 us [7330159:7330159], s=1722 us [4597957:4597957], i=0 us [0:0])
exit|tty: cron[365]: 1800 us {msb=20} (u=1419 us [4460357:4460357], s=380 us [1197219:1197219], i=0 us [0:0])
exit|tty: atrun[391]: 4005 us {msb=22} (u=2486 us [7169577:7169577], s=1518 us [4379431:4379431], i=0 us [0:0])
exit|tty: sh[98]: 4053 us {msb=22} (u=2361 us [6331740:6331740], s=1691 us [4536196:4536196], i=0 us [0:0])
exit|tty: cron[377]: 1787 us {msb=20} (u=1586 us [9173955:9173955], s=200 us [1157063:1157063], i=0 us [0:0])

-- 
					Greg A. Woods
					Planix, Inc.

<woods%planix.com@localhost>    +1 250 762-7675    http://www.planix.com/
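The bintime-to-nanoseconds conversion used by the SYSCALL_STATS_USE_UPTIME
variant of __syscall_time() in the header below can be read in isolation as
follows.  This is a userland sketch only; struct bintime is redeclared here
just so the fragment stands alone (sec is time_t in the kernel, written as
int64_t here):

#include <stdint.h>

struct bintime {		/* 64.64 fixed point, as in <sys/time.h> */
	int64_t  sec;
	uint64_t frac;
};

/*
 * Scale the top 32 bits of the binary fraction by 10^9 to get the
 * nanosecond part, then add in whole seconds.  The final truncation
 * to 32 bits is harmless because only deltas between nearby samples
 * are ever used.
 */
static inline uint32_t
bintime2ns32(const struct bintime *bt)
{
	uint64_t ns;

	ns = (uint64_t)bt->sec * 1000000000ULL;
	ns += (1000000000ULL * (uint32_t)(bt->frac >> 32)) >> 32;
	return (uint32_t)ns;
}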
/*	$NetBSD: syscall_stats.h,v 1.3 2008/04/29 06:53:03 martin Exp $	*/

/*-
 * Copyright (c) 2007 The NetBSD Foundation, Inc.
 * All rights reserved.
 *
 * This code is derived from software contributed to The NetBSD Foundation
 * by David Laight.
 *
 * Redistribution and use in source and binary forms, with or without
 * modification, are permitted provided that the following conditions
 * are met:
 * 1. Redistributions of source code must retain the above copyright
 *    notice, this list of conditions and the following disclaimer.
 * 2. Redistributions in binary form must reproduce the above copyright
 *    notice, this list of conditions and the following disclaimer in the
 *    documentation and/or other materials provided with the distribution.
 *
 * THIS SOFTWARE IS PROVIDED BY THE NETBSD FOUNDATION, INC. AND CONTRIBUTORS
 * ``AS IS'' AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED
 * TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR
 * PURPOSE ARE DISCLAIMED.  IN NO EVENT SHALL THE FOUNDATION OR CONTRIBUTORS
 * BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR
 * CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF
 * SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS
 * INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN
 * CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE)
 * ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE
 * POSSIBILITY OF SUCH DAMAGE.
 */

#ifndef _SYS_SYCALL_STAT_H_
#define _SYS_SYCALL_STAT_H_

#include "opt_syscall_stats.h"

/*
 * keep track of how many times each system call is called for the whole
 * system
 */
#ifdef SYSCALL_COUNTS
# include <sys/syscall.h>

extern uint64_t syscall_counts[SYS_NSYSENT];

/*
 * used in syscall() entry to accumulate counts for each system call
 *
 * "table" must always be "syscall_counts",
 * "code" is the system call number
 */
# define SYSCALL_COUNT(table, code)	((table)[code]++)
#else
# define SYSCALL_COUNT(table, code)	/* nothing */
#endif /* SYSCALL_COUNTS */

/*
 * keep track of overall CPU time used per system call for the whole system
 */
#if defined(SYSCALL_TIMES)
# include <machine/cpu_counter.h>

extern uint64_t syscall_times[SYS_NSYSENT];

# define SYSCALL_TIME_SYS_SET(l, table, code)	\
	(l)->l_syscall_tp = (table) + (code)

# define SYSCALL_TIME_UPDATE(l, delta)		\
	*(l)->l_syscall_tp += delta

#else /* !SYSCALL_TIMES */
/*
 * Avoid some unnecessary 64-bit additions if we're not using SYSCALL_TIMES
 */
# define SYSCALL_TIME_SYS_SET(l, table, code)	/* nothing */
# define SYSCALL_TIME_UPDATE(l, delta)		/* nothing */
#endif /* SYSCALL_TIMES */

#if defined(SYSCALL_TIMES) || defined(SYSCALL_PROCTIMES)

/*
 * These are effectively state enumerators, but are also stand-ins for a
 * location in syscall_times in lwp->l_syscall_tp so must be of the same type.
 */
extern uint64_t syscall_time_user, syscall_time_system, syscall_time_interrupt;

#ifdef SYSCALL_STATS_USE_UPTIME
/*
 * ideally we would just store the "raw" struct bintime, but for now since we
 * have some relevant u_quad_t slots in struct proc to store the accumulated
 * times in, we'll convert to nanoseconds so we can use them
 */
static __inline__ __attribute__((__always_inline__)) uint32_t
__syscall_time(void)
{
	uint64_t ns;
	struct bintime bt;

	getbinuptime(&bt);		/* binuptime() is too expensive */
	ns = (bt.sec * 1000000000LL) +
	    (uint64_t) (((uint64_t) 1000000000LL *
	    (uint32_t) (bt.frac >> 32)) >> 32);

	return (ns & UINT32_MAX);	/* XXX mask unnecessary? */
}
#endif

#ifdef SYSCALL_STATS_USE_HARDCLOCK
# include <sys/kernel.h>
# define __syscall_time()	hardclock_ticks
#endif

#define SYSCALL_STATS_USE_CPUTSC	/* defined */

#ifdef SYSCALL_STATS_USE_CPUTSC
# ifndef __HAVE_CPU_COUNTER
#  error "Use of CPU timestamp counter for SYSCALL_STATS invalid: no cpu_counter()"
# endif
# include <machine/cpu_counter.h>
# ifdef SYSCALL_TIMES_HASCOUNTER
/* Force use of cycle counter - needed for Soekris systems */
#  define __syscall_time()	(cpu_counter32())
# else
#  define __syscall_time()	(cpu_hascounter() ? cpu_counter32() : 0u)
# endif
#endif

# ifdef SYSCALL_PROCTIMES
/*
 * keep track of process user and system time accurately
 */
#  define SYSCALL_TIME_UPDATE_PROC(l, fld, delta)	\
	(l)->l_proc->p_##fld##ticks += (delta)
# else
#  define SYSCALL_TIME_UPDATE_PROC(l, fld, delta)
# endif

/*
 * Process wakeup
 *
 * Used in mi_switch() as new lwp is about to run -- mark the start time of
 * its next time slice
 */
# define SYSCALL_TIME_WAKEUP(l)		\
	(l)->l_syscall_time = __syscall_time()

/*
 * lwp creation
 *
 * XXX is "syscall_time_system" the correct "state" here?
 */
# define SYSCALL_TIME_LWP_INIT(l) do {				\
	(l)->l_syscall_tp = &syscall_time_system;		\
	SYSCALL_TIME_WAKEUP(l);					\
} while (0)

/*
 * System call entry hook
 *
 * used at beginning of syscall()
 *
 * "table" must always be "syscall_times",
 * "code" is the system call number
 *
 * All time spent between the last time l->l_syscall_time was set and now was
 * spent executing user code (except for when ISRs were running).
 */
# define SYSCALL_TIME_SYS_ENTRY(l, table, code) do {		\
	uint32_t now = __syscall_time();			\
	uint32_t elapsed = now - (l)->l_syscall_time;		\
	SYSCALL_TIME_UPDATE_PROC(l, u, elapsed);		\
	SYSCALL_TIME_SYS_SET(l, table, code);			\
	(l)->l_syscall_time = now;				\
} while (0)

/*
 * process yielding from in kernel mode
 *
 * used in mi_switch() if a new lwp will execute, as well as in
 * lwp_exit_switchaway()
 *
 * All time spent between the last time l->l_syscall_time was set and now was
 * spent executing in kernel mode on behalf of l's process.
 *
 * Now yielding to another lwp.
 */
# define SYSCALL_TIME_SLEEP(l) do {				\
	uint32_t now = __syscall_time();			\
	uint32_t elapsed = now - (l)->l_syscall_time;		\
	SYSCALL_TIME_UPDATE_PROC(l, s, elapsed);		\
	SYSCALL_TIME_UPDATE(l, elapsed);			\
	(l)->l_syscall_time = now;				\
} while (0)

/*
 * System call completion
 *
 * Used at the end of syscall() just before returning to user mode.
 *
 * All time spent between the last time l->l_syscall_time was set and now was
 * spent executing in kernel mode on behalf of l's process.
 */
# define SYSCALL_TIME_SYS_EXIT(l) do {				\
	uint32_t now = __syscall_time();			\
	uint32_t elapsed = now - (l)->l_syscall_time;		\
	SYSCALL_TIME_UPDATE_PROC(l, s, elapsed);		\
	SYSCALL_TIME_UPDATE(l, elapsed);			\
	(l)->l_syscall_time = now;				\
	(l)->l_syscall_tp = &syscall_time_user;			\
} while (0)

/*
 * Interrupt entry hook
 *
 * "old" is storage for a "uint64_t *" to be passed as "saved" to
 * SYSCALL_TIME_ISR_EXIT()
 */
# define SYSCALL_TIME_ISR_ENTRY(l, old) do {			\
	uint32_t now = __syscall_time();			\
	uint32_t elapsed = now - (l)->l_syscall_time;		\
	(l)->l_syscall_time = now;				\
	old = (l)->l_syscall_tp;				\
	if ((l)->l_syscall_tp != &syscall_time_interrupt) {	\
		if ((l)->l_syscall_tp == &syscall_time_user)	\
			SYSCALL_TIME_UPDATE_PROC(l, u, elapsed); \
		else {						\
			SYSCALL_TIME_UPDATE(l, elapsed);	\
			SYSCALL_TIME_UPDATE_PROC(l, s, elapsed); \
		}						\
		(l)->l_syscall_tp = &syscall_time_interrupt;	\
	}							\
} while (0)

/*
 * Interrupt exit hook
 *
 * "saved" is the "uint64_t *" pointer passed as "old" to
 * SYSCALL_TIME_ISR_ENTRY()
 */
# define SYSCALL_TIME_ISR_EXIT(l, saved) do {			\
	uint32_t now = __syscall_time();			\
	uint32_t elapsed = now - (l)->l_syscall_time;		\
	SYSCALL_TIME_UPDATE_PROC(l, i, elapsed);		\
	(l)->l_syscall_time = now;				\
	(l)->l_syscall_tp = saved;				\
} while (0)

#endif /* SYSCALL_TIMES || SYSCALL_PROCTIMES */

#ifndef SYSCALL_TIME_LWP_INIT
# define SYSCALL_TIME_LWP_INIT(l)
# define SYSCALL_TIME_SYS_ENTRY(l, table, code)
# define SYSCALL_TIME_SLEEP(l)
# define SYSCALL_TIME_WAKEUP(l)
# define SYSCALL_TIME_SYS_EXIT(l)
# define SYSCALL_TIME_ISR_ENTRY(l, old)
# define SYSCALL_TIME_ISR_EXIT(l, saved)
# undef SYSCALL_TIME_SYS_SET
# undef SYSCALL_TIME_UPDATE
#endif

#endif /* !_SYS_SYCALL_STAT_H_ */
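To show where these hooks are intended to fire, here is a rough sketch of a
syscall dispatcher with the hooks in place.  The dispatcher body is a
placeholder, not NetBSD's actual MD trap code, but the hook placement
matches the comments in the header above:

/*
 * Illustrative placeholder dispatcher -- not the real MD syscall trap
 * code -- showing the intended placement of the hooks defined above.
 */
void
syscall(struct trapframe *frame)
{
	struct lwp *l = curlwp;
	register_t code = 0;	/* ... would be decoded from the frame ... */

	/*
	 * Everything since l_syscall_time was last set was user time:
	 * charge it to p_uticks and point l_syscall_tp at this call's
	 * slot in syscall_times[].
	 */
	SYSCALL_TIME_SYS_ENTRY(l, syscall_times, code);
	SYSCALL_COUNT(syscall_counts, code);

	/* ... look up the sysent entry and invoke sy_call() here ... */

	/*
	 * Everything since entry (minus any SYSCALL_TIME_SLEEP/WAKEUP
	 * intervals) was system time: charge it to p_sticks and flip
	 * the state back to "user" before returning.
	 */
	SYSCALL_TIME_SYS_EXIT(l);
}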
/* $NetBSD: kern_resource.c,v 1.147.4.2 2009/08/14 21:15:16 snj Exp $ */ /*- * Copyright (c) 1982, 1986, 1991, 1993 * The Regents of the University of California. All rights reserved. * (c) UNIX System Laboratories, Inc. * All or some portions of this file are derived from material licensed * to the University of California by American Telephone and Telegraph * Co. or Unix System Laboratories, Inc. and are reproduced herein with * the permission of UNIX System Laboratories, Inc. * * Redistribution and use in source and binary forms, with or without * modification, are permitted provided that the following conditions * are met: * 1. Redistributions of source code must retain the above copyright * notice, this list of conditions and the following disclaimer. * 2. Redistributions in binary form must reproduce the above copyright * notice, this list of conditions and the following disclaimer in the * documentation and/or other materials provided with the distribution. * 3. Neither the name of the University nor the names of its contributors * may be used to endorse or promote products derived from this software * without specific prior written permission. * * THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND * ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE * IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE * ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE * FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL * DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS * OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) * HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT * LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY * OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF * SUCH DAMAGE. * * @(#)kern_resource.c 8.8 (Berkeley) 2/14/95 */ #include <sys/cdefs.h> __KERNEL_RCSID(0, "$NetBSD: kern_resource.c,v 1.147.4.2 2009/08/14 21:15:16 snj Exp $"); #include "opt_syscall_stats.h" #include <sys/param.h> #include <sys/systm.h> #include <sys/kernel.h> #include <sys/file.h> #include <sys/resourcevar.h> #include <sys/malloc.h> #include <sys/kmem.h> #include <sys/namei.h> #include <sys/pool.h> #include <sys/proc.h> #include <sys/sysctl.h> #include <sys/timevar.h> #include <sys/kauth.h> #include <sys/atomic.h> #include <sys/mount.h> #include <sys/syscallargs.h> #include <sys/atomic.h> #include <uvm/uvm_extern.h> /* * Maximum process data and stack limits. * They are variables so they are patchable. */ rlim_t maxdmap = MAXDSIZ; rlim_t maxsmap = MAXSSIZ; static pool_cache_t plimit_cache; static pool_cache_t pstats_cache; void resource_init(void) { plimit_cache = pool_cache_init(sizeof(struct plimit), 0, 0, 0, "plimitpl", NULL, IPL_NONE, NULL, NULL, NULL); pstats_cache = pool_cache_init(sizeof(struct pstats), 0, 0, 0, "pstatspl", NULL, IPL_NONE, NULL, NULL, NULL); } /* * Resource controls and accounting. 
*/ int sys_getpriority(struct lwp *l, const struct sys_getpriority_args *uap, register_t *retval) { /* { syscallarg(int) which; syscallarg(id_t) who; } */ struct proc *curp = l->l_proc, *p; int low = NZERO + PRIO_MAX + 1; int who = SCARG(uap, who); mutex_enter(proc_lock); switch (SCARG(uap, which)) { case PRIO_PROCESS: if (who == 0) p = curp; else p = p_find(who, PFIND_LOCKED); if (p != NULL) low = p->p_nice; break; case PRIO_PGRP: { struct pgrp *pg; if (who == 0) pg = curp->p_pgrp; else if ((pg = pg_find(who, PFIND_LOCKED)) == NULL) break; LIST_FOREACH(p, &pg->pg_members, p_pglist) { if (p->p_nice < low) low = p->p_nice; } break; } case PRIO_USER: if (who == 0) who = (int)kauth_cred_geteuid(l->l_cred); PROCLIST_FOREACH(p, &allproc) { if ((p->p_flag & PK_MARKER) != 0) continue; mutex_enter(p->p_lock); if (kauth_cred_geteuid(p->p_cred) == (uid_t)who && p->p_nice < low) low = p->p_nice; mutex_exit(p->p_lock); } break; default: mutex_exit(proc_lock); return (EINVAL); } mutex_exit(proc_lock); if (low == NZERO + PRIO_MAX + 1) return (ESRCH); *retval = low - NZERO; return (0); } /* ARGSUSED */ int sys_setpriority(struct lwp *l, const struct sys_setpriority_args *uap, register_t *retval) { /* { syscallarg(int) which; syscallarg(id_t) who; syscallarg(int) prio; } */ struct proc *curp = l->l_proc, *p; int found = 0, error = 0; int who = SCARG(uap, who); mutex_enter(proc_lock); switch (SCARG(uap, which)) { case PRIO_PROCESS: if (who == 0) p = curp; else p = p_find(who, PFIND_LOCKED); if (p != 0) { mutex_enter(p->p_lock); error = donice(l, p, SCARG(uap, prio)); mutex_exit(p->p_lock); found++; } break; case PRIO_PGRP: { struct pgrp *pg; if (who == 0) pg = curp->p_pgrp; else if ((pg = pg_find(who, PFIND_LOCKED)) == NULL) break; LIST_FOREACH(p, &pg->pg_members, p_pglist) { mutex_enter(p->p_lock); error = donice(l, p, SCARG(uap, prio)); mutex_exit(p->p_lock); found++; } break; } case PRIO_USER: if (who == 0) who = (int)kauth_cred_geteuid(l->l_cred); PROCLIST_FOREACH(p, &allproc) { if ((p->p_flag & PK_MARKER) != 0) continue; mutex_enter(p->p_lock); if (kauth_cred_geteuid(p->p_cred) == (uid_t)SCARG(uap, who)) { error = donice(l, p, SCARG(uap, prio)); found++; } mutex_exit(p->p_lock); } break; default: mutex_exit(proc_lock); return EINVAL; } mutex_exit(proc_lock); if (found == 0) return (ESRCH); return (error); } /* * Renice a process. * * Call with the target process' credentials locked. 
*/ int donice(struct lwp *l, struct proc *chgp, int n) { kauth_cred_t cred = l->l_cred; KASSERT(mutex_owned(chgp->p_lock)); if (kauth_cred_geteuid(cred) && kauth_cred_getuid(cred) && kauth_cred_geteuid(cred) != kauth_cred_geteuid(chgp->p_cred) && kauth_cred_getuid(cred) != kauth_cred_geteuid(chgp->p_cred)) return (EPERM); if (n > PRIO_MAX) n = PRIO_MAX; if (n < PRIO_MIN) n = PRIO_MIN; n += NZERO; if (kauth_authorize_process(cred, KAUTH_PROCESS_NICE, chgp, KAUTH_ARG(n), NULL, NULL)) return (EACCES); sched_nice(chgp, n); return (0); } /* ARGSUSED */ int sys_setrlimit(struct lwp *l, const struct sys_setrlimit_args *uap, register_t *retval) { /* { syscallarg(int) which; syscallarg(const struct rlimit *) rlp; } */ int which = SCARG(uap, which); struct rlimit alim; int error; error = copyin(SCARG(uap, rlp), &alim, sizeof(struct rlimit)); if (error) return (error); return (dosetrlimit(l, l->l_proc, which, &alim)); } int dosetrlimit(struct lwp *l, struct proc *p, int which, struct rlimit *limp) { struct rlimit *alimp; int error; if ((u_int)which >= RLIM_NLIMITS) return (EINVAL); if (limp->rlim_cur < 0 || limp->rlim_max < 0) return (EINVAL); if (limp->rlim_cur > limp->rlim_max) { /* * This is programming error. According to SUSv2, we should * return error in this case. */ return (EINVAL); } alimp = &p->p_rlimit[which]; /* if we don't change the value, no need to limcopy() */ if (limp->rlim_cur == alimp->rlim_cur && limp->rlim_max == alimp->rlim_max) return 0; error = kauth_authorize_process(l->l_cred, KAUTH_PROCESS_RLIMIT, p, KAUTH_ARG(KAUTH_REQ_PROCESS_RLIMIT_SET), limp, KAUTH_ARG(which)); if (error) return (error); lim_privatise(p, false); /* p->p_limit is now unchangeable */ alimp = &p->p_rlimit[which]; switch (which) { case RLIMIT_DATA: if (limp->rlim_cur > maxdmap) limp->rlim_cur = maxdmap; if (limp->rlim_max > maxdmap) limp->rlim_max = maxdmap; break; case RLIMIT_STACK: if (limp->rlim_cur > maxsmap) limp->rlim_cur = maxsmap; if (limp->rlim_max > maxsmap) limp->rlim_max = maxsmap; /* * Return EINVAL if the new stack size limit is lower than * current usage. Otherwise, the process would get SIGSEGV the * moment it would try to access anything on it's current stack. * This conforms to SUSv2. */ if (limp->rlim_cur < p->p_vmspace->vm_ssize * PAGE_SIZE || limp->rlim_max < p->p_vmspace->vm_ssize * PAGE_SIZE) { return (EINVAL); } /* * Stack is allocated to the max at exec time with * only "rlim_cur" bytes accessible (In other words, * allocates stack dividing two contiguous regions at * "rlim_cur" bytes boundary). * * Since allocation is done in terms of page, roundup * "rlim_cur" (otherwise, contiguous regions * overlap). If stack limit is going up make more * accessible, if going down make inaccessible. 
 */
		limp->rlim_cur = round_page(limp->rlim_cur);
		if (limp->rlim_cur != alimp->rlim_cur) {
			vaddr_t addr;
			vsize_t size;
			vm_prot_t prot;

			if (limp->rlim_cur > alimp->rlim_cur) {
				prot = VM_PROT_READ | VM_PROT_WRITE;
				size = limp->rlim_cur - alimp->rlim_cur;
				addr = (vaddr_t)p->p_vmspace->vm_minsaddr -
				    limp->rlim_cur;
			} else {
				prot = VM_PROT_NONE;
				size = alimp->rlim_cur - limp->rlim_cur;
				addr = (vaddr_t)p->p_vmspace->vm_minsaddr -
				    alimp->rlim_cur;
			}
			(void) uvm_map_protect(&p->p_vmspace->vm_map, addr,
			    addr + size, prot, false);
		}
		break;

	case RLIMIT_NOFILE:
		if (limp->rlim_cur > maxfiles)
			limp->rlim_cur = maxfiles;
		if (limp->rlim_max > maxfiles)
			limp->rlim_max = maxfiles;
		break;

	case RLIMIT_NPROC:
		if (limp->rlim_cur > maxproc)
			limp->rlim_cur = maxproc;
		if (limp->rlim_max > maxproc)
			limp->rlim_max = maxproc;
		break;
	}

	mutex_enter(&p->p_limit->pl_lock);
	*alimp = *limp;
	mutex_exit(&p->p_limit->pl_lock);

	return (0);
}

/* ARGSUSED */
int
sys_getrlimit(struct lwp *l, const struct sys_getrlimit_args *uap,
    register_t *retval)
{
	/* {
		syscallarg(int) which;
		syscallarg(struct rlimit *) rlp;
	} */
	struct proc *p = l->l_proc;
	int which = SCARG(uap, which);
	struct rlimit rl;

	if ((u_int)which >= RLIM_NLIMITS)
		return (EINVAL);

	mutex_enter(p->p_lock);
	memcpy(&rl, &p->p_rlimit[which], sizeof(rl));
	mutex_exit(p->p_lock);

	return copyout(&rl, SCARG(uap, rlp), sizeof(rl));
}

/*
 * find the most significant bit set in an integer
 *
 * aka the "log base 2" of an integer
 */
static unsigned int msb_uint64(uint64_t);

static unsigned int
msb_uint64(uint64_t v)
{
	unsigned int mb = 0;

	while (v >>= 1) {		/* unroll for more speed... */
		mb++;
	}

	return mb;
}

/*
 * Transform the running time and tick information in proc p into user,
 * system, and interrupt time usage.
 *
 * Should be called with p->p_lock held unless called from exit1().
 */
void
calcru(struct proc *p, struct timeval *up, struct timeval *sp,
    struct timeval *ip, struct timeval *rp)
{
	uint64_t st, ut, it, tot;		/* "ticks" */
	uint64_t stn, utn, itn;			/* normalized "ticks" */
	uint64_t rtus, stus, utus, itus;	/* microseconds */
	unsigned int hcMSBit;
	struct lwp *l;
	struct bintime rbtm;
	struct timeval rtv;

	mutex_spin_enter(&p->p_stmutex);
	st = p->p_sticks;	/* statclock hits in system mode */
	ut = p->p_uticks;	/* statclock hits in user mode */
	it = p->p_iticks;	/* statclock hits during interrupt(s) */
	mutex_spin_exit(&p->p_stmutex);

	/*
	 * XXX we only really need tm, or tv, if we're returning it via rp,
	 * but we'll calculate it for now for analysis
	 */
	rbtm = p->p_rtime;	/* hardclock real time for any exited lwps??? */
	LIST_FOREACH(l, &p->p_lwps, l_sibling) {
		lwp_lock(l);
		bintime_add(&rbtm, &l->l_rtime);
		if ((l->l_pflag & LP_RUNNING) != 0) {
			struct bintime diff;
			/*
			 * Adjust for the current time slice.  This is
			 * actually fairly important since the error
			 * here is on the order of a time quantum,
			 * which is much greater than the sampling
			 * error.
			 */
			binuptime(&diff);		 /* "uptime" */
			bintime_sub(&diff, &l->l_stime); /* - "switchtime" */
			bintime_add(&rbtm, &diff);
		}
		lwp_unlock(l);
	}
	bintime2timeval(&rbtm, &rtv);
	rtus = (uint64_t) rtv.tv_sec * 1000000ul + rtv.tv_usec; /* tot real tm usec */

	/* not likely necessary with cpu_counter32() tickers! */
	if (ut == 0)
		ut++;		/* 0.5 or less would be more fair! :-) */
	/* else */
	if (st == 0)
		st++;		/* 0.5 or less would be more fair! :-) */

#if 1 /* ndef SYSCALL_PROCTIMES */
	/*
	 * distribute total real microseconds of runtime accumulated by this
	 * process by each type of "ticks" accounted against it
	 *
	 * XXX using statclock ticks, this can go bad!
	 *
	 * Total real time for all the lwps that are part of this "process"
	 * is calculated with true elapsed time in context switching, while
	 * statclock() only samples which lwp was executing and in what mode
	 * at stathz intervals.  If statclock() has not yet accounted the
	 * "next tick" to this process, the amount of time apportioned to
	 * each of user, system, and interrupt time will be larger than it
	 * should be, and the next time we do something to calculate the
	 * split, two of those times may appear to have gone backwards.
	 *
	 * FreeBSD saves the results of the calculation in the proc structure
	 * and then clamps subsequent results to be greater or equal to the
	 * previous results.
	 *
	 * Perhaps NetBSD could do the same, using p_stats->p_ru?
	 *
	 * A better solution would be to measure time spent in system calls
	 * using the same clock as is used for l_rtime.  That would only
	 * leave interrupt time as the wild-card being estimated by statclock
	 * ticks.  Maybe interrupt time could also be measured by the
	 * interrupt dispatcher and then accounted to the thread which it
	 * interrupted?
	 *
	 * Hmmm...  in <sys/syscall_stats.h> we have SYSCALL_PROCTIMES
	 *
	 * it records deltas of cpu_counter32() values on some platforms
	 */
# if 0
	if (stathz != 0 && stathz < hz)
		it *= hz / stathz;	/* adjust "it" from stathz to hz */
# endif

	/*
	 * "normalize" the tick counters to some reasonable magnitude
	 *
	 * Pick the highest common MSBit of each, then preserve the top-most
	 * 24 bits by shifting each number right by (hcMSBit-24).
	 *
	 * XXX except for the still-statclock-based p_iticks value.
	 */
	utn = ut;
	stn = st;
	itn = it;
	hcMSBit = min(msb_uint64(ut), msb_uint64(st));
	if (hcMSBit > 24) {
		utn = ut >> (hcMSBit - 24);
		stn = st >> (hcMSBit - 24);
	}
	tot = stn + utn + itn;		/* total "ticks" */
	stus = (rtus * stn) / tot;
	utus = (rtus * utn) / tot;
	itus = (rtus * itn) / tot;

# ifdef DIAGNOSTIC
	printf("%s: %s[%ld]: %llu us {msb=%u} (u=%llu us [%llu:%llu], s=%llu us [%llu:%llu], i=%llu us [%llu:%llu])\n",
	       (ip && rp) ? "acct_process" :
	       (ip && !rp) ? "getrusage" :
	       (!ip && rp) ? "sysctl" : "exit|tty",
	       p->p_comm, (long) p->p_pid,
	       rtus, hcMSBit,
	       utus, ut, utn,
	       stus, st, stn,
	       itus, it, itn);
# endif

	if (sp != NULL) {
		sp->tv_sec = stus / 1000000;
		sp->tv_usec = stus % 1000000;
	}
	if (up != NULL) {
		up->tv_sec = utus / 1000000;
		up->tv_usec = utus % 1000000;
	}
	if (ip != NULL) {
		ip->tv_sec = itus / 1000000;
		ip->tv_usec = itus % 1000000;
	}
#else
	/*
	 * We have _measured_ user, system (XXX and maybe interrupt) times!
	 *
	 * These are not stathz ticks, but nanoseconds (at getbinuptime()
	 * resolution)
	 */
# ifdef DIAGNOSTIC
	printf("%s: %s[%ld]: rt=%llu us, u+s=%llu us (u=%lld ns / s=%lld ns), it=%llu ticks\n",
	       (ip && rp) ? "acct_process" :
	       (ip && !rp) ? "getrusage" :
	       (!ip && rp) ? "sysctl" : "exit|tty",
	       p->p_comm, (long) p->p_pid,
	       rtus, (ut/1000) + (st/1000), ut, st, it);
# endif
	if (sp != NULL) {
		sp->tv_sec = st / 1000000000;
		sp->tv_usec = (st % 1000000000) / 1000;
	}
	if (up != NULL) {
		up->tv_sec = ut / 1000000000;
		up->tv_usec = (ut % 1000000000) / 1000;
	}
	if (ip != NULL) {
		ip->tv_sec = 0;		/* it / 1000000000 */
		ip->tv_usec = 0;	/* (it % 1000000000) / 1000 */
	}
#endif

	if (rp != NULL) {
		*rp = rtv;
	}
}

/* ARGSUSED */
int
sys_getrusage(struct lwp *l, const struct sys_getrusage_args *uap,
    register_t *retval)
{
	/* {
		syscallarg(int) who;
		syscallarg(struct rusage *) rusage;
	} */
	struct rusage ru;
	struct proc *p = l->l_proc;
	struct timeval dummy;

	switch (SCARG(uap, who)) {
	case RUSAGE_SELF:
		mutex_enter(p->p_lock);
		memcpy(&ru, &p->p_stats->p_ru, sizeof(ru));
		calcru(p, &ru.ru_utime, &ru.ru_stime, &dummy, NULL);
		rulwps(p, &ru);
		mutex_exit(p->p_lock);
		break;

	case RUSAGE_CHILDREN:
		mutex_enter(p->p_lock);
		memcpy(&ru, &p->p_stats->p_cru, sizeof(ru));
		mutex_exit(p->p_lock);
		break;

	default:
		return EINVAL;
	}

	return copyout(&ru, SCARG(uap, rusage), sizeof(ru));
}

void
ruadd(struct rusage *ru, struct rusage *ru2)
{
	long *ip, *ip2;
	int i;

	timeradd(&ru->ru_utime, &ru2->ru_utime, &ru->ru_utime);
	timeradd(&ru->ru_stime, &ru2->ru_stime, &ru->ru_stime);
	if (ru->ru_maxrss < ru2->ru_maxrss)
		ru->ru_maxrss = ru2->ru_maxrss;
	ip = &ru->ru_first;
	ip2 = &ru2->ru_first;
	for (i = &ru->ru_last - &ru->ru_first; i >= 0; i--)
		*ip++ += *ip2++;
}

void
rulwps(proc_t *p, struct rusage *ru)
{
	lwp_t *l;

	KASSERT(mutex_owned(p->p_lock));

	LIST_FOREACH(l, &p->p_lwps, l_sibling) {
		ruadd(ru, &l->l_ru);
		ru->ru_nvcsw += (l->l_ncsw - l->l_nivcsw); /* XXX ru_nvcsw can go negative! */
		ru->ru_nivcsw += l->l_nivcsw;
	}
}

/*
 * Make a copy of the plimit structure.
 * We share these structures copy-on-write after fork,
 * and copy when a limit is changed.
 *
 * Unfortunately (due to PL_SHAREMOD) it is possible for the structure
 * we are copying to change beneath our feet!
 */
struct plimit *
lim_copy(struct plimit *lim)
{
	struct plimit *newlim;
	char *corename;
	size_t alen, len;

	newlim = pool_cache_get(plimit_cache, PR_WAITOK);
	mutex_init(&newlim->pl_lock, MUTEX_DEFAULT, IPL_NONE);
	newlim->pl_flags = 0;
	newlim->pl_refcnt = 1;
	newlim->pl_sv_limit = NULL;

	mutex_enter(&lim->pl_lock);
	memcpy(newlim->pl_rlimit, lim->pl_rlimit,
	    sizeof(struct rlimit) * RLIM_NLIMITS);

	alen = 0;
	corename = NULL;
	for (;;) {
		if (lim->pl_corename == defcorename) {
			newlim->pl_corename = defcorename;
			break;
		}
		len = strlen(lim->pl_corename) + 1;
		if (len <= alen) {
			newlim->pl_corename = corename;
			memcpy(corename, lim->pl_corename, len);
			corename = NULL;
			break;
		}
		mutex_exit(&lim->pl_lock);
		if (corename != NULL)
			free(corename, M_TEMP);
		alen = len;
		corename = malloc(alen, M_TEMP, M_WAITOK);
		mutex_enter(&lim->pl_lock);
	}
	mutex_exit(&lim->pl_lock);
	if (corename != NULL)
		free(corename, M_TEMP);

	return newlim;
}

void
lim_addref(struct plimit *lim)
{
	atomic_inc_uint(&lim->pl_refcnt);
}

/*
 * Give a process its own private plimit structure.
 * This will only be shared (in fork) if modifications are to be shared.
*/ void lim_privatise(struct proc *p, bool set_shared) { struct plimit *lim, *newlim; lim = p->p_limit; if (lim->pl_flags & PL_WRITEABLE) { if (set_shared) lim->pl_flags |= PL_SHAREMOD; return; } if (set_shared && lim->pl_flags & PL_SHAREMOD) return; newlim = lim_copy(lim); mutex_enter(p->p_lock); if (p->p_limit->pl_flags & PL_WRITEABLE) { /* Someone crept in while we were busy */ mutex_exit(p->p_lock); limfree(newlim); if (set_shared) p->p_limit->pl_flags |= PL_SHAREMOD; return; } /* * Since most accesses to p->p_limit aren't locked, we must not * delete the old limit structure yet. */ newlim->pl_sv_limit = p->p_limit; newlim->pl_flags |= PL_WRITEABLE; if (set_shared) newlim->pl_flags |= PL_SHAREMOD; p->p_limit = newlim; mutex_exit(p->p_lock); } void limfree(struct plimit *lim) { struct plimit *sv_lim; do { if (atomic_dec_uint_nv(&lim->pl_refcnt) > 0) return; if (lim->pl_corename != defcorename) free(lim->pl_corename, M_TEMP); sv_lim = lim->pl_sv_limit; mutex_destroy(&lim->pl_lock); pool_cache_put(plimit_cache, lim); } while ((lim = sv_lim) != NULL); } struct pstats * pstatscopy(struct pstats *ps) { struct pstats *newps; newps = pool_cache_get(pstats_cache, PR_WAITOK); memset(&newps->pstat_startzero, 0, (unsigned) ((char *)&newps->pstat_endzero - (char *)&newps->pstat_startzero)); memcpy(&newps->pstat_startcopy, &ps->pstat_startcopy, ((char *)&newps->pstat_endcopy - (char *)&newps->pstat_startcopy)); return (newps); } void pstatsfree(struct pstats *ps) { pool_cache_put(pstats_cache, ps); } /* * sysctl interface in five parts */ /* * a routine for sysctl proc subtree helpers that need to pick a valid * process by pid. */ static int sysctl_proc_findproc(struct lwp *l, struct proc **p2, pid_t pid) { struct proc *ptmp; int error = 0; if (pid == PROC_CURPROC) ptmp = l->l_proc; else if ((ptmp = pfind(pid)) == NULL) error = ESRCH; *p2 = ptmp; return (error); } /* * sysctl helper routine for setting a process's specific corefile * name. picks the process based on the given pid and checks the * correctness of the new value. */ static int sysctl_proc_corename(SYSCTLFN_ARGS) { struct proc *ptmp; struct plimit *lim; int error = 0, len; char *cname; char *ocore; char *tmp; struct sysctlnode node; /* * is this all correct? */ if (namelen != 0) return (EINVAL); if (name[-1] != PROC_PID_CORENAME) return (EINVAL); /* * whom are we tweaking? */ error = sysctl_proc_findproc(l, &ptmp, (pid_t)name[-2]); if (error) return (error); /* XXX-elad */ error = kauth_authorize_process(l->l_cred, KAUTH_PROCESS_CANSEE, ptmp, KAUTH_ARG(KAUTH_REQ_PROCESS_CANSEE_ENTRY), NULL, NULL); if (error) return (error); if (newp == NULL) { error = kauth_authorize_process(l->l_cred, KAUTH_PROCESS_CORENAME, ptmp, KAUTH_ARG(KAUTH_REQ_PROCESS_CORENAME_GET), NULL, NULL); if (error) return (error); } /* * let them modify a temporary copy of the core name */ cname = PNBUF_GET(); lim = ptmp->p_limit; mutex_enter(&lim->pl_lock); strlcpy(cname, lim->pl_corename, MAXPATHLEN); mutex_exit(&lim->pl_lock); node = *rnode; node.sysctl_data = cname; error = sysctl_lookup(SYSCTLFN_CALL(&node)); /* * if that failed, or they have nothing new to say, or we've * heard it before... 
*/ if (error || newp == NULL) goto done; lim = ptmp->p_limit; mutex_enter(&lim->pl_lock); error = strcmp(cname, lim->pl_corename); mutex_exit(&lim->pl_lock); if (error == 0) /* Unchanged */ goto done; error = kauth_authorize_process(l->l_cred, KAUTH_PROCESS_CORENAME, ptmp, KAUTH_ARG(KAUTH_REQ_PROCESS_CORENAME_SET), cname, NULL); if (error) return (error); /* * no error yet and cname now has the new core name in it. * let's see if it looks acceptable. it must be either "core" * or end in ".core" or "/core". */ len = strlen(cname); if (len < 4) { error = EINVAL; } else if (strcmp(cname + len - 4, "core") != 0) { error = EINVAL; } else if (len > 4 && cname[len - 5] != '/' && cname[len - 5] != '.') { error = EINVAL; } if (error != 0) { goto done; } /* * hmm...looks good. now...where do we put it? */ tmp = malloc(len + 1, M_TEMP, M_WAITOK|M_CANFAIL); if (tmp == NULL) { error = ENOMEM; goto done; } memcpy(tmp, cname, len + 1); lim_privatise(ptmp, false); lim = ptmp->p_limit; mutex_enter(&lim->pl_lock); ocore = lim->pl_corename; lim->pl_corename = tmp; mutex_exit(&lim->pl_lock); if (ocore != defcorename) free(ocore, M_TEMP); done: PNBUF_PUT(cname); return error; } /* * sysctl helper routine for checking/setting a process's stop flags, * one for fork and one for exec. */ static int sysctl_proc_stop(SYSCTLFN_ARGS) { struct proc *ptmp; int i, f, error = 0; struct sysctlnode node; if (namelen != 0) return (EINVAL); error = sysctl_proc_findproc(l, &ptmp, (pid_t)name[-2]); if (error) return (error); /* XXX-elad */ error = kauth_authorize_process(l->l_cred, KAUTH_PROCESS_CANSEE, ptmp, KAUTH_ARG(KAUTH_REQ_PROCESS_CANSEE_ENTRY), NULL, NULL); if (error) return (error); switch (rnode->sysctl_num) { case PROC_PID_STOPFORK: f = PS_STOPFORK; break; case PROC_PID_STOPEXEC: f = PS_STOPEXEC; break; case PROC_PID_STOPEXIT: f = PS_STOPEXIT; break; default: return (EINVAL); } i = (ptmp->p_flag & f) ? 1 : 0; node = *rnode; node.sysctl_data = &i; error = sysctl_lookup(SYSCTLFN_CALL(&node)); if (error || newp == NULL) return (error); mutex_enter(ptmp->p_lock); error = kauth_authorize_process(l->l_cred, KAUTH_PROCESS_STOPFLAG, ptmp, KAUTH_ARG(f), NULL, NULL); if (!error) { if (i) { ptmp->p_sflag |= f; } else { ptmp->p_sflag &= ~f; } } mutex_exit(ptmp->p_lock); return error; } /* * sysctl helper routine for a process's rlimits as exposed by sysctl. */ static int sysctl_proc_plimit(SYSCTLFN_ARGS) { struct proc *ptmp; u_int limitno; int which, error = 0; struct rlimit alim; struct sysctlnode node; if (namelen != 0) return (EINVAL); which = name[-1]; if (which != PROC_PID_LIMIT_TYPE_SOFT && which != PROC_PID_LIMIT_TYPE_HARD) return (EINVAL); limitno = name[-2] - 1; if (limitno >= RLIM_NLIMITS) return (EINVAL); if (name[-3] != PROC_PID_LIMIT) return (EINVAL); error = sysctl_proc_findproc(l, &ptmp, (pid_t)name[-4]); if (error) return (error); /* XXX-elad */ error = kauth_authorize_process(l->l_cred, KAUTH_PROCESS_CANSEE, ptmp, KAUTH_ARG(KAUTH_REQ_PROCESS_CANSEE_ENTRY), NULL, NULL); if (error) return (error); /* Check if we can view limits. 
*/ if (newp == NULL) { error = kauth_authorize_process(l->l_cred, KAUTH_PROCESS_RLIMIT, ptmp, KAUTH_ARG(KAUTH_REQ_PROCESS_RLIMIT_GET), &alim, KAUTH_ARG(which)); if (error) return (error); } node = *rnode; memcpy(&alim, &ptmp->p_rlimit[limitno], sizeof(alim)); if (which == PROC_PID_LIMIT_TYPE_HARD) node.sysctl_data = &alim.rlim_max; else node.sysctl_data = &alim.rlim_cur; error = sysctl_lookup(SYSCTLFN_CALL(&node)); if (error || newp == NULL) return (error); return (dosetrlimit(l, ptmp, limitno, &alim)); } /* * and finally, the actually glue that sticks it to the tree */ SYSCTL_SETUP(sysctl_proc_setup, "sysctl proc subtree setup") { sysctl_createv(clog, 0, NULL, NULL, CTLFLAG_PERMANENT, CTLTYPE_NODE, "proc", NULL, NULL, 0, NULL, 0, CTL_PROC, CTL_EOL); sysctl_createv(clog, 0, NULL, NULL, CTLFLAG_PERMANENT|CTLFLAG_ANYNUMBER, CTLTYPE_NODE, "curproc", SYSCTL_DESCR("Per-process settings"), NULL, 0, NULL, 0, CTL_PROC, PROC_CURPROC, CTL_EOL); sysctl_createv(clog, 0, NULL, NULL, CTLFLAG_PERMANENT|CTLFLAG_READWRITE|CTLFLAG_ANYWRITE, CTLTYPE_STRING, "corename", SYSCTL_DESCR("Core file name"), sysctl_proc_corename, 0, NULL, MAXPATHLEN, CTL_PROC, PROC_CURPROC, PROC_PID_CORENAME, CTL_EOL); sysctl_createv(clog, 0, NULL, NULL, CTLFLAG_PERMANENT, CTLTYPE_NODE, "rlimit", SYSCTL_DESCR("Process limits"), NULL, 0, NULL, 0, CTL_PROC, PROC_CURPROC, PROC_PID_LIMIT, CTL_EOL); #define create_proc_plimit(s, n) do { \ sysctl_createv(clog, 0, NULL, NULL, \ CTLFLAG_PERMANENT, \ CTLTYPE_NODE, s, \ SYSCTL_DESCR("Process " s " limits"), \ NULL, 0, NULL, 0, \ CTL_PROC, PROC_CURPROC, PROC_PID_LIMIT, n, \ CTL_EOL); \ sysctl_createv(clog, 0, NULL, NULL, \ CTLFLAG_PERMANENT|CTLFLAG_READWRITE|CTLFLAG_ANYWRITE, \ CTLTYPE_QUAD, "soft", \ SYSCTL_DESCR("Process soft " s " limit"), \ sysctl_proc_plimit, 0, NULL, 0, \ CTL_PROC, PROC_CURPROC, PROC_PID_LIMIT, n, \ PROC_PID_LIMIT_TYPE_SOFT, CTL_EOL); \ sysctl_createv(clog, 0, NULL, NULL, \ CTLFLAG_PERMANENT|CTLFLAG_READWRITE|CTLFLAG_ANYWRITE, \ CTLTYPE_QUAD, "hard", \ SYSCTL_DESCR("Process hard " s " limit"), \ sysctl_proc_plimit, 0, NULL, 0, \ CTL_PROC, PROC_CURPROC, PROC_PID_LIMIT, n, \ PROC_PID_LIMIT_TYPE_HARD, CTL_EOL); \ } while (0/*CONSTCOND*/) create_proc_plimit("cputime", PROC_PID_LIMIT_CPU); create_proc_plimit("filesize", PROC_PID_LIMIT_FSIZE); create_proc_plimit("datasize", PROC_PID_LIMIT_DATA); create_proc_plimit("stacksize", PROC_PID_LIMIT_STACK); create_proc_plimit("coredumpsize", PROC_PID_LIMIT_CORE); create_proc_plimit("memoryuse", PROC_PID_LIMIT_RSS); create_proc_plimit("memorylocked", PROC_PID_LIMIT_MEMLOCK); create_proc_plimit("maxproc", PROC_PID_LIMIT_NPROC); create_proc_plimit("descriptors", PROC_PID_LIMIT_NOFILE); create_proc_plimit("sbsize", PROC_PID_LIMIT_SBSIZE); create_proc_plimit("vmemoryuse", PROC_PID_LIMIT_AS); #undef create_proc_plimit sysctl_createv(clog, 0, NULL, NULL, CTLFLAG_PERMANENT|CTLFLAG_READWRITE|CTLFLAG_ANYWRITE, CTLTYPE_INT, "stopfork", SYSCTL_DESCR("Stop process at fork(2)"), sysctl_proc_stop, 0, NULL, 0, CTL_PROC, PROC_CURPROC, PROC_PID_STOPFORK, CTL_EOL); sysctl_createv(clog, 0, NULL, NULL, CTLFLAG_PERMANENT|CTLFLAG_READWRITE|CTLFLAG_ANYWRITE, CTLTYPE_INT, "stopexec", SYSCTL_DESCR("Stop process at execve(2)"), sysctl_proc_stop, 0, NULL, 0, CTL_PROC, PROC_CURPROC, PROC_PID_STOPEXEC, CTL_EOL); sysctl_createv(clog, 0, NULL, NULL, CTLFLAG_PERMANENT|CTLFLAG_READWRITE|CTLFLAG_ANYWRITE, CTLTYPE_INT, "stopexit", SYSCTL_DESCR("Stop process before completing exit"), sysctl_proc_stop, 0, NULL, 0, CTL_PROC, PROC_CURPROC, PROC_PID_STOPEXIT, CTL_EOL); }
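As for the FreeBSD-style clamping mentioned in the calcru() comment above,
the idea can be sketched in a few lines.  This is a userland illustration
with hypothetical latch variables (FreeBSD keeps the equivalents in its
struct proc; NetBSD has no such fields today):

#include <sys/time.h>

/*
 * Hypothetical per-process high-water marks; illustration only.
 */
static struct timeval prev_utime, prev_stime;

/*
 * Never report a user/system time lower than one previously reported:
 * latch each new result against the stored maximum.
 */
static void
clamp_monotonic(struct timeval *cur, struct timeval *prev)
{
	if (timercmp(cur, prev, <))
		*cur = *prev;	/* stick at the previous, larger value */
	else
		*prev = *cur;	/* raise the high-water mark */
}

calcru() would then call something like clamp_monotonic(&utv, &prev_utime)
(and likewise for system time) just before copying the results out.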
#include <sys/cdefs.h>
#include <sys/types.h>
#include <sys/resource.h>
#include <sys/time.h>

#include <err.h>
#include <errno.h>
#include <limits.h>
#include <stdbool.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sysexits.h>
#include <unistd.h>

/*
 * WARNING: time can appear to have gone backwards with getrusage(2)!
 *
 * See NetBSD Problem Report #30115 (and PR#10201).
 * See FreeBSD Problem Report #975 (and PR#10402).
 *
 * Problem has existed in all *BSDs since 4.4BSD if not earlier.
 *
 * Only FreeBSD has implemented a "fix" (as of rev.1.45 (svn r44725) of
 * kern_resource.c (etc.) on April 13, 1999)
 *
 * But maybe it is even worse than that -- distribution of time between user
 * and system doesn't seem to match reality!
 *
 * See the GNU MP Library (GMP)'s tune/time.c code for better timing?
 */

#if defined(__APPLE__)
# define MILLIONS	10	/* my mac is way faster! :-) */
#else
# define MILLIONS	1
#endif

static unsigned int iter = MILLIONS * 1000000UL;

char *argv0 = "progname";

/*
 * for info about the worker algorithms used here see:
 *
 * <URL:http://graphics.stanford.edu/~seander/bithacks.html>
 */

/*
 * do nothing much, but make sure you do it!
 */
unsigned int nullfunc(unsigned long);
unsigned int
nullfunc(unsigned long v)
{
	volatile unsigned int bc = (unsigned int) v;

	return bc;
}

/*
 * return the number of bits set to one in a value
 *
 * old-fashioned bit-by-bit bit-twiddling....  very slow!
 */
unsigned int count_ul_bits(unsigned long);
unsigned int
count_ul_bits(unsigned long v)
{
	unsigned int c;

	c = 0;
	/*
	 * we optimize away any high-order zero'd bits...
	 */
	while (v) {
		c += (v & 1);
		v >>= 1;
	}

	return c;
}

/*
 * return the number of bits set to one in a value
 *
 * Subtraction of 1 from a number toggles all the bits (from right to left)
 * up to and including the rightmost set bit.
 *
 * So, if we decrement a number by 1 and do a bitwise and (&) with itself
 * ((n-1) & n), we will clear the rightmost set bit in the number.
 *
 * Therefore if we do this in a loop and count the number of iterations then
 * we get the count of set bits.
 *
 * Executes in O(n) operations where n is the number of bits set to one in a
 * given integer value.
 */
unsigned int countbits_sparse(unsigned long);
unsigned int
countbits_sparse(unsigned long v)
{
	volatile unsigned int bc = 0;

	while (v) {
		v &= v - 1;	/* clear the least significant bit set to one */
		bc++;
	}

	return bc;
}

unsigned int countbits_dense(unsigned long);
unsigned int
countbits_dense(unsigned long v)
{
	volatile unsigned int bc = sizeof(v) * CHAR_BIT;

	v ^= (unsigned long) -1;
	while (v) {
		v &= v - 1;	/* clear the least significant bit set to one */
		bc--;
	}

	return bc;
}

/*
 * most efficient non-lookup variant from the URL above....
 */
#define COUNT_BITS(T, x, c)						\
	{								\
		T n = (x);						\
									\
		n = n - ((n >> 1) & (T)~(T)0/3);			\
		n = (n & (T)~(T)0/15*3) + ((n >> 2) & (T)~(T)0/15*3);	\
		n = (n + (n >> 4)) & (T)~(T)0/255*15;			\
		c = (T)(n * ((T)~(T)0/255)) >> (sizeof(T) - 1) * CHAR_BIT; \
	}

unsigned int count_bits(unsigned long);
unsigned int
count_bits(unsigned long v)
{
	volatile unsigned int c;

	COUNT_BITS(unsigned long, v, c)

	return c;
}

#define MAX_STRLEN_OCTAL(t)	((int) ((sizeof(t) * CHAR_BIT / 3) + 2))

/* XXX see also timevalsub() */
suseconds_t difftval(struct timeval, struct timeval);
suseconds_t
difftval(struct timeval tstart, struct timeval tend)
{
	tend.tv_sec -= tstart.tv_sec;
	tend.tv_usec -= tstart.tv_usec;

	while (tend.tv_usec < 0) {
		tend.tv_sec--;
		tend.tv_usec += 1000000;
	}
	while (tend.tv_usec >= 1000000) {
		tend.tv_sec++;
		tend.tv_usec -= 1000000;
	}

	return (suseconds_t) ((tend.tv_sec * 1000000) + tend.tv_usec);
}

suseconds_t microtime(void);

/*
 * microtime() - return number of microseconds since some epoch
 *
 * the particular epoch is irrelevant -- we just use the difference between
 * two of these samples taken sufficiently far apart that the resolution is
 * relatively unimportant, though better than 1 second is expected....
 */

/*
 * Timing anomalies
 *
 * time(1) uses gettimeofday() to show the "real" time, by which it means the
 * wall-clock time it took to run the process, including the time to do the
 * vfork() and execvp(), ignore some signals, and call wait4().
 *
 * However currently on NetBSD we can see getrusage() report a total of
 * system plus user time of as much as 0.06 seconds longer than
 * gettimeofday() says it took for the whole thing!  E.g.:
 *
 *	$ /usr/bin/time -p false
 *	real         0.00
 *	user         0.03
 *	sys          0.03
 *
 * Furthermore gettimeofday() can wander, e.g. due to NTP, or worse.
 *
 * Use clock_gettime(CLOCK_MONOTONIC, tspec) instead if possible!
 */
#ifdef CLOCK_MONOTONIC

suseconds_t
microtime()
{
	struct timespec tsnow;

	(void) clock_gettime(CLOCK_MONOTONIC, &tsnow);

	return (suseconds_t) ((tsnow.tv_sec * 1000000) + (tsnow.tv_nsec / 1000));
}

#else /* !CLOCK_MONOTONIC */

/*
 * XXX this is currently for Darwin/Mac OS X, which does not implement the
 * POSIX (IEEE Std 1003.1b-1993) clock_gettime() API
 *
 * Note that on OS X the gettimeofday() function is implemented in libc as a
 * wrapper to either the _commpage_gettimeofday() function, if available, or
 * the normal system call.  If using the COMMPAGE helper then gettimeofday()
 * simply returns the value stored in the COMMPAGE and thus can execute
 * without a context switch.
 */

suseconds_t
microtime()
{
	struct timeval tvnow;

	(void) gettimeofday(&tvnow, (void *) NULL);

	return (suseconds_t) ((tvnow.tv_sec * 1000000) + tvnow.tv_usec);
}

#endif /* CLOCK_MONOTONIC */

void show_time(char *, suseconds_t, suseconds_t, suseconds_t);
void
show_time(char *fname, suseconds_t us_u, suseconds_t us_s, suseconds_t us_c)
{
	suseconds_t us_w = (us_c - (us_s + us_u));
	double pc_u = (double) us_u / (double) iter;
	double pc_s = (double) us_s / (double) iter;
	double pc_w = (double) us_w / (double) iter;
	double pc_c = (double) us_c / (double) iter;

	/*
	 * note in the calculation of us_w above that wall clock elapsed time
	 * (us_c) is expected to be longer than the sum of user (us_u) and
	 * system (us_s) time, and we will display the difference as "wait"
	 * time, suggesting the amount of time the process was waiting for
	 * the CPU (shown here per call)
	 */
	printf("%18s = %5.4f us/c user, %7.4f us/c sys, %5.4f us/c wait, %7.4f us/c wall\n",
	       fname, pc_u, pc_s, pc_w, pc_c);
}

void usage(void) __dead;
void
usage()
{
	fprintf(stderr, "Usage: %s [-t] [-i millions_of_iterations]\n", argv0);
	fprintf(stderr, "-t: don't run the verbose proof of concept test -- just do timing runs\n");
	fprintf(stderr, "(default iterations: %lu * 10^6)\n", iter / 1000000UL);
	exit(EX_USAGE);
}

bool dotest = true;

extern char *optarg;
extern int optind;
extern int optopt;
extern int opterr;
extern int optreset;

int main(int, char *[]);

int
main(int argc, char *argv[])
{
	int ch;
	size_t i;
	struct rusage rus;
	struct rusage ruf;
#ifdef CLOCK_MONOTONIC
	struct timespec res;
#endif
	suseconds_t nulltime_u;
	suseconds_t nulltime_s;
	suseconds_t nulltime_w;
	suseconds_t timetime_u;
	suseconds_t timetime_s;
	suseconds_t timetime_w;
	suseconds_t totaltime_u;
	suseconds_t totaltime_s;
	suseconds_t totaltime_w;

	argv0 = (argv0 = strrchr(argv[0], '/')) ? argv0 + 1 : argv[0];

	optind = 1;		/* Start options parsing */
	opterr = 0;		/* I'll print my own errors! */
	while ((ch = getopt(argc, argv, ":hi:t")) != -1) {
		long ltmp;
		char *ep;

		switch (ch) {
		case 'i':
			/*
			 * extremely pedantic parameter evaluation
			 */
			errno = 0;
			ltmp = strtol(optarg, &ep, 0);
			if (ep == optarg) {
				err(EX_USAGE,
				    "-%c param of '%s' is not a valid number",
				    optopt, optarg);
			}
			if (*ep) {
				err(EX_USAGE,
				    "-%c param of '%s' has unsupported trailing unit specification characters",
				    optopt, optarg);
			}
			if (errno != 0) {
				err(EX_USAGE,
				    "-%c param of '%s' is not convertible: %s",
				    optopt, optarg, strerror(errno));
			}
			if (ltmp > INT_MAX) {
				err(EX_USAGE,
				    "-%c param of %ld is too large (must be <= %d)",
				    optopt, ltmp, INT_MAX);
			}
			if (ltmp < 1) {
				err(EX_USAGE,
				    "-%c param of %ld is too small (must be > 0)",
				    optopt, ltmp);
			}
			iter = (unsigned int) ltmp * 1000000UL;
			break;

		case 't':
			dotest = false;
			break;

		case 'h':
			usage();

		case '?':
			warnx("unknown option -- '%c'", optopt);
			usage();

		case ':':
			/*
			 * NOTE: a leading ':' in optstring causes getopt() to
			 * return a ':' when an option is missing its
			 * parameter.
			 */
			warnx("missing parameter for -%c", optopt);
			usage();

		default:
			warnx("programming error, unhandled flag: %c", ch);
			abort();
		}
	}
	argc -= optind;
	argv += optind;
	if (argc) {
		usage();
	}

	/* show that they all work.... */
	for (i = (dotest ? 1 : INT_MAX);
	     i < (sizeof(unsigned long) * CHAR_BIT);
	     i++) {
		unsigned long v = 1UL << i;
		unsigned int c;

		COUNT_BITS(unsigned long, v/2+1, c)
		printf("%#-*lo (v/2+1) = %d, %d, %d, %d\n",
		       MAX_STRLEN_OCTAL(typeof(v)), v/2+1,
		       countbits_sparse(v/2+1), countbits_dense(v/2+1),
		       count_ul_bits(v/2+1), c);
		COUNT_BITS(unsigned long, v-2, c)
		printf("%#-*lo (v-2) = %d, %d, %d, %d\n",
		       MAX_STRLEN_OCTAL(typeof(v)), v-2,
		       countbits_sparse(v-2), countbits_dense(v-2),
		       count_ul_bits(v-2), c);
		COUNT_BITS(unsigned long, v-1, c)
		printf("%#-*lo (v-1) = %d, %d, %d, %d\n",
		       MAX_STRLEN_OCTAL(typeof(v)), v-1,
		       countbits_sparse(v-1), countbits_dense(v-1),
		       count_ul_bits(v-1), c);
		COUNT_BITS(unsigned long, v, c)
		printf("%#-*lo (%2d bits) = %d, %d, %d, %d\n",
		       MAX_STRLEN_OCTAL(typeof(v)), v, (int) i,
		       countbits_sparse(v), countbits_dense(v),
		       count_ul_bits(v), c);
		puts("--------");
	}

#ifdef CLOCK_MONOTONIC	/* XXX "#ifdef CLOCK_PROCESS_CPUTIME_ID"??? */
	if (clock_getres(CLOCK_MONOTONIC, &res) == -1) {
		err(EXIT_FAILURE, "clock_getres(CLOCK_MONOTONIC)");
	}
	warnx("using CLOCK_MONOTONIC timer with resolution: %ld s, %ld ns",
	      res.tv_sec, res.tv_nsec);
#endif

	warnx("now running each algorithm for %u iterations....", iter);

	/*
	 * We will see from below that on NetBSD (and others except maybe
	 * FreeBSD) getrusage() can report _RADICALLY_ different amounts of
	 * both _user_ time and system time for the exact same code to run!
	 *
	 * This is over and above the occasional appearance of one or the
	 * other of these times appearing to go backwards (since they are
	 * both calculated by dividing the amount of total run time between
	 * them based on the number of statclock tick "hits" which occurred
	 * during that runtime, and based on whether each hit was taken
	 * while in kernel mode or in user mode (or interrupt mode)).
	 */

#define START_CLOCKS(v) do {						\
	v##_w = microtime();						\
	getrusage(RUSAGE_SELF, &rus);					\
} while (0)

#define STOP_CLOCKS(v) do {						\
	getrusage(RUSAGE_SELF, &ruf);					\
	v##_w = microtime() - v##_w;					\
	v##_u = difftval(rus.ru_utime, ruf.ru_utime);			\
	v##_s = difftval(rus.ru_stime, ruf.ru_stime);			\
} while (0)

	/* time time() */
	START_CLOCKS(timetime);
	for (i = 0; i < iter; i++) {
		(void) time((time_t *) NULL);
	}
	STOP_CLOCKS(timetime);
	show_time("time()", timetime_u, timetime_s, timetime_w);

	/* time nullfunc() */
	START_CLOCKS(nulltime);
	for (i = 0; i < iter; i++) {
		nullfunc((unsigned long) time((time_t *) NULL));
	}
	STOP_CLOCKS(nulltime);
	show_time("nulltime()", nulltime_u, nulltime_s, nulltime_w);
	/* note: leave nulltime_* as sum of nullfunc() time and time() time */

	/* time countbits_sparse() */
	START_CLOCKS(totaltime);
	for (i = 0; i < iter; i++) {
		countbits_sparse((unsigned long) time((time_t *) NULL));
	}
	STOP_CLOCKS(totaltime);
	show_time("countbits_sparse()", totaltime_u, totaltime_s, totaltime_w);

	/* time countbits_dense() */
	START_CLOCKS(totaltime);
	for (i = 0; i < iter; i++) {
		countbits_dense((unsigned long) time((time_t *) NULL));
	}
	STOP_CLOCKS(totaltime);
	show_time("countbits_dense()", totaltime_u, totaltime_s, totaltime_w);

	/* time COUNT_BITS() */
	START_CLOCKS(totaltime);
	for (i = 0; i < iter; i++) {
		unsigned int c;

		COUNT_BITS(unsigned long, (unsigned long) time((time_t *) NULL), c)
	}
	STOP_CLOCKS(totaltime);
	show_time("COUNT_BITS()", totaltime_u, totaltime_s, totaltime_w);

	/* time count_bits() */
	START_CLOCKS(totaltime);
	for (i = 0; i < iter; i++) {
		count_bits((unsigned long) time((time_t *) NULL));
	}
	STOP_CLOCKS(totaltime);
	show_time("count_bits()", totaltime_u, totaltime_s, totaltime_w);

	/* time count_ul_bits() */
	START_CLOCKS(totaltime);
	for (i = 0; i < iter; i++) {
		count_ul_bits((unsigned long) time((time_t *) NULL));
	}
	STOP_CLOCKS(totaltime);
	show_time("count_ul_bits()", totaltime_u, totaltime_s, totaltime_w);

	exit(0);
}
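On the CPUID question raised at the top: the usual Intel recommendation is
to issue a serializing CPUID immediately before RDTSC so that earlier
instructions cannot be reordered past the read.  A minimal GCC
inline-assembly sketch for x86 follows; this is illustrative only, and is
not how NetBSD's cpu_counter32() is actually implemented:

#include <stdint.h>

/*
 * Read the low 32 bits of the TSC with a serializing CPUID in front
 * (per Intel's benchmarking recommendations).  Sketch only.
 */
static inline uint32_t
rdtsc32_serialized(void)
{
	uint32_t lo, hi;

	__asm__ __volatile__(
	    "cpuid\n\t"		/* serialize the instruction stream */
	    "rdtsc"		/* read the time-stamp counter */
	    : "=a" (lo), "=d" (hi)
	    : "a" (0)
	    : "ebx", "ecx", "memory");
	(void)hi;
	return lo;
}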