Linux as an example On Linux systems, the load-average is not calculated on each clock tick, but driven by a variable value that is based on the frequency setting and tested on each clock tick. This setting defines the kernel clock tick rate in
hertz (times per second), and it defaults to 100 for ticks. Kernel activities use this number of ticks to time themselves. Specifically, the function (in loadavg.h, formerly sched.h), which calculates the load average, nominally runs every ticks, i.e. a tiny bit over . extern unsigned long avenrun[]; /* Load averages */ extern void get_avenrun(unsigned long *loads, unsigned long offset, int shift); • define FSHIFT 11 /* nr of bits of precision */ • define FIXED_1 (1= load) newload += FIXED_1-1; return newload / FIXED_1; } extern unsigned long calc_load_n(unsigned long load, unsigned long exp, unsigned long active, unsigned int n); • define LOAD_INT(x) ((x) >> FSHIFT) • define LOAD_FRAC(x) LOAD_INT(((x) & (FIXED_1-1)) * 100) The avenrun array contains 1-minute, 5-minute and 15-minute average. The function provides the correct update of the load-average for the default update rate of . void calc_global_load(void) { unsigned long sample_window; long active, delta; sample_window = READ_ONCE(calc_load_update); if (time_before(jiffies, sample_window + 10)) return; /* * Fold the 'old' NO_HZ-delta to include all NO_HZ CPUs. */ delta = calc_load_nohz_read(); if (delta) atomic_long_add(delta, &calc_load_tasks); active = atomic_long_read(&calc_load_tasks); active = active > 0 ? active * FIXED_1 : 0; avenrun[0] = calc_load(avenrun[0], EXP_1, active); avenrun[1] = calc_load(avenrun[1], EXP_5, active); avenrun[2] = calc_load(avenrun[2], EXP_15, active); WRITE_ONCE(calc_load_update, sample_window + LOAD_FREQ); /* * In case we went to NO_HZ for multiple LOAD_FREQ intervals * catch up in bulk. */ calc_global_nohz(); } Note this handling of NO_HZ CPUs. NO_HZ is a mode meant to reduce the number of scheduling-clock interrupts on idle processors, which improves power efficiency and reduces clock jitter. This would cause processors to miss their update ticks, however. As a result, the counting of tasks is done using atomic operations. The function handles the calculation in case of needing to catch up to multiple ticks, making use of this function: /* [1] application of the geometric series: * n 1 - x^(n+1) * S_n := \Sum x^i = ------------- * i=0 1 - x */ unsigned long calc_load_n(unsigned long load, unsigned long exp, unsigned long active, unsigned int n) { return calc_load(load, fixed_power_int(exp, FSHIFT, n), active); } Here (not included in article) raises exp to its nth power in fixed-point arithmetic.
Sampling and precision The "sampled" calculation of load averages is a somewhat common behavior;
FreeBSD, too, only refreshes the value every five seconds. The interval is usually taken to not be exact so that they do not collect processes that are scheduled to fire at a certain moment. This is the reason for "+1" in the Linux code from above; FreeBSD instead uses a pseudorandom offset added to the interval. also mentions that the use of 11 fractional bits in the above fixed point calculation prevents reducing the interval () much lower. For example, a two-second interval would result in the EXP values becoming 1981, 2034 and 2043, nearly saturating the precision available (0–2047). Ripke Klaus has shown in 2011 that the "+1" modification alone is not sufficient to avoid
Moiré artifacts from regularly-scheduled processes. His experiments suggest 4.61 to be a better value: 0.61 is close to the
golden ratio, which helps spread out the sample-point among fractional seconds. At the same time, 4.61 is close to , so the property of being an integer fraction of is maintained. would be more appropriate for varying values of HZ. The new values would be: • define LOAD_FREQ (60*HZ/13) /* 60/13 ~ 4.61 sec intervals */ • define EXP_1 1896 /* 1/exp(4.61sec/1min) = 1/exp(1/13) as fixed-point */ • define EXP_5 2017 /* 1/exp(4.61sec/5min) = 1/exp(1/13/5) */ • define EXP_15 2038 /* 1/exp(4.61sec/15min) = 1/exp(1/13/15) */
Calculation from userspace As we've discussed before, most kernels use
fixed-point arithmetic to calculate the load average for efficiency and simplicity. This however limits the achievable update-rate and precision. Given that we have established how to get the instantaneous load from userspace, it is also possible to calculate the load average from userspace. The following code does so in Python, using a
refresh rate of seconds, on time-spans ranging from 10 seconds to 1 hour: • !/usr/bin/env python3 import time import os from datetime import datetime from math import exp, log from dataclasses import dataclass LOGSYSPERIODS = [log(x) for x in [60, 300, 900 REFRESH_RATE = ((5**0.5) + 1) / 2 # Ripke's golden ratio PERIODS = [10, 30, 60, 120, 300, 900, 1800, 3600] COUNT_DISKWAIT = True # Whether to include disk wait in load calculation @dataclass class LoadEntry: avg: float exp: float def initialize_loads(now: int | float, lavgs: dict[int, LoadEntry]): """Initialize load averages based on linear interpolation of system load averages and instant load count.""" sys = getloadavg() slopes = ((sys[0] - now) / 60, (sys[1] - sys[0]) / 240, (sys[2] - sys[1]) / 600) for period in [10, 30, 60, 120, 300, 900, 1800, 3600]: exp_factor = exp(-REFRESH_RATE / period) if period None: for _, entry in lavgs.items(): entry.avg = entry.avg * entry.exp + current_load * (1 - entry.exp) if os.name == "posix": uname = os.uname()[0].lower() getloadavg = os.getloadavg if uname == "linux": def get_current_load() -> int: load = 0 with open("/proc/stat", "r") as f: for line in f: if line.startswith("procs_running "): load += int(line.split()[1]) # Read procs_running load -= 1 # Subtract one for ourself elif line.startswith("procs_blocked ") and COUNT_DISKWAIT: load += int(line.split()[1]) # Read procs_blocked return load else: PS_THREAD_OPTION = "-H" if os.uname()[0].lower().endswith("bsd") else "-M" PS_DISK_WAIT = "U" if os.uname()[0] == "Darwin" else "D" PS_STATES = ("R" + PS_DISK_WAIT) if COUNT_DISKWAIT else "R" def get_current_load() -> int: with os.popen(f"ps {PS_THREAD_OPTION}ax -o stat", "r") as f: states = map(f, lambda line: line.split()[-1]) # Get the last column. Required on macOS! return sum(1 if state[0] in PS_STATES else 0 for state in states) - 1 elif os.name == "nt": # It is possible to use Windows performance counters to get a queue length. # In fact, Microsoft recommends it as an additional metric for load in addition to CPU usage: # https://learn.microsoft.com/en-us/biztalk/technical-guides/using-the-performance-analysis-of-logs-pal-tool#processor-queue-length-analysis # Using it we can also obtain a load similar to Unix systems. from pyperfmon import pyperfmon pm = pyperfmon.pyperfmon() ncores = os.cpu_count() get_counter = lambda x: pm.getCounter(x) def get_current_load() -> float: return ( # Threads waiting for CPU, not running ones get_counter(r"System\Processor Queue Length") # Approximate number of threads using CPU + get_counter(r"Processor\_Total\% Processor Time") * ncores # Threads waiting for disk I/O + get_counter(r"PhysicalDisk\_Total\Current Disk Queue Length") if COUNT_DISKWAIT else 0 ) def getloadavg() -> tuple[float, float, float]: # Dummy implementation for Windows load = get_current_load() return (load, load, load) def main(): lavgs: dict[int, LoadEntry] = {} current_load = get_current_load() initialize_loads(current_load, lavgs) heading = ["SYSTIME", " CURR"] + [str(x) for x in PERIODS] print("\t".join(heading)) while True: # Always print before recalculating entries = [f"{datetime.now().strftime('%H:%M:%S')} {current_load:.4f}"] entries += [f"{entry.avg:.4f}" for entry in lavgs.values()] print("\t".join(entries), flush=True) # Sleep and wait. This assumes that the time used to update a new line is # minuscule compared to the time slept. If this is not the case, a # sleepuntil()-type reckoning should be used. time.sleep(REFRESH_RATE) # Print current_load = get_current_load() update_loads(lavgs, current_load) if __name__ == "__main__": main() == Other system performance commands ==