sarnold | welcome to the fourth day of umeet presentations :) |
sarnold | today's first speaker is william lee irwin III
wli | So for general background, generalized pid hashing is |
supposed to improve the performance of in-kernel algorithms so |
they scale better with the number of tasks. |
sarnold | (please direct questions and comments to #qc; |
thanks) |
wli | In general, larger systems tend to run things like forking |
servers or run many different servers (possibly from different users) |
simultaneously. |
wli | Almost all of the 2.5.x alterations done for this deal |
directly with very traditional computational complexity issues from |
basic algorithmic analysis. |
wli | So to begin looking further in-depth, let's examine some |
2.4.x code that was seen to have issues. |
wli | (not just ones directly addressed by pidhashing, but |
ones that involve searches over all tasks) |
wli | (1) count_active_tasks() |
wli | This is trying to compute the load average. |
wli | In 2.4.x, the body of this function is: |
wli | for_each_task(p) {
wli |         if ((p->state == TASK_RUNNING ||
wli |              (p->state & TASK_UNINTERRUPTIBLE)))
wli |                 nr += FIXED_1;
wli | }
wli | The notion is simple: count up the number of tasks that |
are either TASK_RUNNING or TASK_UNINTERRUPTIBLE |
wli | the 2.5.x code does the same thing, but implements it |
using a different principle: |
wli | "Keep count, and when asked, just hand back the |
number." |
wli | <grimoire:#qc> this may be silly, but what is FIXED_1 defined as?
wli | FIXED_1 is a constant used for fixed point arithmetic. |
wli | It represents 1. |
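[ed: for reference, the fixed-point constants from include/linux/sched.h; the load average carries 11 bits of fractional precision.]
    #define FSHIFT   11              /* nr of bits of precision */
    #define FIXED_1  (1 << FSHIFT)   /* 1.0 in fixed point, i.e. 2048 */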
wli | There is a tradeoff in this strategy: changing the state |
has new accounting overhead, but examining the state is much |
cheaper. |
wli | The new accounting overhead takes the form of |
incrementing and decrementing counters and moving things |
between lists and so on. |
wli | count_active_tasks() is actually instructive in another way: it has been seen to be a reliability problem with increased clock rates on SMP servers (or at least I've triggered the problem in testing).
wli | This is because the timer tick happens often, but the |
time spent in it depends on the number of tasks |
wli | So when it takes long enough to scan all tasks, you |
experience something called "timeslice overrun". |
wli | This kind of problem is not unique to that function. |
wli | (2) sys_setpgid() |
wli | This is not a frequently exercised function, or at least not |
usually considered performance critical, but the 2.4.x |
implementation shows some of the tradeoffs. |
wli | If I may back up by a small margin: |
wli | <jmgv:#qc> but when do you update the task counter? i supposed each task adds 1 when it's born and subtracts 1 before it's killed... isn't it?
wli | In general, alterations to task->state are done to pass a |
sort of implicit argument to the scheduler. |
wli | So basically, the scheduler increments and decrements |
the counter for TASK_RUNNING when things sleep or wake, and |
increments and decrements a second counter for |
TASK_UNINTERRUPTIBLE when things sleep and wake. In |
principle a single counter could be used, but TASK_RUNNING
wants to be counted in isolation. |
wli | This is 2.5.x's nr_uninterruptible() and rq->nr_uninterruptible
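[ed: a hedged sketch of where that accounting happens, modeled on 2.5.x kernel/sched.c; abridged from memory, so treat the details as approximate.]
    /* task is going to sleep: drop it from the runqueue and account */
    static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
    {
            rq->nr_running--;
            /* count uninterruptible sleepers so nr_uninterruptible()
             * can answer later without scanning the tasklist */
            if (p->state == TASK_UNINTERRUPTIBLE)
                    rq->nr_uninterruptible++;
            dequeue_task(p, p->array);
            p->array = NULL;
    }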
wli | So back to (2) |
wli | The critical part of this to examine is: |
wli | if (pgid != pid) {
wli |         struct task_struct * tmp;
wli |         for_each_task (tmp) {
wli |                 if (tmp->pgrp == pgid &&
wli |                     tmp->session == current->session)
wli |                         goto ok_pgid;
wli |         }
wli |         goto out;
wli | }
wli | At this point we see that there are 2 checks happening: |
wli | (A) checking for someone already having a given |
process group ID |
wli | (B) checking for someone being in the same session |
wli | It makes one think "Wouldn't it be nice if there were |
some other way to find them?" |
wli | In 2.5.x things are a bit different |
wli | The old infrastructure in 2.4.x has a hashtable only for |
task->pid |
wli | In 2.5.x now there are hashtables for task->pid, |
task->tgid, task->pgrp, and task->session |
wli | That same code looks like: |
wli | for_each_task_pid(pgid, PIDTYPE_PGID, p, l, pid)
wli |         if (p->session == current->session)
wli |                 goto ok_pgid;
wli | <seletz:#qc> maybe a dumb question: What do those hash tables contain? key and value, i mean.
wli | Okay, this question will help explain what goes on in that code.
wli | There is a "struct pid" representing a given "process id" |
wli | The notion of "process id" is a generalization, so it |
doesn't just mean PID, it includes PID, PGID, TGID, and SID. |
wli | The struct pid has inside of it a list of all the tasks sharing |
the same value of that ID. |
wli | For instance, suppose there were 20 tasks in a given process group.
wli | The struct pid for that pgrp will show up in the |
hashtable's bucket lists, and you look at the struct pid and walk the |
list inside it. |
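[ed: the structure under discussion, as in 2.5.x include/linux/pid.h; the field comments are editorial annotations.]
    struct pid
    {
            int nr;                      /* the numeric ID value */
            atomic_t count;              /* references from tasks */
            struct task_struct *task;    /* one task using this value */
            struct list_head task_list;  /* all tasks sharing the value */
            struct list_head hash_chain; /* links in the hashtable bucket */
    };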
wli | This should give you an idea of what's in |
for_each_task_pid() already. |
wli | #define for_each_task_pid(who, type, task, elem, pid)         \
wli |         if ((pid = find_pid(type, who)))                      \
wli |                 for (elem = pid->task_list.next,              \
wli |                      prefetch(elem->next),                    \
wli |                      task = pid_task(elem, type);             \
wli |                      elem != &pid->task_list;                 \
wli |                      elem = elem->next, prefetch(elem->next), \
wli |                      task = pid_task(elem, type))
wli | pid = find_pid(type, who) /* find_pid() does hashtable |
lookup, and finds a struct pid that matches */ |
wli | specifically, find_pid() takes as the first argument what |
kind of pid (PGID/SID/etc.), and as its second argument the value. |
wli | This is used to start up the loop. |
wli | The actual looping part looks at pid->task_list |
wli | and is essentially a list_for_each_entry() |
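[ed: pid_task() is just a list_entry() wrapper (from 2.5.x include/linux/pid.h), which is why the macro amounts to a list_for_each_entry() over pid->task_list:]
    #define pid_task(elem, type) \
            list_entry(elem, struct task_struct, pids[type].pid_chain)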
wli | You'll notice that in a struct pid there are 2 sets of list |
links: |
wli | One for the hashtable and one for the list of tasks: this is |
to help deal with sharing. |
wli | Now this is where the principle I mentioned about state |
changing overhead vs. state examination overhead comes into play: |
wli | These hashtables have to be updated at fork(), exec(), |
and exit(). |
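[ed: a sketch of the fork()-time update, modeled on 2.5.x kernel/pid.c attach_pid(); reproduced from memory, so treat the details as approximate.]
    int attach_pid(task_t *task, enum pid_type type, int nr)
    {
            struct pid *pid = find_pid(type, nr);

            if (pid)
                    /* the value is already in use: join its task list */
                    atomic_inc(&pid->count);
            else {
                    /* first user of the value: initialize the struct pid
                     * embedded in the task, and hash it */
                    pid = &task->pids[type].pid;
                    pid->nr = nr;
                    atomic_set(&pid->count, 1);
                    INIT_LIST_HEAD(&pid->task_list);
                    pid->task = task;
                    get_task_struct(task);
                    list_add(&pid->hash_chain,
                             &pid_hash[type][pid_hashfn(nr)]);
            }
            list_add_tail(&task->pids[type].pid_chain, &pid->task_list);
            task->pids[type].pidptr = pid;

            return 0;
    }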
wli | There is another class of side-effects: |
wli | The locking changes in some (less important) cases. |
wli | sys_setpgid() is one of those cases. |
wli | In 2.4.x there was no structure associated with pgrp's |
that had to be altered when they change, so 2.4.x took the |
tasklist_lock for reading. |
wli | In 2.5.x there is, so the tasklist_lock must be taken for |
writing. |
wli | This is a different tradeoff: certain operations are now |
less parallelizable, though they are more efficient in serial. |
wli | The general tendency with the operations that were serialized in 2.5.x is that they are "uncommon" operations, ones that aren't usually requested often or considered performance-critical, whereas the operations that were sped up were either not seriously affected by the accounting overhead or drastically accelerated, and were already performed in serial.
wli | There was another side-effect, and in fact, it was used to |
promote this infrastructure for the kernel: |
wli | This accounting provides explicit notification of when a |
pid is no longer in use. |
wli | At first glance, this doesn't seem terribly important, but |
there is a particular operation that would like to know. |
wli | During fork(), a process needs to be assigned a unique |
ID. This is called "pid allocation". |
wli | The 2.4.x pid allocator has a basic strategy of "scan the |
tasklist and check to see if the ID I guessed at the beginning is |
actually good". |
wli | It's actually a somewhat refined, but not very |
deterministic algorithm. |
wli | The 2.4.x pid allocator maintains a "window" of safe |
pid's, and moves, grows, and shrinks this window as it allocates |
pid's and scans the tasklist. |
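[ed: an abridged sketch of that window logic, after 2.4.x kernel/fork.c get_pid(); locking and some wrap handling are omitted, so treat it as illustrative. next_safe is the upper edge of the window of known-free values; only when last_pid runs into it does the O(tasks) rescan happen, and each collision restarts the scan, which is where the O(tasks**2) worst case comes from.]
    static int get_pid(void)
    {
            static int next_safe = PID_MAX;
            struct task_struct *p;

            if ((++last_pid) & 0xffff8000)
                    last_pid = 300;         /* wrapped: skip low pids */
            if (last_pid >= next_safe) {
                    next_safe = PID_MAX;
    repeat:
                    for_each_task(p) {
                            if (p->pid == last_pid || p->pgrp == last_pid ||
                                p->tgid == last_pid || p->session == last_pid) {
                                    if (++last_pid >= next_safe)
                                            next_safe = PID_MAX;
                                    goto repeat;    /* rescan from scratch */
                            }
                            /* shrink the window to the next used value */
                            if (p->pid > last_pid && next_safe > p->pid)
                                    next_safe = p->pid;
                    }
            }
            return last_pid;
    }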
wli | In the "best" case, this results in O(tasks) performance, |
but in the worst, it's O(tasks**2). And it's actually provable that over a |
sequence of operations, things average out so that even though |
some times it may have taken O(tasks**2) time, it averages out to |
O(tasks). |
wli | This is in general okay, though it's more questionable |
on SMP: |
wli | the trouble is that its worst case can be severe enough when there are enough tasks that deadlock detection mechanisms, like the NMI watchdog, can be set off by competing things attempting to write_lock_irq(&tasklist_lock).
wli | The 2.5.x pid allocator uses the hashtable updates to |
help. |
wli | void detach_pid(task_t *task, enum pid_type type)
wli | {
wli |         int nr = __detach_pid(task, type);
wli |         if (!nr)
wli |                 return;
wli |         for (type = 0; type < PIDTYPE_MAX; ++type)
wli |                 if (find_pid(type, nr))
wli |                         return;
wli |         free_pidmap(nr);
wli | }
wli | The for () loop is the important part here. |
wli | It is doing "for all types of process ID's, if we can find this |
ID number in use, return without doing anything" |
wli | If the ID number is completely unused, it crosses it off |
the bitmap. |
wli | This bitmap has an entry for each possible pid number, |
and if the bit is set, the ID number is in use, and if it's not set, the |
number is not in use. |
wli | <sarnold:#qc> a 32k bitmap for pids? |
wli | Actually no |
wli | The PID space was also enlarged in order to be able to |
run more processes. |
wli | The maximum allowable PID space is 4M (4*1024*1024) entries, and its size is controllable via sysctl.
wli | There are actually 2 levels to the map:
wli | The first level points at pages, keeping a count of the free pid's within each.
wli | The second level is the pages themselves, whose bits are used to track individual pid's.
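[ed: the two levels, roughly as declared in 2.5.x kernel/pid.c; a hedged sketch with initialization details omitted.]
    #define BITS_PER_PAGE   (PAGE_SIZE * 8)
    #define PIDMAP_ENTRIES  ((PID_MAX_LIMIT + BITS_PER_PAGE - 1) / \
                             BITS_PER_PAGE)

    typedef struct pidmap {
            atomic_t nr_free;  /* level 1: count of free pids in the page */
            void *page;        /* level 2: one bit per pid value */
    } pidmap_t;

    static pidmap_t pidmap_array[PIDMAP_ENTRIES];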
wli | <chbm:#qc> uint32 |
wli | <chbm:#qc> why not use it all :) |
wli | The only reason is that in general the space overhead |
of enough tasks to utilize a PID space that large renders actually |
using it all infeasible, at least on 32-bit. |
wli | For 64-bit machines with much memory a sufficient |
number of tasks to utilize it may be feasible, but it isn't useful due to |
the overhead. |
wli | It's large enough to accommodate the numbers of tasks |
being demanded by real workloads, which is usually not much |
more than 100K in the largest cases, and since the scanning is O(1) |
(I'll get back to that in a while), the size of the space is mostly |
arbitrary. |
wli | You'll notice that in alloc_pidmap(): |
wli | scan_more:
wli |         offset = find_next_zero_bit(map->page, BITS_PER_PAGE, offset);
wli |         if (offset >= BITS_PER_PAGE)
wli |                 goto next_map;
wli |         if (test_and_set_bit(offset, map->page))
wli |                 goto scan_more;
wli | This means that it is actually walking the bitmap and (in theory) taking O(sizeof(pidspace)); in practice the cache effects dominate, the algorithm vastly outperforms direct tasklist scanning, and it is expected O(1).
wli | With some additional indexing space overhead, it could |
be made worst-case O(1), but any gain in the average case is likely |
to be marginal. |
wli | Future directions: |
wli | The remaining tasklist scans aren't as performance |
critical as the ones above. |
wli | In the core, there is an uncommon operation, which is |
to set the priority of all tasks belonging to a given user, and several |
error paths that scan for tasks having a given mm. |
wli | Also global capability mask setting, and sending a |
signal to all tasks. |
wli | The mm scanning for dumping core and OOM killing is not considered very useful to optimize (by me), and the uid scanning is for such a rare operation that it's unclear whether anything would benefit enough to justify the overhead of maintaining the list links.
wli | Some "miscellaneous" other cases also occur: |
wli | netfilter's ipt_owner module scans the tasklist trying to |
find a task with a matching task->comm |
wli | This actually can become performance critical for |
machines supporting network-intensive workloads using ipt_owner, |
but there is not quite enough infrastructure to deal with it possibly |
being a module. |
wli | There is also a very badly-behaved quadratic tasklist |
scan in /proc/ to simulate reading the root /proc/ directory, where |
there is not yet an acceptable solution. |
wli | One thing where there is a clear direction is task |
validation. |
wli | There is a new mechanism for reference counting task |
structures so that some guarantee can be had of a task structure |
remaining valid. |
wli | There are several parts of the kernel, notably vm86 and |
SBUS environment control drivers, that want to hold references to |
tasks, but the only method they (think) they have of verifying is |
comparing task structure pointers in a tasklist scan. |
wli | This doesn't actually work, because the structures are |
slab-allocated and can be reused by different tasks. |
wli | The solution is now to either hold a reference count on |
the task structure or to put in a hook in release_thread() |
wli | These are clear bugfixes, so there's no doubt about this |
strategy being useful. |
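[ed: a hedged usage sketch; find_task_by_pid(), get_task_struct(), and put_task_struct() are the real 2.5.x interfaces, while the surrounding code is illustrative.]
    struct task_struct *tsk;

    read_lock(&tasklist_lock);
    tsk = find_task_by_pid(pid);
    if (tsk)
            get_task_struct(tsk);   /* pin it: the slab object can't be
                                     * freed and recycled while held */
    read_unlock(&tasklist_lock);

    if (tsk) {
            /* ... safe to use tsk here, even if the task exits ... */
            put_task_struct(tsk);   /* drop the reference */
    }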
wli | There is an additional infrastructure that would |
potentially allow netfilter for instance to deal with the hashtable |
update problem: |
wli | That is, registering functions to be called at fork(), |
exec(), exit(), and other useful task-related events. |
wli | At the moment, there is not a wide variety of potential |
users, and given the unusual nature of the things that need it, there |
may never be. |
wli | But it has the ability to provide the infrastructure for |
facilities like SBUS environment control, vm86, and netfilter to be |
modules yet still efficient. |
wli | <cdub:#qc> do you still intend to provide a generic |
scanning function with a callback for comparison? |
wli | My intention there was to actually very strictly control the |
users of such an interface and limit its use solely to preapproved |
catastrophic events in order to prevent reintroductions of tasklist |
scanning algorithms or generally inefficient code; I'm a little less |
worked up about it these days, and sort of doubt that kind of |
admission control would be popular. |
wli | <cdub:#qc> and LSM has hooks in those spots that |
could do the updates. (similarly task ornaments would help, i think). |
wli | I was actually trying to plug task ornaments for this kind |
of stuff. =) |
wli | Reusing the LSM hooks for these events sounds very |
interesting, though it would potentially be difficult to justify putting |
something like ipt_owner's ->comm hashing needs into what's |
supposedly a security model... a sort of "rename game" might be |
useful to deal with that issue. |
wli | Another future direction which is very likely to actually |
happen is some method of reducing the statically allocated space |
footprint of the pidhashes. |
wli | And a further future direction would be to compare the somewhat more deterministic multilevel bitmap approaches against the current one, to determine whether the cache effects of always updating several levels hurt average performance in real workloads more than the gains in determinism in the current algorithm's worst cases are worth.
wli | I'm not really planning on investigating that unless I get |
really bored; the worst cases have been impossible to trigger. |
wli | Also note that on 32-bit machines with PAE (which requires 3-level pagetables), the 2nd-level pagetables (pmd's) are always instantiated.
wli | This means 12KB of overhead per task.
wli | 32K tasks * 12KB == 384MB, so pmd's consume a large fraction of lowmem at that point; so for a very important case, the 32K space is already enough.
wli | Last, but not least, I'd like to thank Ingo Molnar, whose |
bugfixes, cleanups, and promotion of these methods got them into |
the kernel. |
wli | Okay, thanks to everyone who showed up and listened, |
too. =) |
sarnold | thanks wli :) |
> clap clap clap clap clap clap clap clap clap clap |
sarnold | thanks also to garoeda, translating to dutch in #taee...
sarnold | vizard will translate to spanish later on, for the website...
> clap clap clap clap clap clap clap clap clap clap
gcc | wli: thanks for the talk..
fernand0 | plas plas plas plas plas plas plas plas
Oroz | thanks wli |
> clap clap clap clap clap clap clap clap clap clap |
zwane | plas plas plas =) |
erikm | klap klap klap klap klap |
wli | Anytime. I only hope it was interesting & informative |
enough to satisfy everyone. |
gcc | clap clap clap clap clap clap clap clap clap clap |
> excellent !!! |
> thank you very much !!! |
> clap clap clap clap clap clap clap clap clap clap |
Ducky_Pha | well done |
> clap clap clap clap clap clap clap clap clap clap |
apuigsech | clep clap plas clep clap plas plas plas clep clep
apuigsech | :) |
> clap clap clap clap clap clap clap clap clap clap
zwane | wli is drowned in a sea of 'plas'es |
sarnold | wli: so, a sysctl allows one to use upwards of 4M of |
unswappable memory for a pidbitmap? woah.. |
zwane | sarnold: hehe |
wli | 4M bits, that'd be 0.5MB |
sarnold | ahhhh |
zwane | wli: does -bk1-wli1 address this? |
wli | No it does not. |
wli | The enlarged space (> 32K) is actually useful, and it's |
relatively painful to allocate-on-demand. |
sarnold | wli: would you think there to be any value in pids |
that don't wrap? |
wli | sarnold: Mostly impossible to guarantee. Even 64-bit |
can wrap after sustained fork() & exit() activity. |
sarnold | wli: not hoping to run 4 billion processes at once |
but as a way of uniquely identifying a process over the lifetime of the |
machine? |
wli | You could use a 128-bit PID. |
zwane | sarnold: you mean never recycle? |
sarnold | zwane: yeah |
zwane | nyark |
sarnold | zwane: i'm thinking of it from an audit perspective |
or something... |
zwane | sarnold: i can understand your pov |
wli | Something slightly larger than 64-bit should actually get |
it above reasonable hardware lifetimes. |
sarnold | zwane: with the possibility that knowing the pids will |
never wrap might be useful internally too |
zwane | sarnold: pid+timestamp |
zwane | or some ugly variation of |
wli | sarnold: never recycling would fix a number of |
userspace races also |
sarnold | heh, though converting userspace to handle >30 |
character pids might not be fun :) |
chbm | if you say pid is now %Ld people are likely to shoot |
you :) |
wli | Guaranteeing a long enough interval between pid |
wraps is useful; 32K is actually so small that larger systems can |
cycle it in a matter of seconds or less. |
chbm | humm |
chbm | pid starvation is trivial then ? |
fernand0 | thank you very much for the talk |
wli | Not starvation per se. The pid space enlargement should handle that easily. The point is to make it so that resource scalability issues can't cause rapid wrapping due to high pidspace occupancy, and to make the space large enough that the computation required to cycle through an entire wrap takes "a while", e.g. several minutes. That's enough to put a lot less stress on userspace wrt. races involving pid's as unique identifiers.
wli | Yeah, yeah, it's a kernel fix to a userspace problem, but |
fixing all the pid-as-unique-id races in userspace is infeasible from |
my POV. |
sarnold | wli: you're probably right on that one :-/ |
sarnold | wli: too many $$ in shell scripts... |
apuigsech | wli, one question about linux kernel x86 GDT table... |
sarnold | garoeda: sent :) |
garoeda | i know, i was late :-( (i thought 19 utc and thus 20 local time, but it was already in local time on the conference list... i noticed, but too late)
apuigsech | why is null (not used) the second entry? |
garoeda | sarnold: ok :-) |
wli | apuigsech: Ask away. |
wli | I wonder if 90 minutes was about the right length. |
vizard | it is over? :( |
wli | yeah |
sarnold | vizard: yup |
vizard | damn job |
vizard | wli, i'll have to translate your talk later, i hope you didn't use too-technical words in it :)
sarnold | vizard: good luck :) |
> clap clap clap clap clap clap clap clap clap clap