sarnold | welcome to the fourth day of umeet presentations :) |
sarnold | today's first speaker is william lee irwin III
wli | So for general background, generalized pid hashing is |
supposed to improve the performance of in-kernel algorithms so |
they scale better with the number of tasks. |
sarnold | (please direct questions and comments to #qc; |
thanks) |
wli | In general, larger systems tend to run things like forking |
servers or run many different servers (possibly from different users) |
simultaneously. |
wli | Almost all of the 2.5.x alterations done for this deal |
directly with very traditional computational complexity issues from |
basic algorithmic analysis. |
wli | So to begin looking further in-depth, let's examine some |
2.4.x code that was seen to have issues. |
wli | (not just ones directly addressed by pidhashing, but |
ones that involve searches over all tasks) |
wli | (1) count_active_tasks() |
wli | This is trying to compute the load average. |
wli | In 2.4.x, the body of this function is: |
wli | for_each_task(p) {
wli |         if ((p->state == TASK_RUNNING ||
wli |              (p->state & TASK_UNINTERRUPTIBLE)))
wli |                 nr += FIXED_1;
wli | }
wli | The notion is simple: count up the number of tasks that |
are either TASK_RUNNING or TASK_UNINTERRUPTIBLE |
wli | the 2.5.x code does the same thing, but implements it |
using a different principle: |
wli | "Keep count, and when asked, just hand back the |
number." |
wli | <grimoire:#qc> this may be silly, but what is FIXED_1 defined as?
wli | FIXED_1 is a constant used for fixed point arithmetic. |
wli | It represents 1. |
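[ed: for reference, the fixed-point constants from include/linux/sched.h; the load average carries 11 bits of fractional precision.]
    #define FSHIFT   11              /* nr of bits of precision */
    #define FIXED_1  (1 << FSHIFT)   /* 1.0 in fixed point, i.e. 2048 */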
wli | There is a tradeoff in this strategy: changing the state |
has new accounting overhead, but examining the state is much |
cheaper. |
wli | The new accounting overhead takes the form of |
incrementing and decrementing counters and moving things |
between lists and so on. |
wli | count_active_tasks() is actually instructive in another way: it has been seen to be a reliability problem with increased clock rates on SMP servers (or at least I've triggered the problem in testing).
wli | This is because the timer tick happens often, but the |
time spent in it depends on the number of tasks |
wli | So when it takes long enough to scan all tasks, you |
experience something called "timeslice overrun". |
wli | This kind of problem is not unique to that function. |
wli | (2) sys_setpgid() |
wli | This is not a frequently exercised function, or at least not |
usually considered performance critical, but the 2.4.x |
implementation shows some of the tradeoffs. |
wli | If I may back up by a small margin: |
wli | <jmgv:#qc> but when do you update the task counter? i supposed each task adds 1 when it's born and subtracts 1 before it's killed... isn't it?
wli | In general, alterations to task->state are done to pass a |
sort of implicit argument to the scheduler. |
wli | So basically, the scheduler increments and decrements |
the counter for TASK_RUNNING when things sleep or wake, and |
increments and decrements a second counter for |
TASK_UNINTERRUPTIBLE when things sleep and wake. In |
principle a single counter could be used, but TASK_RUNNING
wants to be counted in isolation. |
wli | This is 2.5.x's nr_uninterruptible() and rq->nr_uninterruptible
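[ed: a hedged sketch of where that accounting happens, modeled on 2.5.x kernel/sched.c; abridged from memory, so treat the details as approximate.]
    /* task is going to sleep: drop it from the runqueue and account */
    static inline void deactivate_task(struct task_struct *p, runqueue_t *rq)
    {
            rq->nr_running--;
            /* count uninterruptible sleepers so nr_uninterruptible()
             * can answer later without scanning the tasklist */
            if (p->state == TASK_UNINTERRUPTIBLE)
                    rq->nr_uninterruptible++;
            dequeue_task(p, p->array);
            p->array = NULL;
    }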
wli | So back to (2) |
wli | The critical part of this to examine is: |
wli | if (pgid != pid) {
wli |         struct task_struct * tmp;
wli |         for_each_task (tmp) {
wli |                 if (tmp->pgrp == pgid &&
wli |                     tmp->session == current->session)
wli |                         goto ok_pgid;
wli |         }
wli |         goto out;
wli | }
wli | At this point we see that there are 2 checks happening: |
wli | (A) checking for someone already having a given |
process group ID |
wli | (B) checking for someone being in the same session |
wli | It makes one think "Wouldn't it be nice if there were |
some other way to find them?" |
wli | In 2.5.x things are a bit different |
wli | The old infrastructure in 2.4.x has a hashtable only for |
task->pid |
wli | In 2.5.x now there are hashtables for task->pid, |
task->tgid, task->pgrp, and task->session |
wli | That same code looks like: |
wli | for_each_task_pid(pgid, PIDTYPE_PGID, p, l, pid)
wli |         if (p->session == current->session)
wli |                 goto ok_pgid;
wli | <seletz:#qc> maybe a dumb question: What do those hash tables contain? key and value, i mean.
wli | Okay, this question will help explain what goes on in that code.
wli | There is a "struct pid" representing a given "process id" |
wli | The notion of "process id" is a generalization, so it |
doesn't just mean PID, it includes PID, PGID, TGID, and SID. |
wli | The struct pid has inside of it a list of all the tasks sharing |
the same value of that ID. |
wli | For instance, suppose there were 20 tasks in a given process group.
wli | The struct pid for that pgrp will show up in the |
hashtable's bucket lists, and you look at the struct pid and walk the |
list inside it. |
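[ed: the structure under discussion, as in 2.5.x include/linux/pid.h; the field comments are editorial annotations.]
    struct pid
    {
            int nr;                      /* the numeric ID value */
            atomic_t count;              /* references from tasks */
            struct task_struct *task;    /* one task using this value */
            struct list_head task_list;  /* all tasks sharing the value */
            struct list_head hash_chain; /* links in the hashtable bucket */
    };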
wli | This should give you an idea of what's in |
for_each_task_pid() already. |
wli | #define for_each_task_pid(who, type, task, elem, pid)         \
wli |         if ((pid = find_pid(type, who)))                      \
wli |                 for (elem = pid->task_list.next,              \
wli |                      prefetch(elem->next),                    \
wli |                      task = pid_task(elem, type);             \
wli |                      elem != &pid->task_list;                 \
wli |                      elem = elem->next, prefetch(elem->next), \
wli |                      task = pid_task(elem, type))
wli | pid = find_pid(type, who) /* find_pid() does hashtable |
lookup, and finds a struct pid that matches */ |
wli | specifically, find_pid() takes as the first argument what |
kind of pid (PGID/SID/etc.), and as its second argument the value. |
wli | This is used to start up the loop. |
wli | The actual looping part looks at pid->task_list |
wli | and is essentially a list_for_each_entry() |
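[ed: pid_task() is just a list_entry() wrapper (from 2.5.x include/linux/pid.h), which is why the macro amounts to a list_for_each_entry() over pid->task_list:]
    #define pid_task(elem, type) \
            list_entry(elem, struct task_struct, pids[type].pid_chain)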
wli | You'll notice that in a struct pid there are 2 sets of list |
links: |
wli | One for the hashtable and one for the list of tasks: this is |
to help deal with sharing. |
wli | Now this is where the principle I mentioned about state |
changing overhead vs. state examination overhead comes into play: |
wli | These hashtables have to be updated at fork(), exec(), |
and exit(). |
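[ed: a sketch of the fork()-time update, modeled on 2.5.x kernel/pid.c attach_pid(); reproduced from memory, so treat the details as approximate.]
    int attach_pid(task_t *task, enum pid_type type, int nr)
    {
            struct pid *pid = find_pid(type, nr);

            if (pid)
                    /* the value is already in use: join its task list */
                    atomic_inc(&pid->count);
            else {
                    /* first user of the value: initialize the struct pid
                     * embedded in the task, and hash it */
                    pid = &task->pids[type].pid;
                    pid->nr = nr;
                    atomic_set(&pid->count, 1);
                    INIT_LIST_HEAD(&pid->task_list);
                    pid->task = task;
                    get_task_struct(task);
                    list_add(&pid->hash_chain,
                             &pid_hash[type][pid_hashfn(nr)]);
            }
            list_add_tail(&task->pids[type].pid_chain, &pid->task_list);
            task->pids[type].pidptr = pid;

            return 0;
    }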
wli | There is another class of side-effects: |
wli | The locking changes in some (less important) cases. |
wli | sys_setpgid() is one of those cases. |
wli | In 2.4.x there was no structure associated with pgrp's |
that had to be altered when they change, so 2.4.x took the |
tasklist_lock for reading. |
wli | In 2.5.x there is, so the tasklist_lock must be taken for |
writing. |
wli | This is a different tradeoff: certain operations are now |
less parallelizable, though they are more efficient in serial. |
wli | The general tendency with the operations that were serialized in 2.5.x is that they are "uncommon" operations, ones that aren't usually requested often or considered performance-critical, whereas the operations that were sped up were either not seriously affected by the accounting overhead or drastically accelerated, and were already performed in serial.
wli | There was another side-effect, and in fact, it was used to |
promote this infrastructure for the kernel: |
wli | This accounting provides explicit notification of when a |
pid is no longer in use. |
wli | At first glance, this doesn't seem terribly important, but |
there is a particular operation that would like to know. |
wli | During fork(), a process needs to be assigned a unique |
ID. This is called "pid allocation". |
wli | The 2.4.x pid allocator has a basic strategy of "scan the |
tasklist and check to see if the ID I guessed at the beginning is |
actually good". |
wli | It's actually a somewhat refined, but not very |
deterministic algorithm. |
wli | The 2.4.x pid allocator maintains a "window" of safe |
pid's, and moves, grows, and shrinks this window as it allocates |
pid's and scans the tasklist. |
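[ed: an abridged sketch of that window logic, after 2.4.x kernel/fork.c get_pid(); locking and some wrap handling are omitted, so treat it as illustrative. next_safe is the upper edge of the window of known-free values; only when last_pid runs into it does the O(tasks) rescan happen, and each collision restarts the scan, which is where the O(tasks**2) worst case comes from.]
    static int get_pid(void)
    {
            static int next_safe = PID_MAX;
            struct task_struct *p;

            if ((++last_pid) & 0xffff8000)
                    last_pid = 300;         /* wrapped: skip low pids */
            if (last_pid >= next_safe) {
                    next_safe = PID_MAX;
    repeat:
                    for_each_task(p) {
                            if (p->pid == last_pid || p->pgrp == last_pid ||
                                p->tgid == last_pid || p->session == last_pid) {
                                    if (++last_pid >= next_safe)
                                            next_safe = PID_MAX;
                                    goto repeat;    /* rescan from scratch */
                            }
                            /* shrink the window to the next used value */
                            if (p->pid > last_pid && next_safe > p->pid)
                                    next_safe = p->pid;
                    }
            }
            return last_pid;
    }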
wli | In the "best" case, this results in O(tasks) performance, |
but in the worst, it's O(tasks**2). And it's actually provable that over a |
sequence of operations, things average out so that even though |
some times it may have taken O(tasks**2) time, it averages out to |
O(tasks). |
wli | This is in general okay, though it's more questionable |
on SMP: |
wli | the trouble is that its worst case can be severe enough when there are enough tasks that deadlock detection mechanisms, like the NMI watchdog, can be set off by competing things attempting to write_lock_irq(&tasklist_lock).
wli | The 2.5.x pid allocator uses the hashtable updates to |
help. |
wli | void detach_pid(task_t *task, enum pid_type type)
wli | {
wli |         int nr = __detach_pid(task, type);
wli |         if (!nr)
wli |                 return;
wli |         for (type = 0; type < PIDTYPE_MAX; ++type)
wli |                 if (find_pid(type, nr))
wli |                         return;
wli |         free_pidmap(nr);
wli | }
wli | The for () loop is the important part here. |
wli | It is doing "for all types of process ID's, if we can find this |
ID number in use, return without doing anything" |
wli | If the ID number is completely unused, it crosses it off |
the bitmap. |
wli | This bitmap has an entry for each possible pid number, |
and if the bit is set, the ID number is in use, and if it's not set, the |
number is not in use. |
wli | <sarnold:#qc> a 32k bitmap for pids? |
wli | Actually no |
wli | The PID space was also enlarged in order to be able to |
run more processes. |
wli | The maximum allowable PID space is 4M (4*1024*1024) entries, and its size is controllable via sysctl.
wli | There are actually 2 levels to the map:
wli | The first level points at pages, keeping a count of the free pid's within each.
wli | The second level is the pages themselves, whose bits are used to track individual pid's.
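[ed: the two levels, roughly as declared in 2.5.x kernel/pid.c; a hedged sketch with initialization details omitted.]
    #define BITS_PER_PAGE   (PAGE_SIZE * 8)
    #define PIDMAP_ENTRIES  ((PID_MAX_LIMIT + BITS_PER_PAGE - 1) / \
                             BITS_PER_PAGE)

    typedef struct pidmap {
            atomic_t nr_free;  /* level 1: count of free pids in the page */
            void *page;        /* level 2: one bit per pid value */
    } pidmap_t;

    static pidmap_t pidmap_array[PIDMAP_ENTRIES];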
wli | <chbm:#qc> uint32 |
wli | <chbm:#qc> why not use it all :) |
wli | The only reason is that in general the space overhead |
of enough tasks to utilize a PID space that large renders actually |
using it all infeasible, at least on 32-bit. |
wli | For 64-bit machines with much memory a sufficient |
number of tasks to utilize it may be feasible, but it isn't useful due to |
the overhead. |
wli | It's large enough to accommodate the numbers of tasks |
being demanded by real workloads, which is usually not much |
more than 100K in the largest cases, and since the scanning is O(1) |
(I'll get back to that in a while), the size of the space is mostly |
arbitrary. |
wli | You'll notice that in alloc_pidmap(): |
wli | scan_more:
wli |         offset = find_next_zero_bit(map->page, BITS_PER_PAGE, offset);
wli |         if (offset >= BITS_PER_PAGE)
wli |                 goto next_map;
wli |         if (test_and_set_bit(offset, map->page))
wli |                 goto scan_more;
wli | This means that it is actually walking the bitmap and (in theory) taking O(sizeof(pidspace)); in practice the cache effects dominate, the algorithm vastly outperforms direct tasklist scanning, and it is expected O(1).
wli | With some additional indexing space overhead, it could |
be made worst-case O(1), but any gain in the average case is likely |
to be marginal. |
wli | Future directions: |
wli | The remaining tasklist scans aren't as performance |
critical as the ones above. |
wli | In the core, there is an uncommon operation, which is |
to set the priority of all tasks belonging to a given user, and several |
error paths that scan for tasks having a given mm. |
wli | Also global capability mask setting, and sending a |
signal to all tasks. |
wli | The mm scanning for dumping core and OOM killing is not considered very useful to optimize (by me), and the uid scanning is for such a rare operation that it's unclear whether anything would benefit enough to justify the overhead of maintaining the list links.
wli | Some "miscellaneous" other cases also occur: |
wli | netfilter's ipt_owner module scans the tasklist trying to |
find a task with a matching task->comm |
wli | This actually can become performance critical for |
machines supporting network-intensive workloads using ipt_owner, |
but there is not quite enough infrastructure to deal with it possibly |
being a module. |
wli | There is also a very badly-behaved quadratic tasklist |
scan in /proc/ to simulate reading the root /proc/ directory, where |
there is not yet an acceptable solution. |
wli | One thing where there is a clear direction is task |
validation. |
wli | There is a new mechanism for reference counting task |
structures so that some guarantee can be had of a task structure |
remaining valid. |
wli | There are several parts of the kernel, notably vm86 and |
SBUS environment control drivers, that want to hold references to |
tasks, but the only method they (think) they have of verifying is |
comparing task structure pointers in a tasklist scan. |
wli | This doesn't actually work, because the structures are |
slab-allocated and can be reused by different tasks. |
wli | The solution is now to either hold a reference count on |
the task structure or to put in a hook in release_thread() |
wli | These are clear bugfixes, so there's no doubt about this |
strategy being useful. |
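[ed: a hedged usage sketch; find_task_by_pid(), get_task_struct(), and put_task_struct() are the real 2.5.x interfaces, while the surrounding code is illustrative.]
    struct task_struct *tsk;

    read_lock(&tasklist_lock);
    tsk = find_task_by_pid(pid);
    if (tsk)
            get_task_struct(tsk);   /* pin it: the slab object can't be
                                     * freed and recycled while held */
    read_unlock(&tasklist_lock);

    if (tsk) {
            /* ... safe to use tsk here, even if the task exits ... */
            put_task_struct(tsk);   /* drop the reference */
    }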
wli | There is an additional infrastructure that would |
potentially allow netfilter for instance to deal with the hashtable |
update problem: |
wli | That is, registering functions to be called at fork(), |
exec(), exit(), and other useful task-related events. |
wli | At the moment, there is not a wide variety of potential |
users, and given the unusual nature of the things that need it, there |
may never be. |
wli | But it has the ability to provide the infrastructure for |
facilities like SBUS environment control, vm86, and netfilter to be |
modules yet still efficient. |
wli | <cdub:#qc> do you still intend to provide a generic |
scanning function with a callback for comparison? |
wli | My intention there was to actually very strictly control the |
users of such an interface and limit its use solely to preapproved |
catastrophic events in order to prevent reintroductions of tasklist |
scanning algorithms or generally inefficient code; I'm a little less |
worked up about it these days, and sort of doubt that kind of |
admission control would be popular. |
wli | <cdub:#qc> and LSM has hooks in those spots that |
could do the updates. (similarly task ornaments would help, i think). |
wli | I was actually trying to plug task ornaments for this kind |
of stuff. =) |
wli | Reusing the LSM hooks for these events sounds very |
interesting, though it would potentially be difficult to justify putting |
something like ipt_owner's ->comm hashing needs into what's |
supposedly a security model... a sort of "rename game" might be |
useful to deal with that issue. |
wli | Another future direction which is very likely to actually |
happen is some method of reducing the statically allocated space |
footprint of the pidhashes. |
wli | And a further future direction would be to compare the somewhat more deterministic multilevel bitmap approaches against the current one, to determine whether the cache effects of always updating several levels hurt average performance in real workloads more than the gains in determinism in the current algorithm's worst cases are worth.
wli | I'm not really planning on investigating that unless I get |
really bored; the worst cases have been impossible to trigger. |
wli | Also note that on 32-bit machines with PAE (which requires 3-level pagetables), the 2nd-level pagetables (pmd's) are always instantiated.
wli | This means 12KB of overhead per task.
wli | 32K tasks * 12KB == 384MB, so pmd's consume a large fraction of lowmem at that point; so for a very important case, the 32K space is already enough.
wli | Last, but not least, I'd like to thank Ingo Molnar, whose |
bugfixes, cleanups, and promotion of these methods got them into |
the kernel. |
wli | Okay, thanks to everyone who showed up and listened, |
too. =) |
sarnold | thanks wli :) |
> clap clap clap clap clap clap clap clap clap clap |
sarnold | thanks also to garoeda, translating to dutch in #taee...
sarnold | vizard will translate to spanish later on, for the website...
> clap clap clap clap clap clap clap clap clap clap
gcc | wli: thanks for the talk..
fernand0 | plas plas plas plas plas plas plas plas
Oroz | thanks wli |
> clap clap clap clap clap clap clap clap clap clap |
zwane | plas plas plas =) |
erikm | klap klap klap klap klap |
wli | Anytime. I only hope it was interesting & informative |
enough to satisfy everyone. |
gcc | clap clap clap clap clap clap clap clap clap clap |
> excellent !!! |
> thank you very much !!! |
> clap clap clap clap clap clap clap clap clap clap |
Ducky_Pha | well done |
> clap clap clap clap clap clap clap clap clap clap |
apuigsech | clep clap plas clep clap plas plas plas clep clep
apuigsech | :) |
> clap clap clap clap clap clap clap clap clap clap
zwane | wli is drowned in a sea of 'plas'es |
sarnold | wli: so, a sysctl allows one to use upwards of 4M of |
unswappable memory for a pidbitmap? woah.. |
zwane | sarnold: hehe |
wli | 4M bits, that'd be 0.5MB |
sarnold | ahhhh |
zwane | wli: does -bk1-wli1 address this? |
wli | No it does not. |
wli | The enlarged space (> 32K) is actually useful, and it's |
relatively painful to allocate-on-demand. |
sarnold | wli: would you think there to be any value in pids |
that don't wrap? |
wli | sarnold: Mostly impossible to guarantee. Even 64-bit |
can wrap after sustained fork() & exit() activity. |
sarnold | wli: not hoping to run 4 billion processes at once |
but as a way of uniquely identifying a process over the lifetime of the |
machine? |
wli | You could use a 128-bit PID. |
zwane | sarnold: you mean never recycle? |
sarnold | zwane: yeah |
zwane | nyark |
sarnold | zwane: i'm thinking of it from an audit perspective |
or something... |
zwane | sarnold: i can understand your pov |
wli | Something slightly larger than 64-bit should actually get |
it above reasonable hardware lifetimes. |
sarnold | zwane: with the possibility that knowing the pids will |
never wrap might be useful internally too |
zwane | sarnold: pid+timestamp |
zwane | or some ugly variation of |
wli | sarnold: never recycling would fix a number of |
userspace races also |
sarnold | heh, though converting userspace to handle >30 |
character pids might not be fun :) |
chbm | if you say pid is now %Ld people are likely to shoot |
you :) |
wli | Guaranteeing a long enough interval between pid |
wraps is useful; 32K is actually so small that larger systems can |
cycle it in a matter of seconds or less. |
chbm | humm |
chbm | pid starvation is trivial then ? |
fernand0 | thank you very much for the talk |
wli | Not starvation per se. The pid space enlargement should handle that easily. The point is to make it so that resource scalability issues can't cause rapid wrapping due to high pidspace occupancy, and to make the space large enough that the computation required to cycle through an entire wrap takes "a while", e.g. several minutes. That's enough to put a lot less stress on userspace wrt. races involving pid's as unique identifiers.
wli | Yeah, yeah, it's a kernel fix to a userspace problem, but |
fixing all the pid-as-unique-id races in userspace is infeasible from |
my POV. |
sarnold | wli: you're probably right on that one :-/ |
sarnold | wli: too many $$ in shell scripts... |
apuigsech | wli, one question about linux kernel x86 GDT table... |
sarnold | garoeda: sent :) |
garoeda | i know, i was late :-( (i thought 19 utc and thus 20 local time, but it was already in local time on the conference list... i noticed, but too late)
apuigsech | why is null (not used) the second entry? |
garoeda | sarnold: ok :-) |
wli | apuigsech: Ask away. |
wli | I wonder if 90 minutes was about the right length. |
vizard | it is over? :( |
wli | yeah |
sarnold | vizard: yup |
vizard | damn job |
vizard | wli, i'll have to translate your talk later, i hope you didn't use too-technical words in it :)
sarnold | vizard: good luck :) |
> clap clap clap clap clap clap clap clap clap clap