<sarnold> welcome to the fourth day of umeet presentations :)
<sarnold> today's first speaker is William Lee Irwin III
<wli> So for general background, generalized pid hashing is supposed to improve the performance of in-kernel algorithms so they scale better with the number of tasks.
<sarnold> (please direct questions and comments to #qc; thanks)
<wli> In general, larger systems tend to run things like forking servers or run many different servers (possibly from different users) simultaneously.
<wli> Almost all of the 2.5.x alterations done for this deal directly with very traditional computational complexity issues from basic algorithmic analysis.
<wli> So to begin looking further in-depth, let's examine some 2.4.x code that was seen to have issues.
<wli> (not just ones directly addressed by pid hashing, but ones that involve searches over all tasks)
<wli> (1) count_active_tasks()
<wli> This is trying to compute the load average.
<wli> In 2.4.x, the body of this function is:
<wli>         for_each_task(p) {
<wli>                 if ((p->state == TASK_RUNNING ||
<wli>                      (p->state & TASK_UNINTERRUPTIBLE)))
<wli>                         nr += FIXED_1;
<wli>         }
<wli> The notion is simple: count up the number of tasks that are either TASK_RUNNING or TASK_UNINTERRUPTIBLE.
<wli> The 2.5.x code does the same thing, but implements it using a different principle:
<wli> "Keep count, and when asked, just hand back the number."
<grimoire:#qc> this may be silly, but what is FIXED_1 defined as?
<wli> FIXED_1 is a constant used for fixed-point arithmetic.
<wli> It represents 1.
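[For reference, FIXED_1 comes from the fixed-point load-average macros in include/linux/sched.h, roughly as follows (2.4/2.5-era sketch):]
        #define FSHIFT   11              /* nr of bits of precision */
        #define FIXED_1  (1 << FSHIFT)   /* 1.0 as fixed-point */
[So a load of 1.0 is stored as 2048, and the values userspace sees are obtained by shifting back down by FSHIFT.]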
<wli> There is a tradeoff in this strategy: changing the state has new accounting overhead, but examining the state is much cheaper.
<wli> The new accounting overhead takes the form of incrementing and decrementing counters and moving things between lists and so on.
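[For comparison, the 2.5.x/2.6.x count_active_tasks() in kernel/timer.c reduces to reading the maintained counters; approximately (a sketch, not a verbatim quote):]
        static unsigned long count_active_tasks(void)
        {
                /* the scheduler keeps these counters up to date at sleep/wakeup */
                return (nr_running() + nr_uninterruptible()) * FIXED_1;
        }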
<wli> count_active_tasks() is actually somewhat instrumental in another way: it has been seen to be a reliability problem with increased clock rates on SMP servers (or at least I've triggered the problem in testing).
<wli> This is because the timer tick happens often, but the time spent in it depends on the number of tasks.
<wli> So when it takes long enough to scan all tasks, you experience something called "timeslice overrun".
<wli> This kind of problem is not unique to that function.
<wli> (2) sys_setpgid()
<wli> This is not a frequently exercised function, or at least not usually considered performance critical, but the 2.4.x implementation shows some of the tradeoffs.
<wli> If I may back up by a small margin:
<jmgv:#qc> but when do you update the task counter? I supposed each task adds 1 when it is born and subtracts 1 before it is killed... isn't it?
<wli> In general, alterations to task->state are done to pass a sort of implicit argument to the scheduler.
<wli> So basically, the scheduler increments and decrements the counter for TASK_RUNNING when things sleep or wake, and increments and decrements a second counter for TASK_UNINTERRUPTIBLE when things sleep and wake. In principle a single counter could be used, but TASK_RUNNING wants to be counted in isolation.
<wli> This is 2.5.x's nr_uninterruptible() and rq->nr_uninterruptible.
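[A sketch of how the global count is assembled from the per-runqueue counters, along the lines of kernel/sched.c (details such as the CPU iteration may differ):]
        unsigned long nr_uninterruptible(void)
        {
                unsigned long i, sum = 0;

                /* sum the per-CPU counters maintained by the scheduler */
                for (i = 0; i < NR_CPUS; i++) {
                        if (!cpu_online(i))
                                continue;
                        sum += cpu_rq(i)->nr_uninterruptible;
                }
                return sum;
        }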
<wli> So back to (2).
<wli> The critical part of this to examine is:
<wli>         if (pgid != pid) {
<wli>                 struct task_struct * tmp;
<wli>                 for_each_task (tmp) {
<wli>                         if (tmp->pgrp == pgid &&
<wli>                             tmp->session == current->session)
<wli>                                 goto ok_pgid;
<wli>                 }
<wli>                 goto out;
<wli> At this point we see that there are 2 checks happening:
<wli> (A) checking for someone already having a given process group ID
<wli> (B) checking for someone being in the same session
<wli> It makes one think "Wouldn't it be nice if there were some other way to find them?"
<wli> In 2.5.x things are a bit different.
<wli> The old infrastructure in 2.4.x has a hashtable only for task->pid.
<wli> In 2.5.x there are now hashtables for task->pid, task->tgid, task->pgrp, and task->session.
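[The four ID classes are enumerated in include/linux/pid.h roughly as:]
        enum pid_type
        {
                PIDTYPE_PID,
                PIDTYPE_TGID,
                PIDTYPE_PGID,
                PIDTYPE_SID,
                PIDTYPE_MAX
        };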
<wli> That same code looks like:
<wli>                 for_each_task_pid(pgid, PIDTYPE_PGID, p, l, pid)
<wli>                         if (p->session == current->session)
<wli>                                 goto ok_pgid;
<seletz:#qc> maybe a dumb question: what do those hash tables contain? key, value i mean.
<wli> Okay, this question will help explain what goes on in that code.
<wli> There is a "struct pid" representing a given "process ID".
<wli> The notion of "process ID" is a generalization, so it doesn't just mean PID; it includes PID, PGID, TGID, and SID.
<wli> The struct pid has inside of it a list of all the tasks sharing the same value of that ID.
<wli> For instance, suppose there were 20 tasks in a given process group.
<wli> The struct pid for that pgrp will show up in the hashtable's bucket lists, and you look at the struct pid and walk the list inside it.
<wli> This should give you an idea of what's in for_each_task_pid() already.
<wli> #define for_each_task_pid(who, type, task, elem, pid)                  \
<wli>         if ((pid = find_pid(type, who)))                        \
<wli>                 for (elem = pid->task_list.next,                        \
<wli>                         prefetch(elem->next),                           \
<wli>                         task = pid_task(elem, type);                    \
<wli>                         elem != &pid->task_list;                        \
<wli>                         elem = elem->next, prefetch(elem->next),        \
<wli>                         task = pid_task(elem, type))
<wli> pid = find_pid(type, who) /* find_pid() does a hashtable lookup, and finds a struct pid that matches */
<wli> Specifically, find_pid() takes as its first argument what kind of pid (PGID/SID/etc.), and as its second argument the value.
<wli> This is used to start up the loop.
<wli> The actual looping part looks at pid->task_list
<wli> and is essentially a list_for_each_entry().
<wli> You'll notice that in a struct pid there are 2 sets of list links:
<wli> One for the hashtable and one for the list of tasks: this is to help deal with sharing.
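[A sketch of the 2.5.x-era declarations from include/linux/pid.h (approximate, not verbatim): hash_chain links the struct pid into its hashtable bucket, task_list collects all tasks sharing the ID, and each task carries a pid_link per ID type to join that list.]
        struct pid
        {
                int nr;                         /* the numeric ID value */
                atomic_t count;
                struct task_struct *task;
                struct list_head task_list;     /* all tasks sharing this ID */
                struct list_head hash_chain;    /* bucket list in the pid hashtable */
        };

        struct pid_link
        {
                struct list_head pid_chain;     /* links a task into pid->task_list */
                struct pid *pid;
        };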
<wli> Now this is where the principle I mentioned about state-changing overhead vs. state-examination overhead comes into play:
<wli> These hashtables have to be updated at fork(), exec(), and exit().
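[For instance, copy_process() in kernel/fork.c hooks a new task into the hashtables with calls along these lines (a sketch; the exact call sites and thread-group handling differ slightly), and the matching detach_pid() calls happen on the exit path:]
        attach_pid(p, PIDTYPE_PID, p->pid);
        if (thread_group_leader(p)) {
                attach_pid(p, PIDTYPE_TGID, p->tgid);
                attach_pid(p, PIDTYPE_PGID, p->pgrp);
                attach_pid(p, PIDTYPE_SID, p->session);
        }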
<wli> There is another class of side effects:
<wli> The locking changes in some (less important) cases.
<wli> sys_setpgid() is one of those cases.
<wli> In 2.4.x there was no structure associated with pgrp's that had to be altered when they change, so 2.4.x took the tasklist_lock for reading.
<wli> In 2.5.x there is, so the tasklist_lock must be taken for writing.
<wli> This is a different tradeoff: certain operations are now less parallelizable, though they are more efficient in serial.
<wli> The general tendency with these operations that were serialized in 2.5.x is that they are "uncommon" operations, ones that aren't usually requested often or considered performance-critical, whereas the operations that were sped up were either not seriously affected by accounting overhead, or drastically sped up, and were already performed in serial.
<wli> There was another side effect, and in fact, it was used to promote this infrastructure for the kernel:
<wli> This accounting provides explicit notification of when a pid is no longer in use.
<wli> At first glance, this doesn't seem terribly important, but there is a particular operation that would like to know.
<wli> During fork(), a process needs to be assigned a unique ID. This is called "pid allocation".
<wli> The 2.4.x pid allocator's basic strategy is "scan the tasklist and check to see if the ID I guessed at the beginning is actually good".
<wli> It's actually a somewhat refined, but not very deterministic, algorithm.
<wli> The 2.4.x pid allocator maintains a "window" of safe pid's, and moves, grows, and shrinks this window as it allocates pid's and scans the tasklist.
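[The heart of 2.4.x's get_pid() in kernel/fork.c looks roughly like this (abridged and approximate): last_pid is the current guess, next_safe is the upper edge of the window known to be free, and any collision restarts the whole tasklist scan.]
        if (last_pid >= next_safe) {
                next_safe = PID_MAX;
                read_lock(&tasklist_lock);
        repeat:
                for_each_task(p) {
                        if (p->pid == last_pid || p->pgrp == last_pid ||
                            p->session == last_pid) {
                                if (++last_pid >= next_safe) {
                                        if (last_pid & 0xffff8000)
                                                last_pid = 300;
                                        next_safe = PID_MAX;
                                }
                                goto repeat;    /* rescan the entire tasklist */
                        }
                        /* shrink the safe window toward the nearest ID in use */
                        if (p->pid > last_pid && next_safe > p->pid)
                                next_safe = p->pid;
                        if (p->pgrp > last_pid && next_safe > p->pgrp)
                                next_safe = p->pgrp;
                        if (p->session > last_pid && next_safe > p->session)
                                next_safe = p->session;
                }
                read_unlock(&tasklist_lock);
        }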
wliIn the "best" case, this results in O(tasks) performance,
but in the worst, it's O(tasks**2). And it's actually provable that over a
sequence of operations, things average out so that even though
some times it may have taken O(tasks**2) time, it averages out to
O(tasks).
wliThis is in general okay, though it's more questionable
on SMP:
<wli> The trouble is that its worst case can be severe enough, when there are enough tasks, that deadlock-detection mechanisms like the NMI watchdog can be set off by competing things attempting to write_lock_irq(&tasklist_lock).
<wli> The 2.5.x pid allocator uses the hashtable updates to help.
<wli> void detach_pid(task_t *task, enum pid_type type)
<wli> {
<wli>         int nr = __detach_pid(task, type);
<wli>         if (!nr)
<wli>                 return;
<wli>         for (type = 0; type < PIDTYPE_MAX; ++type)
<wli>                 if (find_pid(type, nr))
<wli>                         return;
<wli>         free_pidmap(nr);
<wli> }
<wli> The for () loop is the important part here.
<wli> It is doing "for all types of process ID's, if we can find this ID number in use, return without doing anything".
<wli> If the ID number is completely unused, it crosses it off the bitmap.
<wli> This bitmap has an entry for each possible pid number; if the bit is set, the ID number is in use, and if it's not set, the number is not in use.
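[free_pidmap() is the "cross it off" step; approximately (a sketch of kernel/pid.c, using the pidmap_t type described just below):]
        static void free_pidmap(int pid)
        {
                pidmap_t *map = pidmap_array + pid / BITS_PER_PAGE;
                int offset = pid & BITS_PER_PAGE_MASK;

                clear_bit(offset, map->page);   /* mark the ID as free again */
                atomic_inc(&map->nr_free);      /* bump the page's free count */
        }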
wli<sarnold:#qc> a 32k bitmap for pids?
wliActually no
wliThe PID space was also enlarged in order to be able to
run more processes.
wliThe maximum allowable PID space is 4M large, and its
size is controllable via sysctl
wliThere are actually 2 levels to the map:
wliThe first is a level pointing to pages with a count of free
pid's within it.
wliThe second is a page whose bits are used to track
individual pid's.
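[Roughly, in kernel/pid.c (sketch):]
        typedef struct pidmap {
                atomic_t nr_free;       /* how many free IDs this page still covers */
                void *page;             /* one page of bits, one bit per ID */
        } pidmap_t;

        static pidmap_t pidmap_array[PIDMAP_ENTRIES] =
                { [ 0 ... PIDMAP_ENTRIES-1 ] = { ATOMIC_INIT(BITS_PER_PAGE), NULL } };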
wli<chbm:#qc> uint32
wli<chbm:#qc> why not use it all :)
wliThe only reason is that in general the space overhead
of enough tasks to utilize a PID space that large renders actually
using it all infeasible, at least on 32-bit.
wliFor 64-bit machines with much memory a sufficient
number of tasks to utilize it may be feasible, but it isn't useful due to
the overhead.
wliIt's large enough to accommodate the numbers of tasks
being demanded by real workloads, which is usually not much
more than 100K in the largest cases, and since the scanning is O(1)
(I'll get back to that in a while), the size of the space is mostly
arbitrary.
wliYou'll notice that in alloc_pidmap():
wliscan_more:
wli        offset = find_next_zero_bit(map->page,
BITS_PER_PAGE, offset);
wli        if (offset >= BITS_PER_PAGE)
wli                goto next_map;
wli        if (test_and_set_bit(offset, map->page))
wli                goto scan_more;
wliThis means that it is actually walking the bitmap and (in
theory) taking O(sizeof(pidspace)); in practice the cache effects
dominate, and the algorithm dominates direct tasklist scanning, and
is expected O(1).
<wli> With some additional indexing space overhead, it could be made worst-case O(1), but any gain in the average case is likely to be marginal.
<wli> Future directions:
<wli> The remaining tasklist scans aren't as performance critical as the ones above.
<wli> In the core, there is an uncommon operation, which is to set the priority of all tasks belonging to a given user, and several error paths that scan for tasks having a given mm.
<wli> Also global capability mask setting, and sending a signal to all tasks.
<wli> The mm scanning for dumping core and OOM killing is not considered very useful to optimize (by me), and the uid scanning is for such a rare operation that it's unclear what would benefit from the overhead of maintaining the list links.
<wli> Some "miscellaneous" other cases also occur:
<wli> netfilter's ipt_owner module scans the tasklist trying to find a task with a matching task->comm.
<wli> This actually can become performance critical for machines supporting network-intensive workloads using ipt_owner, but there is not quite enough infrastructure to deal with it possibly being a module.
<wli> There is also a very badly behaved quadratic tasklist scan in /proc/ to simulate reading the root /proc/ directory, where there is not yet an acceptable solution.
<wli> One thing where there is a clear direction is task validation.
<wli> There is a new mechanism for reference counting task structures so that some guarantee can be had of a task structure remaining valid.
<wli> There are several parts of the kernel, notably vm86 and SBUS environment control drivers, that want to hold references to tasks, but the only method they (think they) have of verifying is comparing task structure pointers in a tasklist scan.
<wli> This doesn't actually work, because the structures are slab-allocated and can be reused by different tasks.
<wli> The solution is now to either hold a reference count on the task structure or to put a hook in release_thread().
<wli> These are clear bugfixes, so there's no doubt about this strategy being useful.
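[The reference-counting mechanism is the get_task_struct()/put_task_struct() pair; usage is along these lines (a sketch):]
        read_lock(&tasklist_lock);
        tsk = find_task_by_pid(nr);
        if (tsk)
                get_task_struct(tsk);   /* pin the task_struct */
        read_unlock(&tasklist_lock);

        /* ... tsk stays valid here even if the task exits ... */

        if (tsk)
                put_task_struct(tsk);   /* drop the pin; may free the struct */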
<wli> There is an additional infrastructure that would potentially allow netfilter, for instance, to deal with the hashtable update problem:
<wli> That is, registering functions to be called at fork(), exec(), exit(), and other useful task-related events.
<wli> At the moment, there is not a wide variety of potential users, and given the unusual nature of the things that need it, there may never be.
<wli> But it has the ability to provide the infrastructure for facilities like SBUS environment control, vm86, and netfilter to be modules yet still be efficient.
<cdub:#qc> do you still intend to provide a generic scanning function with a callback for comparison?
<wli> My intention there was actually to very strictly control the users of such an interface and limit its use solely to preapproved catastrophic events, in order to prevent reintroductions of tasklist-scanning algorithms or generally inefficient code; I'm a little less worked up about it these days, and sort of doubt that kind of admission control would be popular.
<cdub:#qc> and LSM has hooks in those spots that could do the updates. (similarly, task ornaments would help, i think)
<wli> I was actually trying to plug task ornaments for this kind of stuff. =)
<wli> Reusing the LSM hooks for these events sounds very interesting, though it would potentially be difficult to justify putting something like ipt_owner's ->comm hashing needs into what's supposedly a security model... a sort of "rename game" might be useful to deal with that issue.
<wli> Another future direction, which is very likely to actually happen, is some method of reducing the statically allocated space footprint of the pidhashes.
<wli> A further future direction would be to compare the somewhat more deterministic multilevel bitmap approaches to the current approach, to determine whether the average performance in real workloads is negatively affected by the cache effects of always updating several levels, to the point where the gains in determinism in the current algorithm's worst cases are not worth it.
<wli> I'm not really planning on investigating that unless I get really bored; the worst cases have been impossible to trigger.
<wli> Also note that on 32-bit machines with PAE (which requires 3-level pagetables) the 2nd-level pagetables (pmd's) are always instantiated.
<wli> This means 12KB of overhead per task.
<wli> 32K tasks * 12KB == 384MB of pmd's alone, and each task also needs roughly 8KB of lowmem for its kernel stack, so lowmem is effectively exhausted at that point; so for a very important case, the 32K space is already enough.
<wli> Last, but not least, I'd like to thank Ingo Molnar, whose bugfixes, cleanups, and promotion of these methods got them into the kernel.
<wli> Okay, thanks to everyone who showed up and listened, too. =)
<sarnold> thanks wli :)
> clap clap clap clap clap clap clap clap clap clap
<sarnold> thanks also to garoeda, translating to dutch in #taee...
<sarnold> vizard will translate to spanish later on, for the website...
<gcc> wli: thanks for the talk..
<fernand0> plas plas plas plas plas plas plas plas
<Oroz> thanks wli
<zwane> plas plas plas =)
<erikm> klap klap klap klap klap
<wli> Anytime. I only hope it was interesting & informative enough to satisfy everyone.
<gcc> clap clap clap clap clap clap clap clap clap clap
> excellent !!!
> thank you very much !!!
<Ducky_Pha> well done
<apuigsech> clep clap plas clep clap plas plas plas clep clep
<apuigsech> :)
<zwane> wli is drowned in a sea of 'plas'es
<sarnold> wli: so, a sysctl allows one to use upwards of 4M of unswappable memory for a pid bitmap? woah..
<zwane> sarnold: hehe
<wli> 4M bits, that'd be 0.5MB
<sarnold> ahhhh
<zwane> wli: does -bk1-wli1 address this?
<wli> No, it does not.
<wli> The enlarged space (> 32K) is actually useful, and it's relatively painful to allocate on demand.
<sarnold> wli: would you think there to be any value in pids that don't wrap?
<wli> sarnold: Mostly impossible to guarantee. Even 64-bit can wrap after sustained fork() & exit() activity.
<sarnold> wli: not hoping to run 4 billion processes at once, but as a way of uniquely identifying a process over the lifetime of the machine?
<wli> You could use a 128-bit PID.
<zwane> sarnold: you mean never recycle?
<sarnold> zwane: yeah
<zwane> nyark
<sarnold> zwane: i'm thinking of it from an audit perspective or something...
<zwane> sarnold: i can understand your pov
<wli> Something slightly larger than 64-bit should actually get it above reasonable hardware lifetimes.
<sarnold> zwane: with the possibility that knowing the pids will never wrap might be useful internally too
<zwane> sarnold: pid+timestamp
<zwane> or some ugly variation thereof
<wli> sarnold: never recycling would fix a number of userspace races also
<sarnold> heh, though converting userspace to handle >30 character pids might not be fun :)
<chbm> if you say pid is now %Ld people are likely to shoot you :)
<wli> Guaranteeing a long enough interval between pid wraps is useful; 32K is actually so small that larger systems can cycle through it in a matter of seconds or less.
<chbm> humm
<chbm> pid starvation is trivial then?
<fernand0> thank you very much for the talk
<wli> Not starvation per se. The pid space enlargement should handle that easily. It's just a matter of making it so that resource scalability issues can't cause rapid wrapping due to high pidspace occupancy, plus making it large enough that the computation required to cycle through an entire wrap takes "a while" (e.g. several minutes), which is enough to put a lot less stress on userspace wrt. races involving pid's as unique identifiers.
<wli> Yeah, yeah, it's a kernel fix to a userspace problem, but fixing all the pid-as-unique-id races in userspace is infeasible from my POV.
<sarnold> wli: you're probably right on that one :-/
<sarnold> wli: too many $$ in shell scripts...
<apuigsech> wli, one question about the linux kernel x86 GDT table...
<sarnold> garoeda: sent :)
<garoeda> i know, i was late :-( (thought 19 utc thus 20 localtime, but it was already in localtime on the conference list... i noticed it, but too late)
<apuigsech> why is the second entry null (not used)?
<garoeda> sarnold: ok :-)
<wli> apuigsech: Ask away.
<wli> I wonder if 90 minutes was about the right length.
<vizard> is it over? :(
<wli> yeah
<sarnold> vizard: yup
<vizard> damn job
<vizard> wli, i'll have to translate your talk later, i hope you didn't use too-technical words in it :)
<sarnold> vizard: good luck :)
