IV International Conference of Unix at Uninet
Talk


<Borja> Next up is Martin Bligh, who is going to talk to us about VM and NUMA
<Borja> He is one of the authors of the 2.6 kernel, together with Linus Torvalds
<Borja> He works at IBM.
<Borja> The talk will be translated into Spanish in #redes and into Dutch in #taee
<Borja> Now, we are going to read Martin Bligh. He is one of the authors of the Linux 2.6 kernel, and he works for IBM.
<Borja> This talk will be translated into Spanish in #redes, and into Dutch in #taee
<MJesus> Please put questions and comments in #qc
<Borja> Welcome, Martin
<Borja> :-)
<mbligh> I'm going to focus on NUMA machines, and on the 2.6 kernel.
<mbligh> Thanks - to keep this to a reasonable time limit, I'm going to make some simplifications - nothing too important, but feel free to ask questions if there's something I'm skipping over.
<mbligh> NUMA = non-uniform memory architecture
<mbligh> where we basically have an SMP machine with non-uniform characteristics - memory, CPUs, and IO buses are not equally spaced from each other.
<mbligh> The reason for this is that it's very hard (and expensive) to build a flat uniform SMP machine of > 4 CPUs or so
<mbligh> once you start to build a machine that big, you start to have to slow down the system buses, etc, to cope with the size
<mbligh> you can get a larger, faster machine by having some sets of resources closer to each other than others.
<mbligh> If you don't make that decision, basically you end up with a machine where *everything* becomes slower, instead of just a few resources
<mbligh> We normally define those groupings as "nodes" - a typical machine might have 4 CPUs per node, plus some memory and some IO buses on each node
<mbligh> There are now newer architectures, like AMD's x86_64, that have one CPU per node, and local memory for each processor.
<mbligh> with the advent of that, we have commodity NUMA systems for much lower prices ($2000 or so)
<mbligh> and much greater interest in the technology in the marketplace.
<mbligh> Often, machines that are slightly non-uniform (ie slightly NUMA) are sold as SMP for simplicity's sake.
<mbligh> Large machines from companies like SGI now have 512 CPUs or more.
<mbligh> It might help to envisage the machine as a group of standard SMP machines, connected by a very fast interconnect
<mbligh> somewhat like a network connection, except that the transfers over that bus are transparent to the operating system
<mbligh> Indeed, some earlier systems were built exactly like that - the older Sequent NUMA-Q hardware uses a standard 450NX 4x chipset, with a "magic" NUMA interface plugged into the system bus of each node to interconnect them, and pass traffic between them.
<mbligh> The traditional measure of how NUMA a machine is (how non-uniform) is to take a simple ratio of the memory latency to access local memory vs remote memory
<mbligh> ie if I can do a local memory access (memory on the same node as the CPU) in 80ns, and a remote memory access (CPU on a different node from the memory) in 800ns, the ratio is expressed as 10:1
<mbligh> Unfortunately, that's only a very approximate description of the machine
<mbligh> and doesn't take into account lots of important factors, such as the bandwidth of the interconnect
<mbligh> once the interconnect starts to become contended, that 800ns could easily become 8000ns.
<mbligh> thus it's very important for us to keep accesses local wherever possible.
<mbligh> Often, we're asked why people don't use clusters of smaller machines, instead of a large NUMA machine
<mbligh> indeed, that would be much cheaper, for the same amount of CPU horsepower.
<mbligh> unfortunately, it makes the application's work much harder - all of the intercommunication and load balancing now has to be more explicit, and more complex
<mbligh> some large applications (eg database servers) don't split up onto multiple cluster nodes easily
<mbligh> in those sorts of situations, people often use NUMA machines.
<mbligh> Another nice effect of using a NUMA machine is that the balancing problems, etc are solved once in the operating system, instead of repeatedly in every application that runs on it.
<mbligh> We also abstract the hardware knowledge down into the OS, so that the applications become more portable.
<mbligh> There are several levels of NUMA support we could have:
<mbligh> 1. Pretend that the hardware is SMP, ignoring locality and NUMA characteristics
<mbligh> 2. Implicit support - the OS tries to use local resources where possible, and to group applications and relevant resources together as closely as possible.
<mbligh> 3. Explicit support - provide an API to userspace, whereby the application can specify (in some abstracted fashion) to the OS what it wants
<mbligh> in the Linux 2.6.0 kernel, we're really at the start of stage 2. We've got some very basic support for bits of stage 3.
<mbligh> Stage 1 does work, but doesn't get very good performance
<mbligh> that's what I did when I first ported Linux to the Sequent NUMA-Q platform
<mbligh> The first step to NUMA support is to try to allocate memory local to the CPU that the application is running on.
<mbligh> we do that by default in Linux whenever the kernel calls __alloc_pages (the main memory allocator)
<mbligh> if we run out of memory on the local node, it automatically falls back to getting memory from another node - it's just slower.
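
To make that fallback order concrete, here is a minimal toy sketch in C (not kernel code): the per-node free-page counts are invented for illustration, and the allocator simply walks the nodes starting from the local one.

    #include <stdio.h>

    #define NR_NODES 4

    /* Invented numbers for illustration: node 0 has run out of pages. */
    static long free_pages[NR_NODES] = { 0, 1000, 1000, 1000 };

    /* Return the node a page was actually taken from, or -1 if everything is full. */
    static int alloc_page_on(int local_node)
    {
        for (int i = 0; i < NR_NODES; i++) {
            int node = (local_node + i) % NR_NODES;   /* local node first, then fall back */
            if (free_pages[node] > 0) {
                free_pages[node]--;
                return node;
            }
        }
        return -1;
    }

    int main(void)
    {
        /* A task on node 0 still gets memory, just from node 1 (slower to access). */
        printf("allocated from node %d\n", alloc_page_on(0));
        return 0;
    }
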
<mbligh> NUMA support is enabled by CONFIG_NUMA, which depends on CONFIG_DISCONTIGMEM
<mbligh> though the memory may actually not be discontiguous (as "discontigmem" suggests) - that's just historical
<mbligh> So instead of having 3 memory zones for the system (eg ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM) we now end up with 3 zones for each node ... though many of them often turn out to be empty (no memory in them).
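
A rough sketch of that per-node zone layout; the names below are illustrative only (the real kernel hangs its zones off a per-node data structure rather than a plain 2-D array).

    /* Illustration only: one full set of zones per node, indexed [node][zone].
     * The _X suffixes avoid pretending these are the kernel's actual types. */
    enum zone_x_type { ZONE_DMA_X, ZONE_NORMAL_X, ZONE_HIGHMEM_X, NR_ZONES_X };

    struct zone_x {
        unsigned long start_pfn;   /* first page frame number in this zone */
        unsigned long nr_pages;    /* may be 0: many per-node zones are empty */
    };

    static struct zone_x zones[4][NR_ZONES_X];   /* eg node 2's highmem is zones[2][ZONE_HIGHMEM_X] */
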
<mbligh> For each physical page of memory in a Linux system we have a "struct page" control entry, which we group into an array called mem_map, which is one contiguous block of memory
<mbligh> on NUMA systems, we break that into one array per node (lmem_map[node])
<mbligh> but we still essentially have one struct page per physical page of memory
<mbligh> breaking that array up gives us several advantages - one is code simplicity
<mbligh> another is that we can allocate those control structures from the node's own memory, improving access times.
<mbligh> On a typical system, we might have 16GB of RAM, and 4 nodes (each node has 4GB of memory)
<mbligh> that ends up as physical address ranges 0-4GB on node 0, 4-8GB on node 1, 8-12GB on node 2, and 12-16GB on node 3
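
Under that 4GB-per-node layout, finding a page's per-node control entry is just arithmetic. The following is an illustrative sketch assuming 4KB pages, not the kernel's real macros:

    #include <stdio.h>

    /* 4GB per node / 4KB pages = 1M page frames per node (illustrative layout). */
    #define PAGES_PER_NODE (4ULL * 1024 * 1024 * 1024 / 4096)

    struct page_location {
        int node;                        /* which node's lmem_map[] holds the entry */
        unsigned long long local_index;  /* offset of the entry within that array */
    };

    static struct page_location pfn_to_location(unsigned long long pfn)
    {
        struct page_location loc;
        loc.node = (int)(pfn / PAGES_PER_NODE);
        loc.local_index = pfn % PAGES_PER_NODE;
        return loc;
    }

    int main(void)
    {
        /* Physical address 6GB lands in node 1's 4-8GB range. */
        struct page_location loc = pfn_to_location(6ULL * 1024 * 1024 * 1024 / 4096);
        printf("node %d, index %llu in lmem_map[%d]\n", loc.node, loc.local_index, loc.node);
        return 0;
    }
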
<mbligh> on a 32 bit system, that presents a bit of a problem - all of the kernel's permanent memory is allocated from the first GB of physical RAM
<mbligh> unfortunately, that turns out to be rather hard to fix - many drivers assume that the physical addresses for the kernel memory (ZONE_NORMAL) will fit into an unsigned long (32 bits)
<mbligh> so we can't easily spread that memory around the system ... that's a performance problem that we still have
<mbligh> Some things (eg the lmem_map arrays) are relocated by some special hacks at boot time, but most of the structures (eg entries for the dcache and inode cache) still all have to reside on node 0.
<mbligh> One of the other things we do to reduce cross-node traffic is that instead of one global swap daemon (kswapd) to do page reclaim, we have one daemon per node, each scanning just its own node's pages.
<mbligh> that gives us a lot better performance, and lower interconnect traffic during memory reclaim.
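
A rough userspace analogue of "one reclaim daemon per node", with pthreads standing in for kernel threads; scan_node_pages() is a hypothetical placeholder, not a real kernel function.

    #include <pthread.h>
    #include <unistd.h>

    #define NR_NODES 4

    /* Hypothetical placeholder: scan and reclaim pages belonging to one node only. */
    static void scan_node_pages(long node)
    {
        (void)node;
        sleep(1);
    }

    static void *node_kswapd(void *arg)
    {
        long node = (long)arg;
        for (;;)
            scan_node_pages(node);   /* never touches another node's lists */
        return NULL;
    }

    int main(void)
    {
        pthread_t tid[NR_NODES];

        /* One reclaim thread per node, instead of a single global daemon. */
        for (long n = 0; n < NR_NODES; n++)
            pthread_create(&tid[n], NULL, node_kswapd, (void *)n);
        pthread_join(tid[0], NULL);  /* never returns in this toy */
        return 0;
    }
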
<mbligh> we can also replicate copies of read-only data to each node.
<mbligh> for instance, we can copy the kernel binary to each node, and have the CPUs on each node only map into their own local copy - this uses only a little extra memory, and saves lots of interconnect bandwidth
<mbligh> Dave Hansen has also created a patch to replicate read-only user data (eg the text of glibc, and programs like gcc) to each node, which creates a similar benefit.
<mbligh> that gave us a 5% - 40% performance increase, depending on what other patches we were using together with it, and what benchmark we ran.
<mbligh> replicating read-write data would be difficult (keeping multiple copies in sync on write) and probably not worth the benefit.
<mbligh> so for now, we'll just do read-only data.
<mbligh> The 2.6 VM also has per-node LRU lists (least recently used lists of which memory pages have been accessed recently)
<mbligh> instead of a global list.
<mbligh> not only does this give us more localized access to information, but it also allows us to break up pagemap_lru_lock
<mbligh> which is the lock controlling the LRU lists - before we broke that up, we were spending 50% of the system time during a kernel compile just spinning waiting for that one lock
<mbligh> once it was broken up, the time is so small it's now unmeasurable.
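
The shape of that change, sketched in plain C with a pthread mutex standing in for the kernel's spinlock; structure and field names are illustrative, not the kernel's.

    #include <pthread.h>

    #define NR_NODES 4

    struct page_list { struct page_list *next, *prev; };   /* stand-in for a list head */

    /* One LRU, with its own lock, per node - instead of a single global
     * pagemap_lru_lock that every CPU fights over. */
    struct node_lru {
        pthread_mutex_t lock;
        struct page_list active;     /* recently referenced pages on this node */
        struct page_list inactive;   /* reclaim candidates on this node */
    };

    static struct node_lru lru[NR_NODES];
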
<mbligh> OK ... that's most of the big VM stuff ... scheduler next.
<mbligh> there's not much point in carefully allocating node-local memory to a process if we then migrate that process then instantly gets migrated to another node
<mbligh> oops.
<mbligh> there's not much point in carefully allocating node-local memory to a process if we then instantly migrate that process to another node
<mbligh> So we took the basic O(1) scheduler in 2.6, and changed it a little
<mbligh> with the O(1) scheduler, there's 1 runqueue of tasks per node
<mbligh> gah ... 1 runqueue of tasks per cpu, sorry.
<mbligh> in flat SMP mode, each CPU runs tasks just from its own runqueue, and we occasionally balance between different runqueues
<mbligh> but on a NUMA system, we don't want to migrate tasks between nodes if possible
<mbligh> we want to keep them on the local node - to keep caches warm and memory local.
<mbligh> So we change the standard balancing algorithm to only balance between the runqueues of CPUs on the same node as each other
<mbligh> and we add another balancing algorithm, which is much more conservative, to balance tasks between nodes.
<mbligh> So we rarely balance tasks between nodes.
<mbligh> However, at the exec() time of a process, it has very little state (we've just said to overwrite it with a new process, effectively)
<mbligh> so at exec time, we also do a cross-node rebalance, and throw the exec'ed task to the most lightly loaded node
<mbligh> that actually does a lot of our balancing for us, and is nearly free to do.
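
A toy of that exec()-time placement decision: the node loads below are invented numbers, and a real scheduler would of course measure load rather than hard-code it.

    #include <stdio.h>

    #define NR_NODES 4

    /* Invented load figures, eg number of runnable tasks per node. */
    static int node_load[NR_NODES] = { 7, 3, 5, 2 };

    /* At exec() the task has no warm caches and no node-local memory yet,
     * so moving it anywhere is nearly free: simply pick the least loaded node.
     * The frequent intra-node balancing then keeps it among that node's CPUs. */
    static int exec_balance_node(void)
    {
        int best = 0;
        for (int n = 1; n < NR_NODES; n++)
            if (node_load[n] < node_load[best])
                best = n;
        return best;
    }

    int main(void)
    {
        printf("exec()'d task placed on node %d\n", exec_balance_node());   /* node 3 */
        return 0;
    }
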
<mbligh> The code that's in 2.6 to support NUMA scheduling is still fairly basic, and needs lots more work.
<mbligh> My main plan for the future is to keep track of the RSS (resident set size of memory) of each task on a per-node basis, as well as globally
<mbligh> then we can use that information to make better decisions - if most of a process's memory is on node X, we should try to migrate that process to node X
<mbligh> To get good rebalancing, we also want to take into account how much CPU the task is using - it's going to have more effect to rebalance a task if it's using more CPU.
<mbligh> but it's cheaper to migrate if it has a smaller cache footprint (which we approximately measure by RSS)
<mbligh> So we end up with a "goodness" calculation for migration that's something like "cpu_percentage/(remote_rss - local_rss)"
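
Written out as code, that heuristic might look like the sketch below; the guard against a non-positive denominator is an addition of this sketch, and the exact formula in a real patch could well differ.

    /* Migration "goodness": heavier CPU users gain more from being moved, while
     * tasks with a large node-local footprint (RSS) are more expensive to move. */
    static double migration_goodness(double cpu_percentage,
                                     long remote_rss, long local_rss)
    {
        long footprint_delta = remote_rss - local_rss;

        if (footprint_delta <= 0)
            return 0.0;   /* most memory is already local - leave the task alone */

        return cpu_percentage / (double)footprint_delta;
    }
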
<mbligh> OK ... enough scheduler ... just a little bit on IO.
<mbligh> If we have a SAN (storage area network) with an IO connection into it from each node (eg switched fibrechannel)
<mbligh> then it makes sense to use NUMA-aware MPIO (multi-path IO)
<mbligh> we simply try to route the IO traffic over the local IO interface, and receive the traffic back on that same interface.
<mbligh> That obviously cuts back a lot on the traffic over the main interconnect.
<mbligh> If an interface should go down (die) on the local node, we can always fall back to using the remote node's IO adaptor instead.
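
The path-selection policy is just "local adapter if it's alive, otherwise any working remote one"; a toy version with an invented adapter-status table:

    #include <stdbool.h>
    #include <stdio.h>

    #define NR_NODES 4

    /* Invented status table: node 2's adapter has failed. */
    static bool adapter_alive[NR_NODES] = { true, true, false, true };

    /* Pick the node whose adapter should carry this IO. */
    static int pick_io_path(int issuing_node)
    {
        if (adapter_alive[issuing_node])
            return issuing_node;              /* local path: no interconnect hop */
        for (int n = 0; n < NR_NODES; n++)
            if (adapter_alive[n])
                return n;                     /* degraded, but the IO still completes */
        return -1;                            /* no working adapter anywhere */
    }

    int main(void)
    {
        printf("IO from node 2 routed via node %d\n", pick_io_path(2));   /* node 0 */
        return 0;
    }
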
<mbligh> On the other hand, many machines (especially the AMD64 boxes) don't have this kind of setup
<mbligh> instead, most IO is just connected to one node
<mbligh> to cope well with that, we really need to try to hook into the scheduler, and run tasks that are IO bound on the node closest to the IO resources they're accessing
<mbligh> and run the CPU bound jobs, etc, on the other nodes.
<mbligh> we don't have support for that in Linux yet, but may well add it during 2.6.
<mbligh> Oh, and the same principles apply here for both network IO and disk IO
<mbligh> though network IO is a little more complex, as we have to cope with what to do about IP addresses for the machine, etc.
<mbligh> The only remaining section I'll cover is just to touch on the userspace APIs a little
<mbligh> that's what I referred to as "stage 3" above.
<mbligh> we present some information on the topology of the NUMA machine via sysfs
<mbligh> eg the groupings of which CPUs are on which node (and which nodes contain which CPUs)
<mbligh> we also present meminfo on a per-node basis (a la /proc/meminfo)
<mbligh> so you can monitor how much free/allocated memory there is on which node, and what it's allocated to
<mbligh> there are also mappings for which PCI buses are on which node,
<mbligh> and an out-of-tree set of patches to allow users to specify which node memory should be allocated from (not finished yet, but it will be in the next few months)
<mbligh> Andi Kleen and Matt Dobson are working on that.
<mbligh> we can also use the sys_sched_affinity stuff from Robert Love to bind processes to specific CPUs, and hence to nodes.
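
For example, binding the current process to one CPU (and hence to that CPU's node) with the affinity interface mentioned above looks roughly like this; which CPU belongs to which node would normally be read from the sysfs topology files, and CPU 0 here is only an example.

    #define _GNU_SOURCE
    #include <sched.h>
    #include <stdio.h>

    int main(void)
    {
        cpu_set_t mask;

        CPU_ZERO(&mask);
        CPU_SET(0, &mask);   /* a CPU known (from sysfs topology) to sit on the target node */

        if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {   /* 0 = this process */
            perror("sched_setaffinity");
            return 1;
        }
        printf("now running only on CPU 0, ie on its node\n");
        return 0;
    }
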
<mbligh> OK ... that's about it ... questions?
<mbligh> <sydb> If/when you're taking questions: I'm interested how Linux NUMA compares with proprietary NUMA (e.g. AIX)... also is NUMA a stopgap while "real" SMP scales up cheaply? Who is using NUMA on Linux and why?
<mbligh> I think the main difference between OSes like AIX and Sequent's PTX and Linux (in terms of NUMA support) is that we don't have much of the explicit stuff for userspace done
<mbligh> I'm not sure the explicit stuff is that important, though - we'd have to get big applications like DB2 and Oracle to port to use it
<mbligh> I'd prefer to have the OS make sensible decisions for the applications where possible
<mbligh> the scheduling stuff in other OSes is probably also a lot more sophisticated
<mbligh> we need to do a better job of things like task groupings - run threads of the same process on the same node, where possible.
<mbligh> As to NUMA being a stopgap whilst SMP scales up ...
<mbligh> No, not really - there are some fundamental limitations that make it impossible.
<mbligh> as the bus gets longer (adding more CPUs and memory), you just can't run it as fast
<mbligh> that's physics ... not much we can do to fix it ;-)
<mbligh> so we end up breaking it into multiple smaller busses, with interconnects between them
<mbligh> ... who is using NUMA on Linux and why?
<mbligh> mainly people who want really large machines like database servers - we've sold a lot of our x440 platform (ia32 based).
<mbligh> it's a cost-effective way to get a larger machine
<mbligh> it'd be a damned sight easier to code for if we had a 64 bit chip though ;-)
<mbligh> <sydb> are there enough people working on NUMA in Linux to catch up with the competition? Is it worth more people getting involved?
<mbligh> Firstly, Linux has no competition ;-)
<mbligh> But seriously ... not really. we need people testing on multiple different architectures
<mbligh> I'd like to see more work on the AMD stuff, especially for the scheduler.
<mbligh> on a 1 CPU per node system, the concept of migrating tasks "within" a node makes little sense
<mbligh> Erich Focht has patches to fix it up ... but they've been around for 3 months or so, and nobody has tested them. rather sad.
<mbligh> <sydb> can you do NUMA with normal non-NUMA hardware (emulate over ethernet?) thus open development up?
<mbligh> <sydb> lol
<mbligh> Not really - part of what's really complex about NUMA is to make a machine that can do *transparent* access to remote memory, AND keep it cache coherent
<mbligh> cache coherency is the really hard part - if I write to mem location X from a CPU on node 1, and then read that mem location on node 2, it has to get the new value to work properly
<mbligh> I'm very interested in what's called SSI clustering - "single system image clustering" - that's a single OS image running across a traditional cluster.
<mbligh> which is, I think, what you're thinking of, but we don't normally call it NUMA.
<mbligh> however, it's a REALLY hard problem to solve well ;-)
<mbligh> <athkhz> How about NUMA in other architectures like PPC970?
<mbligh> There are no NUMA PPC970 boxes that I know of
<mbligh> however, I'd love to see one - it'd make a great architecture ;-)
<mbligh> please send me one when you make it.
<mbligh> However ... the PPC970 is the little sister of the Power4+ chip
<mbligh> the Regatta (P690) is some of IBM's highest-end hardware, and is based on Power4+, and that is NUMA
<mbligh> it's not marketed as NUMA, but it is.
<mbligh> So ... we almost have what you want ;-)
<mbligh> <clsk> What's your personal opinion about microkernels vs. conventional kernels? (i'm sorry if i'm going a little bit too much off topic)
<mbligh> I'm pretty much of the same opinion as I think Linus is here - microkernels are a nice idea, but too inefficient in practice. Conventional kernels can work IF you exercise good programming discipline
<mbligh> it's a bit like OO programming - it tries to force you into a stricter model, but you *can* write well-modularised code in C if you try hard.
<mbligh> <Arador> Why is it said that Athlons are somewhat close to a "small NUMA" system?
<mbligh> I guess that's in two senses ... small as in they're normally low CPU count (the interconnect can only cope with up to 8x unless someone makes a switch)
<mbligh> and small as in they have a low ratio of memory access latency (as described above)
<mbligh> NUMA-Q is about 10:1 ... x440 is about 4:1 ... x86_64 is about 1.6:1 to 2.5:1 depending on how big the system is
<mbligh> <pepita> does OpenMosix have something to do with this?
<mbligh> I presume that was in reference to the SSI stuff - yes. OpenMosix is an implementation of SSI.
<mbligh> there's also OpenSSI. I don't really like either of them though ;-)
<mbligh> <sydb> the lower the ratio the better, no?
<mbligh> Very roughly speaking ... yes.
<mbligh> However, as Sun has ably shown in the past ... we can achieve 1:1 by just slowing down local access.
<mbligh> the whole point of NUMA isn't really to give you slower remote memory - it's to give you *faster* local access
<mbligh> I'd prefer local:remote of 80ns:160ns to 120ns:150ns, even though the ratio of the latter looks better.
<mbligh> <pepita> why don't you like SSI clusters? what's "wrong" with them?
<mbligh> I *do* like SSI clusters in general ... just not those implementations ;-)
<mbligh> Lots of heavy, invasive code-rewrite stuff ...
<mbligh> too much task migration between nodes, in some implementations. It's a long topic to get into ;-)
<mbligh> <clsk> Armadillo asks: what type of interconnect does NUMA use in a hypercube network, and does it use some special network protocol, or is it all managed by the hardware?
<mbligh> There's no specific interconnect used ... the NUMA-Qs that I work with a lot used something called SCI
<mbligh> but basically, it's all transparently managed by the hardware, so it doesn't matter much to the OS.
<mbligh> you could use ethernet as the transport if needed.
<mbligh> but you'd need much more than a standard ethernet adaptor to do transparent remote memory access and cache coherency ;-)
<mbligh> OK ... any more questions?
<mbligh> I'll let the translators catch up ;-) OK ... that's all folks ;-).
<krocz> thanks mbligh
<andrew> Great talk... cheers...
<sydb> thanks mbligh, clap clap clap clap clap
<EMPE[log]> plas plas plas plas plas plas plas
<MJaway> clap clap clap clap clap clap clap clap clap clap
<MJaway> clap clap clap clap clap clap clap clap clap clap
