Borja | Next up is Martin Bligh, who is going to talk to us about VM and NUMA
Borja | He is one of the authors of the 2.6 kernel, along with Linus Torvalds
Borja | He works at IBM.
Borja | The talk will be translated into Spanish in #redes and into Dutch in #taee
Borja | Now, we are going to hear from Martin Bligh. He is one of the authors of the Linux 2.6 kernel, and he works for IBM.
Borja | This talk will be translated into Spanish in #redes, and into Dutch in #taee
MJesus | Please post questions and comments in #qc
Borja | Welcome, Martin |
Borja | :-) |
mbligh | I'm going to focus on NUMA machines, and on the 2.6 kernel. |
mbligh | Thanks - to keep this to a reasonable time limit, I'm going to make some simplifications - nothing too important, but feel free to ask questions if there's something I'm skipping over. |
mbligh | NUMA = non-uniform memory architecture |
mbligh | where we basically have an SMP machine with non-uniform characteristics - memory, cpus, and IO buses are not equally spaced from each other. |
mbligh | The reason for this is that it's very hard (and expensive) to build a flat uniform SMP machine of more than 4 CPUs or so
mbligh | once you start to build a machine that big, you start to have to slow down the system buses, etc to cope with the size |
mbligh | you can get a larger, faster machine by having some sets of resources closer to each other than others. |
mbligh | If you don't make that decision, basically you end up with a machine where *everything* becomes slower, instead of just a few resources
mbligh | We normally define those groupings as "nodes" - a typical machine might have 4 CPUs per node, some memory and some IO buses on each node
mbligh | There are now newer architectures, like AMD's x86_64 that have one cpu per node, and local memory for each processor. |
mbligh | with the advent of that, we have commodity NUMA systems at much lower prices ($2000 or so)
mbligh | and much greater interest in the technology in the marketplace.
mbligh | Often, machines that are slightly non-uniform (ie slightly NUMA) are sold as SMP for simplicity's sake. |
mbligh | Large machines from companies like SGI now have 512 CPUs or more. |
mbligh | It might help to envisage the machine as a group of standard SMP machines, connected by a very fast interconnect |
mbligh | somewhat like a network connection, except that the transfers over that bus are transparent to the operating system |
mbligh | Indeed, some earlier systems were built exactly like that - the older Sequent NUMA-Q hardware uses a standard 450NX 4x chipset, with a "magic" NUMA interface plugged into the system bus of each node to interconnect them, and pass traffic between them.
mbligh | The traditional measure of how NUMA a machine is (how non-uniform) is to take a simple ratio of the memory latency to access local memory vs remote memory |
mbligh | ie if I can do a local memory access (memory on same node as CPU) in 80ns, and a remote memory access (cpu on different node from memory) in 800ns, the ratio is expressed as 10:1 |
mbligh | Unfortunately, that's only a very approximate description of the machine
mbligh | and doesn't take into account lots of important factors, such as the bandwidth of the interconnect |
mbligh | once the interconnect starts to become contended, that 800ns could easily become 8000ns. |
mbligh | thus it's very important for us to keep accesses local wherever possible. |
mbligh | Often, we're asked why people don't use clusters of smaller machines, instead of a large NUMA machine |
mbligh | indeed, that would be much cheaper, for the same amount of CPU horsepower. |
mbligh | unfortunately, it makes the application's work much harder - all of the intercommunication and load balancing now has to be more explicit, and more complex
mbligh | some large applications (eg. database servers) don't split up onto multiple cluster nodes easily |
mbligh | in those sort of situations, people often use NUMA machines. |
mbligh | Another nice effect of using a NUMA machine is that the balancing problems, etc are solved once in the operating system, instead of repeatedly in every application that runs on it. |
mbligh | We also abstract the hardware knowledge down into the OS, so that the applications become more portable. |
mbligh | There are several levels of NUMA support we could have: |
mbligh | 1. Pretend that the hardware is SMP, ignoring locality and NUMA characteristics |
mbligh | 2. Implicit support - the OS tries to use local resources where possible, and group applications and relevant resources together as closely as possible. |
mbligh | 3. Explicit support - provide an API to userspace, whereby the application can specify (in some abstracted fashion) to the OS what it wants |
mbligh | in the Linux 2.6.0 kernel, we're really at the start of stage 2. We've got some very basic support for bits of stage 3.
mbligh | Stage 1 does work, but doesn't get very good performance |
mbligh | that's what I did when I first ported Linux to the Sequent NUMA-Q platform |
mbligh | The first step to NUMA support is to try to allocate memory local to the CPU that the application is running on. |
mbligh | we do that by default in Linux whenever the kernel calls __alloc_pages (the main memory allocator) |
mbligh | if we run out of memory on the local node, it automatically falls back to get memory from another node, it's just slower. |
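
As an illustration of the node-local-first allocation policy just described, here is a minimal C sketch. The structures and helpers (`nodes[]`, `cpu_to_node()`) are invented for the example; this is not the kernel's `__alloc_pages()` code, just the fallback idea it implements.

```c
/*
 * Minimal sketch of node-local allocation with fallback: try the node
 * the CPU lives on first, then fall back to the other nodes.  All of
 * the types and helpers here are invented for the illustration.
 */
#define MAX_NODES     4
#define CPUS_PER_NODE 4

struct node_mem {
    unsigned long free_pages;
};

static struct node_mem nodes[MAX_NODES];

/* assumed topology: CPUs 0-3 on node 0, 4-7 on node 1, ... */
static int cpu_to_node(int cpu)
{
    return cpu / CPUS_PER_NODE;
}

static int try_alloc_on(int node)
{
    if (nodes[node].free_pages == 0)
        return -1;              /* this node is out of memory */
    nodes[node].free_pages--;   /* "allocate" one page here */
    return node;
}

int alloc_page_preferring_local(int cpu)
{
    int local = cpu_to_node(cpu);
    int i;

    /* fast path: memory on the same node as the CPU */
    if (try_alloc_on(local) >= 0)
        return local;

    /* slow path: fall back to remote nodes - still works, just slower */
    for (i = 1; i < MAX_NODES; i++) {
        int node = (local + i) % MAX_NODES;
        if (try_alloc_on(node) >= 0)
            return node;
    }
    return -1;                  /* out of memory everywhere */
}
```
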
mbligh | NUMA support is enabled by CONFIG_NUMA, which depends on CONFIG_DISCONTIGMEM |
mbligh | though the memory may actually not be discontiguous (as 'discontigmem' suggests) - that's just historical
mbligh | So instead of having 3 memory zones for the system (eg ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM) we now end up with 3 zones for each node ... though many of them often end up empty (no memory in them).
mbligh | For each physical page of memory in a Linux system we have a 'struct page' control entry, which we group into an array called mem_map, which is one contiguous block of memory
mbligh | on NUMA systems, we break that into one array per node (lmem_map[node]) |
mbligh | but we still essentially have one struct page per physical page of memory |
mbligh | breaking that array up gives us several advantages - one is code simplicity |
mbligh | another is that we can allocate those control structures from the node's own memory, improving access times.
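
A rough sketch of the per-node layout just described: three zones per node instead of three global zones, plus a per-node `lmem_map`-style array of `struct page`. The names and fields below are simplified stand-ins for the real 2.6 `pglist_data`/`zone` structures, not the actual definitions.

```c
/*
 * Simplified model of per-node zones and per-node page arrays.
 * Real 2.6 kernels keep this in struct pglist_data / struct zone;
 * the names and fields here are cut down for illustration.
 */
#include <stddef.h>

enum zone_type { ZONE_DMA, ZONE_NORMAL, ZONE_HIGHMEM, NR_ZONES };

struct page {                       /* one entry per physical page */
    unsigned long flags;
    int count;
};

struct zone {
    unsigned long nr_pages;         /* may be zero: an empty zone */
};

struct node_data {
    struct zone zones[NR_ZONES];    /* per-node DMA/NORMAL/HIGHMEM */
    struct page *lmem_map;          /* per-node array of struct page */
    unsigned long start_pfn;        /* first page frame on this node */
    unsigned long node_pages;       /* number of pages on this node */
};

/* look up the struct page for a physical page frame number */
static struct page *node_pfn_to_page(struct node_data *node,
                                     unsigned long pfn)
{
    if (pfn < node->start_pfn ||
        pfn >= node->start_pfn + node->node_pages)
        return NULL;                /* pfn belongs to another node */
    return &node->lmem_map[pfn - node->start_pfn];
}
```
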
mbligh | On a typical system, we might have 16GB of RAM, and 4 nodes (each node has 4GB of memory) |
mbligh | that ends up as physical address ranges 0-4GB on node 0, 4-8GB on node 1, 8-12GB on node 2, and 12-16GB on node 3
mbligh | on a 32 bit system, that presents a bit of a problem - all of the kernel's permanent memory is allocated from the first GB of physical RAM |
mbligh | unfortunately, that turns out to be rather hard to fix - many drivers assume that the physical addresses for the kernel memory (ZONE_NORMAL) will fit into an unsigned long (32 bits) |
mbligh | so we can't easily spread that memory around the system ... that's a performance problem that we still have |
mbligh | Some things (eg the lmem_map arrays) are relocated by some special hacks at boot time, but most of the structures (eg entries for the dcache, and inode cache) all have to reside on node 0 still.
mbligh | One of the other things we do to reduce cross-node traffic is that instead of one global swap daemon (kswapd) to do page reclaim, we have one daemon per node, each scanning just its own node's pages.
mbligh | that gives us a lot better performance, and lower interconnect traffic during memory reclaim.
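
A small pthread illustration of the "one reclaim daemon per node" idea, with each thread looking only at its own node; this is a userspace sketch of the structure of the approach, not the kernel's kswapd code.

```c
/* One reclaim thread per node, each scanning only its own node's
 * pages - a userspace illustration of per-node kswapd, not kernel code. */
#include <pthread.h>
#include <stdio.h>

#define NR_NODES 4

static void *node_kswapd(void *arg)
{
    long nid = (long)arg;
    /* the real daemon would loop here, scanning this node's LRU lists
       and reclaiming pages whenever the node runs low on memory */
    printf("kswapd%ld: scanning pages on node %ld only\n", nid, nid);
    return NULL;
}

int main(void)
{
    pthread_t tid[NR_NODES];
    long nid;

    for (nid = 0; nid < NR_NODES; nid++)
        pthread_create(&tid[nid], NULL, node_kswapd, (void *)nid);
    for (nid = 0; nid < NR_NODES; nid++)
        pthread_join(tid[nid], NULL);
    return 0;
}
```
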
mbligh | we can also replicate copies of read-only data to each node. |
mbligh | for instance, we can copy the kernel binary to each node, and have the CPUs on each node only map into their own local copy - this uses only a little extra memory, and saves lots of interconnect bandwidth |
mbligh | Dave Hansen has also created a patch to replicate read only user data (eg the text of glibc, and programs like gcc) to each node, which creates a similar benefit. |
mbligh | that gave us a 5% - 40% performance increase, depending on what other patches we were using together with it, and what benchmark we ran.
mbligh | replicating read-write data would be difficult (keeping multiple copies in sync on write) and probably not worth the benefit. |
mbligh | so for now, we'll just do read-only data. |
mbligh | The 2.6 VM also has per-node LRU lists (least recently used lists of which memory pages have been accessed recently) |
mbligh | instead of a global list. |
mbligh | not only does this give us more localized access to information, but also allows us to break up pagemap_lru_lock |
mbligh | which is the lock controlling the LRU lists - before we broke that up, we were spending 50% of the system time during a kernel compile just spinning waiting for that one lock |
mbligh | once it was broken up, the time is so small it's now unmeasurable. |
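
To show why splitting the lock helps, here is a sketch of per-node LRU lists, each guarded by its own lock (a pthread mutex stands in for the kernel spinlock), so CPUs working on different nodes no longer contend on a single global pagemap_lru_lock. The structure names are illustrative, not the real kernel ones.

```c
/* Per-node LRU lists, each with its own lock, instead of one global
 * pagemap_lru_lock.  Illustrative userspace sketch. */
#include <pthread.h>

#define NR_NODES 4

struct page_list {
    struct page_list *next, *prev;
};

struct node_lru {
    pthread_mutex_t lru_lock;       /* used to be one global lock */
    struct page_list active_list;   /* recently used pages */
    struct page_list inactive_list; /* reclaim candidates */
};

static struct node_lru lru[NR_NODES];

static void list_del(struct page_list *p)
{
    p->prev->next = p->next;
    p->next->prev = p->prev;
}

static void list_add(struct page_list *p, struct page_list *head)
{
    p->next = head->next;
    p->prev = head;
    head->next->prev = p;
    head->next = p;
}

void lru_init(void)
{
    int nid;

    for (nid = 0; nid < NR_NODES; nid++) {
        pthread_mutex_init(&lru[nid].lru_lock, NULL);
        lru[nid].active_list.next = &lru[nid].active_list;
        lru[nid].active_list.prev = &lru[nid].active_list;
        lru[nid].inactive_list.next = &lru[nid].inactive_list;
        lru[nid].inactive_list.prev = &lru[nid].inactive_list;
    }
}

/* CPUs on different nodes take different locks, so they no longer
 * spin waiting for each other during page aging and reclaim */
void mark_page_active(int nid, struct page_list *page)
{
    pthread_mutex_lock(&lru[nid].lru_lock);
    list_del(page);
    list_add(page, &lru[nid].active_list);
    pthread_mutex_unlock(&lru[nid].lru_lock);
}
```
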
mbligh | OK ... that's most of the big VM stuff ... scheduler next. |
mbligh | there's not much point in carefully allocating node-local memory to a process if we then instantly migrate that process to another node
mbligh | So we took the basic O(1) scheduler in 2.6, and changed it a little |
mbligh | with the O(1) scheduler, there's 1 runqueue of tasks per CPU
mbligh | in flat SMP mode, each CPU runs tasks just from its own runqueue, and we occasionally balance between different runqueues |
mbligh | but on a NUMA system, we don't want to migrate tasks between nodes if possible |
mbligh | we want to keep them on the local node - to keep caches warm and memory local. |
mbligh | So we change the standard balancing algorithm to only balance between the runqueues of CPUs on the same node as each other |
mbligh | and we add another balancing algorithm, which is much more conservative, to balance tasks between nodes. |
mbligh | So we rarely balance tasks between nodes. |
mbligh | However, at the exec() time of a process, it has very little state (we've effectively just asked to overwrite it with a new program)
mbligh | so at exec time, we also do a cross-node rebalance, and throw the exec'ed task to the most lightly loaded node |
mbligh | that actually does a lot of our balancing for us, and is nearly free to do. |
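
A sketch of that exec-time step: since the task has almost no cache or memory footprint at exec(), it is cheap to place it on whichever node is currently the most lightly loaded. The load metric (sum of the runqueue lengths of each node's CPUs) is an assumption made for the illustration.

```c
/* Pick the most lightly loaded node for a freshly exec'ed task.
 * 'load' holds an assumed per-node load metric, e.g. the sum of the
 * runqueue lengths of that node's CPUs. */
#define NR_NODES 4

int pick_node_for_exec(const unsigned int load[NR_NODES])
{
    int nid, best = 0;

    for (nid = 1; nid < NR_NODES; nid++)
        if (load[nid] < load[best])
            best = nid;

    return best;    /* queue the exec'ed task on this node's CPUs */
}
```
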
mbligh | The code that's in 2.6 to support NUMA scheduling is still fairly basic, and needs lots more work. |
mbligh | My main plan for the future is to keep track of the RSS (resident set size of memory) of each task on a per-node basis, as well as global |
mbligh | then we can use that information to make better decisions - if most of a process's memory is on node X, we should try to migrate that process to node X
mbligh | To get good rebalancing, we also want to take into account how much CPU the task is using - it's going to have more effect to rebalance a task if it's using more CPU. |
mbligh | but it's cheaper to migrate if it has a smaller cache footprint (which we approximately measure by RSS) |
mbligh | So we end up with a "goodness" calculation for migration that's something like "cpu_percentage/(remote_rss - local_rss)" |
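
Written out as code, that rough "goodness" heuristic might look like the sketch below. It follows the formula exactly as quoted; the guard against a non-positive denominator is my addition, and none of this is merged kernel code, just the planned idea.

```c
/* "Goodness" of migrating a task, following the rough formula
 * cpu_percentage / (remote_rss - local_rss).  A higher score means
 * the migration looks more worthwhile: more CPU usage raises it,
 * while a larger resident-set term lowers it. */
static long migration_goodness(long cpu_percentage,
                               long remote_rss,  /* task's RSS on remote nodes */
                               long local_rss)   /* task's RSS on the local node */
{
    long rss_diff = remote_rss - local_rss;

    if (rss_diff <= 0)
        return 0;   /* guard (not in the quoted formula): most memory
                       is already local, so don't bother migrating */

    return cpu_percentage / rss_diff;
}
```
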
mbligh | OK ... enough scheduler ... just a little bit on IO. |
mbligh | If we have a SAN (storage area network) with an IO connection into it from each node (eg switched fibrechannel)
mbligh | then it makes sense to use NUMA-aware MPIO (multi-path IO) |
mbligh | we simply try to route the IO traffic over the local IO interface, and receive the traffic back on that same interface. |
mbligh | That obviously cuts back a lot on the traffic over the main interconnect. |
mbligh | If an interface should go down (die) on the local node, we can always fall back to using the remote node's IO adaptor instead. |
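
A sketch of that path-selection rule: prefer an adaptor on the issuing node, and fall back to a working remote adaptor if the local one has died. The types and fields are invented for the example, not a real MPIO driver interface.

```c
/* NUMA-aware multi-path IO path selection: local adaptor first,
 * any working remote adaptor as a fallback.  Illustrative only. */
#include <stddef.h>

struct io_path {
    int node;       /* node this adaptor is attached to */
    int online;     /* nonzero if the path is currently usable */
};

struct io_path *select_path(struct io_path paths[], int nr_paths,
                            int issuing_node)
{
    int i;

    /* first choice: a working path on the local node */
    for (i = 0; i < nr_paths; i++)
        if (paths[i].online && paths[i].node == issuing_node)
            return &paths[i];

    /* fallback: any working path, even across the interconnect */
    for (i = 0; i < nr_paths; i++)
        if (paths[i].online)
            return &paths[i];

    return NULL;    /* no usable path at all */
}
```
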
mbligh | On the other hand, many machines (especially the AMD64 boxes) don't have this kind of setup |
mbligh | instead, most IO is just connected to one node |
mbligh | to cope well with that, we really need to try to hook into the scheduler, and run tasks that are IO bound on the node closest to the IO resources they're accessing |
mbligh | and run the CPU bound jobs, etc, on the other nodes. |
mbligh | we don't have support for that in Linux yet, but may well add it during 2.6. |
mbligh | Oh, and the same principles apply here for both network IO and disk IO |
mbligh | though network IO is a little more complex, as we have to cope with what to do for IP addresses for the machine, etc. |
mbligh | The only remaining section I'll cover is just to touch on the userspace APIs a little |
mbligh | that's what I referred to as "stage 3" above. |
mbligh | we present some information on the topology of the NUMA machine via sysfs |
mbligh | eg the groupings of which CPUs are on which node (and which nodes contain which CPUs) |
mbligh | we also present meminfo on a per node basis (a la /proc/meminfo) |
mbligh | so you can monitor how much free/allocated memory there is on which node, and what it's allocated to |
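
As a usage example, the per-node memory information can be read straight out of sysfs; the path below is the layout used by 2.6-era kernels (assuming sysfs is mounted on /sys), so adjust the node number and path for your system.

```c
/* Dump the per-node memory statistics for node 0 from sysfs. */
#include <stdio.h>

int main(void)
{
    char line[256];
    FILE *f = fopen("/sys/devices/system/node/node0/meminfo", "r");

    if (!f) {
        perror("node0 meminfo");
        return 1;
    }
    while (fgets(line, sizeof(line), f))
        fputs(line, stdout);    /* free/used memory on node 0 */
    fclose(f);
    return 0;
}
```
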
mbligh | there's also mappings for which PCI buses are on which node, |
mbligh | and an out-of-tree set of patches to allow users to specify which node memory should be allocated from (not finished yet, but will be in the next few months) |
mbligh | Andi Kleen and Matt Dobson are working on that. |
mbligh | we can also use the sched_setaffinity stuff from Robert Love to bind processes to specific CPUs, and hence to nodes.
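
For example, a process can pin itself to the CPUs of one node with the affinity call mentioned above. The sketch uses the current glibc `cpu_set_t` interface (the wrappers around the 2.6.0 timeframe took slightly different arguments), and the assumption that CPUs 0-3 belong to node 0 is hard-coded; the real mapping should come from the sysfs topology files.

```c
/* Bind the calling process to the CPUs of "node 0" (assumed here to
 * be CPUs 0-3) so its scheduling and memory stay node-local. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t mask;
    int cpu;

    CPU_ZERO(&mask);
    for (cpu = 0; cpu < 4; cpu++)       /* assumption: node 0 = CPUs 0-3 */
        CPU_SET(cpu, &mask);

    if (sched_setaffinity(0, sizeof(mask), &mask) != 0) {
        perror("sched_setaffinity");
        return 1;
    }

    /* from here on, the process runs only on node 0's CPUs, so its
       node-local memory allocations stay local */
    return 0;
}
```
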
mbligh | OK ... that's about it ... questions? |
mbligh | <sydb> If/when you're taking questions: I'm interested how Linux Numa compares with proprietary numa (e.g. AIX)... also is Numa a stopgap while "real" SMP scales up cheaply? Who is using Numa on Linux and why? |
mbligh | I think the main difference between Linux and OSes like AIX and Sequent's PTX (in terms of NUMA support) is that we don't have much of the explicit stuff for userspace done
mbligh | I'm not sure that's so important, though - for the explicit stuff, we have to get big applications like DB2 and Oracle to port to use it
mbligh | I'd prefer to have the OS make sensible decisions for the applications where possible |
mbligh | the scheduling stuff in other OSes is probably also a lot more sophisticated
mbligh | we need to do a better job of things like task groupings - run threads of the same process on the same node, where possible.
mbligh | As to NUMA being a stopgap whilst SMP scales up .... |
mbligh | No, not really - there are some fundamental limitations that make it impossible.
mbligh | as the bus gets longer (adding more CPUs and memory), you just can't run it as fast |
mbligh | that's physics ... not much we can do to fix it ;-) |
mbligh | so we end up breaking into multiple smaller busses, and interconnects between them |
mbligh | ... who is using NUMA on Linux and why? |
mbligh | mainly people who want really large machines like database servers - we've sold a lot of our x440 platform (ia32 based). |
mbligh | it's a cost-effective way to get a larger machine |
mbligh | it'd be a damned sight easier to code for if we had a 64 bit chip though ;-) |
mbligh | <sydb> are there enough people working on Numa in Linux to catch up with the competition? Is it worth more people getting involved? |
mbligh | Firstly, Linux has no competition ;-) |
mbligh | But seriously ... not really. We need people testing on multiple different architectures
mbligh | I'd like to see more work on the AMD stuff, especially for the scheduler. |
mbligh | on a 1 cpu per node system, the concept of migrating tasks "within" a node makes little sense |
mbligh | Erich Focht has patches to fix it up ... but they've been around for 3 months or so, and nobody has tested them. rather sad. |
mbligh | <sydb> can you do Numa with normal non-numa hardware (emulate over ethernet?) thus open development up? |
mbligh | <sydb> lol |
mbligh | Not really - part of what's really complex about NUMA is to make a machine that can do *transparent* access to remote memory, AND keep it cache coherent |
mbligh | cache coherency is the really hard part - if I write to mem location X from a cpu on node 1, and then read that mem location on node 2, it has to get the new value to work properly |
mbligh | I'm very interested in what's called SSI clustering - "single system image clustering" - that's a single OS image running across a traditional cluster. |
mbligh | which is, I think, what you're thinking of, but we don't normally call it NUMA. |
mbligh | however, it's a REALLY hard problem to solve well ;-) |
mbligh | <athkhz> How about NUMA in other architectures like PPC970? |
mbligh | There are no NUMA PPC970 boxes that I know of
mbligh | however, I'd love to see one - it'd make a great architecture ;-) |
mbligh | please send me one when you make it. |
mbligh | However ... the PPC970 is the little sister of the Power4+ chip
mbligh | the Regatta (P690) is some of IBM's highest end hardware, and is based on Power4+, and that is NUMA |
mbligh | it's not marketed as NUMA, but it is. |
mbligh | So ... we almost have what you want ;-) |
mbligh | <clsk> What's your personal opinion about microkernels vs. conventional kernels? (i'm sorry if i'm going a little bit too much off the topic) |
mbligh | I'm pretty much of the same opinion as I think Linus is here - microkernels are a nice idea, but too inefficient in practice. Conventional kernels can work IF you exercise good programming discipline
mbligh | it's a bit like OO programming - tries to force you into a stricter model, but you *can* write well modularised code in C if you try hard. |
mbligh | <Arador> Why is it said that Athlons are somewhat close to a "small NUMA" system?
mbligh | I guess that's in two senses ... small as in they're normally low CPU count (the interconnect can only cope with up to 8x unless someone makes a switch) |
mbligh | and small as in they have a low ratio of memory access latency (as described above) |
mbligh | NUMA-Q is about 10:1 ... x440 is about 4:1 .... x86_64 is about 1.6:1 to 2.5:1 depending on how big the system is |
mbligh | <pepita> has openmosix something to do with this? |
mbligh | I presume that was in reference to the SSI stuff - yes. OpenMosix is an implementation of SSI. |
mbligh | there's also OpenSSI. I don't really like either of them though ;-) |
mbligh | <sydb> the lower the ratio the better, no? |
mbligh | Very roughly speaking ... yes. |
mbligh | However, as Sun has ably shown in the past ... we can achieve 1:1 by just slowing down local access.
mbligh | the whole point of NUMA isn't really to give you slower remote memory - it's to give you *faster* local access |
mbligh | I'd prefer local:remote of 80ns:160ns to 120ns:150ns, even though the ratios on the latter look better. |
mbligh | <pepita> why don't you like ssi clusters? what's wrong with them?
mbligh | I *do* like SSI clusters in general ...just not those implementations ;-) |
mbligh | Lots of heavy invasive code rewrite stuff .... |
mbligh | too much task migration between nodes, in some implementations. It's a long topic to get into ;-) |
mbligh | <clsk> Armadillo asks: What type of interconnect does NUMA use in a hypercube network, and does it use some type of special network protocol, or is it all managed by the hardware?
mbligh | There's no specific interconnect used ... the NUMA-Qs that I work with a lot used something called SCI |
mbligh | but basically, it's all transparently managed by the hardware, so it doesn't matter to the OS much.
mbligh | you could use ethernet as the transport if needed. |
mbligh | but you'd need much more than a standard ethernet adaptor to do transparent remote memory access and cache coherency ;-) |
mbligh | OK ... any more questions? |
mbligh | I'll let the translators catch up ;-) OK ... that's all folks ;-). |
krocz | thanks mbligh |
andrew | Great talk... cheers... |
sydb | thanks mbligh, clap clap clap clap clap |
EMPE[log] | clap clap clap clap clap clap clap
MJaway | clap clap clap clap clap clap clap clap clap clap |