EMPEROR | Welcome one more time to UMEET 2003 |
---|---|
EMPEROR | a place to learn more about and understand the Unix system |
EMPEROR | As usual, the talk will be here, in this channel, and we have prepared |
EMPEROR | the channel #redes where a set of volunteers will translate from English |
EMPEROR | to Spanish. |
EMPEROR | A reminder: questions go in the #qc channel |
EMPEROR | Today, our next lecturer is William Lee Irwin III from the United States |
EMPEROR | He is a contributor to the Linux kernel and a Debian package maintainer, and is currently certified at Master level |
EMPEROR | William Lee Irwin III, AKA "wli" is one of the developers helping to implement a reverse mapping feature into the Linux kernel. |
EMPEROR | Mr. Irwin has been with us in every congress, and he holds a Bachelor of Science degree from Purdue University, where he majored in mathematics and computer science. |
EMPEROR | He will talk about: "What is pgcl?" so.. |
EMPEROR | We will start this presentation.. |
EMPEROR | wli.. |
wli | I would like to introduce the "pgcl" patch, and hopefully give an understanding of what it is, why it's useful, and a high-level idea of how it works. |
wli | First, "pgcl" is an abbreviation for "page clustering". I called it this because there is some precedent for the idea in older operating systems, though they've never implemented some of the aspects of it. |
wli | There are two important components to pgcl. |
wli | The first is changing the definition of a "page", and making a new distinction between the pages the kernel uses for allocation, and the pages that CPU hardware uses for address translation. |
wli | In pgcl, PAGE_SIZE now represents what the page allocator returns for a 0-order allocation, and what a struct page describes. |
wli | There is another notion of page size, MMUPAGE_SIZE, used by the VM and machine-specific code, that describes how large an area a pagetable entry maps and is often used to interpret hardware descriptions. |
wli | These two sizes are related, as PAGE_SIZE has to be a power-of-two multiple of MMUPAGE_SIZE. |
wli | There is a second component to pgcl which makes it different from the precedents in older operating systems. |
wli | This is that pgcl "preserves ABI". To explain what this means, I'll describe what the operating system that coined the term "page clustering" did, and compare what userspace sees in pgcl and in that system. |
wli | BSD, when originally ported to the VAX, ran on machines with very little memory. |
wli | The VAX was no different in that regard, but it had 512B pages, which created a very large problem with respect to memory footprint. |
wli | BSD used a 16B data structure for every page, called a cmap_t (analogous to Linux' struct page), and using one for every 512B of RAM was very expensive. |
wli | So it grouped hardware pages and pretended that was the pagesize. |
wli | A distinction between the two notions of page size was made, using macros with names similar to PAGE_SIZE and MMUPAGE_SIZE. |
wli | In fact, this survives to this day in some open source BSDs. |
wli | But this had an effect on userspace. Userspace before the change could mmap() on a 512B boundary. |
wli | But to run with that change, it had to be recompiled to use a 2KB or 4KB boundary. |
wli | This is where Hugh Dickins' work, and hence pgcl, are substantially different. Instead of simply using PAGE_SIZE everywhere in the VM, it teaches things that look at pagetables to allow fragments of the kernel's idea of a page to be mmap()'d and faulted in. |
wli | mmap() isn't difficult to understand. It just divides numbers by MMUPAGE_SIZE instead of PAGE_SIZE. |
wli | Fault handling is much more involved. Fault handlers find a pagetable and the MMUPAGE-aligned offset into the file the faulting address corresponds to. |
wli | They now have to be taught to be able to use the rest of the MMUPAGE_SIZE-sized fragments of the page so they don't waste memory. |
wli | There is a notion, similar to what pgcl does here, called "faultahead". In faultahead, you find nearby pagetable entries to fill in when a fault is taken, in the hope that you can avoid page faults on the surrounding virtual address space. |
wli | In pgcl, this is almost a requirement, at least for anonymous memory. Or it could be considered a performance requirement. |
wli | The difference here is that surrounding pagetable entries mean "the rest of the page" instead of "surrounding pages". |
wli | After all this effort, something wonderful happens. The benefits (which I've not described yet) have almost no cost. |
wli | Userspace can't see the difference at all when it does mmap(). |
wli | Applications don't have to be recompiled so that their sections (data, read-only data, text, and so on) fall on the kernel's software PAGE_SIZE (the page cluster) boundaries. |
wli | This is very important for legacy applications and for flexibility in the choice of PAGE_SIZE, since some boundary has to be chosen for linking executables. |
wli | So now that we know what pgcl is doing, let's talk about what it's trying to achieve. |
wli | Originally, the ancient BSD analogue of pgcl was used to reduce the kernel's memory footprint. pgcl can also be used for that. |
wli | Since what the per-page structures track is larger, there are fewer of them. |
wli | Linux' struct page is large. It's 40B on 32-bit machines, and 80B on 64-bit. Sometimes it's more, depending on certain architecture-specific options. |
wli | This is 2.5MB on a 256MB 32-bit machine, and 5MB on a 64-bit machine with the same amount of RAM. |
wli | (assuming 4KB PAGE_SIZE as in mainline) |
wli | On some machines, like ia32 PAE machines, the space for the per-page structures is constant, while total memory can be very large. |
wli | This is a very large memory footprint in absolute terms, though in relative terms it's only 1% or so. |
wli | For 64-bit machines the situation is worse because pointers are larger. |
wli | So, to get back the megabytes of swapspace, pgcl can be used. |
wli | However, the 2.4 work had very different goals. |
wli | In Linux, many filesystems are block-based, and use a core API based on structures describing the IO state of fragments of pages called buffer_heads. |
wli | These structures can only describe IO state (IO in progress, IO complete, IO not done, etc.) of pieces of memory smaller than the kernel's page size. |
wli | Or equal to it. |
wli | There is a general wish to use blocks as large as possible up to some limit in order to reduce the number of IO requests and "seeks" between different areas of a disk. |
wli | The reasoning there is that with large blocks, IO to an area of a file will be covered by a smaller number of areas of a disk, and so the drive won't have to wait for the platter to rotate. |
wli | Or at least not as much. |
wli | Now the use of the API that assumes blocks must be smaller than pages is an obstacle to doing this. |
wli | The approach pgcl takes is to enlarge the kernel's internal notion of a page's size, and then the blocks can be as large as that size. |
wli | I've been told the value is minor, but it also has the benefit of being able to read filesystems of larger blocksizes on hardware with smaller pagesizes than that blocksize. |
wli | Another benefit is an effect of something mentioned earlier, "faultahead". |
wli | As programs touch the memory they've requested via mmap(), they only fault in enough to fill in one pagetable entry at a time in mainline. |
wli | CPUs have been growing faster for straight-line execution, but things like jumps and exceptions have been growing in cost relative to straight-line execution paths. |
wli | Each fault to fill in a pte is one of those exceptions whose handling has been growing progressively slower. |
wli | This was not originally cited as a benefit, but as someone who generally follows what the VM is doing, I've tracked the fault counts for various programs as they execute. pgcl significantly reduces the number of faults taken. |
wli | There is yet another benefit, analogous to faultahead, which is very natural to expect. |
wli | Pages are kept in various places like the LRU, the pagecache's radix trees, and per-inode lists, and VM and filesystem algorithms do lookups and iterate over these structures. |
wli | It can take a lot of struct pages linked into these data structures to represent a given amount of RAM, which can make the trees very deep and the lists very long. |
wli | For instance, the radix trees have a branching factor of something like 64. |
wli | To represent all the memory cached for a 32MB file, it requires a 3-level radix tree in mainline. |
wli | With a 64KB PAGE_SIZE, it would only require 2 levels. |
wli | During writeback, say, for fsync(), the list of all pages attached to an inode is walked over. |
wli | With a 4KB PAGE_SIZE, writeback of a 32MB file would have to walk over a list of 8192 pages. |
wli | But with a 64KB PAGE_SIZE this is reduced by a factor of 16 to 512. |
wli | So, now that I've described all the expected benefits, what have I actually gotten out of it? |
wli | The answer is basically that the patch is still in a very immature state. |
wli | It has stability problems running some benchmarks, and there are bad algorithms that need to be replaced. |
wli | There is, however, good news. |
wli | Some versions, around 2.5.74 and 2.5.65, had better performance than mainline on tiobench; the reasons why this lead hasn't been maintained need investigation sometime after stability issues are addressed. |
wli | The original patch improved performance on a benchmark simulating a multiuser load by 5%. |
wli | The memory footprint benefits have been well-demonstrated. It's been ported to some ia32 PAE machines, and results demonstrating it improves PAE (resource) scalability have been posted to lkml. In fact, that was the first post ever of Linux running on a 64GB ia32 machine, and also doing so in a useful fashion. |
wli | So, in summary, pgcl is a way to pull a constant factor out of thin air from every algorithm in the VM with no cost in terms of ABI. I'm looking forward to bringing it to a state where it's generally useful soon. |
wli | That's a wrap! |
wli | Questions here or on #qc? |
wli | Any questions? |
EMPE[log] | maybe someone can say something .. any comment or question? this is the opportunity... |
EMPE[log] | Thanks, wli... |
EMPE[log] | it was very interesting .. |
EMPE[log] | we'll publish the logs for this talk as soon as possible on our website |
EMPE[log] | clap clap clap clap clap clap clap clap |
EMPE[log] | clap clap clap clap clap clap clap clap |
EMPE[log] | clap clap clap clap clap clap clap clap |
EMPE[log] | clap clap clap clap clap clap clap clap |
DeZ | clap clap clap clap clap |
DeZ | clap clap clap clap clap clap clap clap clap clap |
MJesus | there is applause! |
ducky | clap clap clap clap clap clapclap clapclap clap |
ducky | clap clap clap clap clap clap |
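
The figures wli quotes in the talk (2.5MB and 5MB of struct page for 256MB of RAM, a 3-level versus 2-level radix tree, and 8192 versus 512 pages on the per-inode list for a 32MB file) can be re-derived with a little arithmetic. The following is only a minimal sketch in C: the helper names are hypothetical and are not kernel functions, and the struct page sizes (40B and 80B) and the radix tree branching factor (64) are simply the values mentioned in the talk, passed in as parameters.

```c
#include <assert.h>
#include <stdint.h>

/* Bytes of per-page metadata needed to describe ram_bytes of RAM,
 * given the kernel's PAGE_SIZE and the size of one struct page.
 * (Hypothetical helper, for illustration only.) */
static uint64_t page_struct_footprint(uint64_t ram_bytes, uint64_t page_size,
                                      uint64_t struct_page_bytes)
{
    assert(page_size > 0);
    return (ram_bytes / page_size) * struct_page_bytes;
}

/* Number of PAGE_SIZE pages needed to cache a file, rounded up.
 * This is also the length of the per-inode page list walked at writeback. */
static uint64_t pages_for_file(uint64_t file_bytes, uint64_t page_size)
{
    assert(page_size > 0);
    return (file_bytes + page_size - 1) / page_size;
}

/* Levels a radix tree with the given fanout needs to index nr_pages slots.
 * A single level holds `fanout` slots; each extra level multiplies that. */
static unsigned radix_levels(uint64_t nr_pages, uint64_t fanout)
{
    unsigned levels = 1;
    uint64_t capacity = fanout;

    assert(fanout > 1);
    while (capacity < nr_pages) {
        capacity *= fanout;
        levels++;
    }
    return levels;
}
```

For example, `page_struct_footprint(256ULL << 20, 4096, 40)` gives 2621440 bytes (2.5MB), `pages_for_file(32ULL << 20, 4096)` gives 8192 while `pages_for_file(32ULL << 20, 65536)` gives 512, and `radix_levels()` on those two page counts with a fanout of 64 gives 3 and 2 levels respectively, matching the numbers in the talk.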