riel | ok ... |
---|---|
riel | it's good to be back at Umeet |
riel | this is now the 4th time I've participated at Umeet |
riel | and it's always been fun |
riel | today I'll talk about a few cool patches (and projects) that are almost ready to go into the 2.6 kernel |
riel | I don't have any text prepared and will just be talking "live" |
riel | so don't worry about interrupting me |
riel | you can ask questions at ANY time, in the #qc channel |
riel | there is also going to be a translation into Spanish in #redes |
riel | and into Dutch in #taee |
riel | I guess I should start by saying that the 2.6.0 kernel looks way better than the 2.0.0, 2.2.0 or 2.4.0 kernels |
riel | so you should all try the 2.6 kernel and report bugs to linux-kernel@vger.kernel.org ;)) |
riel | I'll be talking a bit about the following projects |
riel | - execshield |
riel | - 4/4 split |
riel | - CKRM (class-based kernel resource manager) |
riel | - memory hotplug |
riel | ... |
riel | the first patch I am going to talk about is security related |
riel | as you probably already know, the 2.6 kernel has some security improvements to help limit the damage that can be done through a security hole |
riel | for example, with selinux you can limit the damage that is done when sendmail is exploited AGAIN |
riel | or bind ;) |
riel | with exec-shield you can do another step towards making the system secure |
riel | basically the layout of your process memory gets changed a little bit |
riel | and data and the stack are by default not executable any more |
riel | this makes it a lot harder for a normal buffer overflow to turn into an exploit |
riel | on x86 CPUs the page tables do not have an executable permission bit, so exec-shield needs to do really ugly segmentation tricks |
riel | luckily most other CPUs Linux runs on, including AMD64, have executable bits in the page tables, so the segmentation tricks will no longer be needed in the future |
riel | <hans> riel: will this process not bring down the speed of the OS? |
riel | hans: yes, absolutely |
riel | segmentation makes the program run a little bit slower |
riel | however, the increased security will be worth the speed difference for some people |
riel | at this point I should also point out that exec-shield is NOT the most luxurious memory management change for security |
riel | PaX is probably a lot more flexible |
riel | but at the cost of more overhead than exec-shield |
riel | I suspect that exec-shield will be a reasonable compromise between extra security and performance for most people |
riel | it goes together with some tricks like randomising the start address of the stack, the heap and the executables and libraries |
riel | so using buffer overflows to jump to a libc function becomes very improbable |
riel | so the attacker will just crash sendmail instead of getting a root shell |
riel | (or more likely, the overflow will not be an attack at all) |
riel | <xtingray> riel: what is the cheapest method, in terms of resources, that you know? |
riel | xtingray: well, the best would be using an AMD64 chip ;)) |
riel | that hardware has the executable bit for page tables built right in |
riel | and there is no performance penalty for a non-exec stack or heap ;) |
riel | <hans> riel: what makes this better than, for example, a chrooted environment for applications? |
riel | ok, exec-shield is not "better" ... I would use the two together |
riel | you can (and probably should) run named in a chroot environment |
riel | but once somebody breaks in, they can still use your computer to send network packets to somewhere else (maybe to help a DDoS?) |
riel | so reducing the chance that a buffer overflow can actually be exploited is probably a good thing to do |
riel | oh, you can get exec-shield as part of Arjan's 2.6 kernel RPM |
riel | at http://people.redhat.com/~arjanv/ |
riel | I have not seen it in any other kernel patch sets yet |
riel | <amplifiel> other distros like adamantix use rsbac and pax for security, what about these patches? |
riel | ok |
riel | rsbac is a bit like selinux |
riel | it helps a lot to reduce the damage after a program is broken into |
riel | but it does not help prevent break-ins into one program |
riel | PaX helps prevent such break-ins, but has a much higher cost than exec-shield |
riel | if you are really paranoid, you will probably prefer PaX over exec-shield |
riel | but personally I suspect that the performance impact of PaX (in particular the extra space use, meaning your programs have less address space) will make it too "expensive" for most people |
riel | are there any other questions on exec-shield ? |
riel | (otherwise I'll move on to the 4/4 split) |
riel | ... |
riel | ok, 4/4 split ;) |
riel | I'll now explain about what is probably the biggest problem Linux has on 32 bit x86 systems |
riel | the problem is that x86 can have up to 64GB of physical memory, but only 4GB of virtual address space |
riel | and the classical Linux virtual memory layout means that the kernel only has 1GB of space! |
riel | that means, 1GB of space to manage 64GB of memory |
riel | that is simply not enough space if you run the kind of programs anybody with a 64GB server runs |
riel | to make a long story short, with 1GB kernel space, a system with more than about 24GB RAM is nearly useless |
riel | because you do not have the kernel memory to run the programs people with a big system run |
riel | in 2.6, and later 2.4 kernels, the page tables were moved to high memory |
riel | that is, they are stored outside of the 1GB of kernel space |
riel | that increased the limit from 16GB to 24 or 32GB |
riel | but still, nowhere near the 64GB that x86 systems can use |
riel | of course, the real solution is for the people with really big servers to use a 64 bit CPU |
riel | so the kernel has all the space it needs |
riel | but noooo, they want a cheap server ;(( |
riel | so they buy x86 |
riel | of course, the software people are always the ones left with the problem ;) |
riel | the simplest thing we can do is increase the kernel space to 4GB |
riel | but there is only 4GB of virtual address space in total, divided between userspace and kernel space |
riel | so we need to change that |
riel | Ingo Molnar made a patch that does something pretty ugly, that just happens to work well and needs few changes in the rest of the kernel code |
riel | you know that every process has its own memory space |
riel | with Ingo's 4/4 split patch, the _kernel_ also has its own 4GB memory space |
riel | and every time you make a system call or an interrupt happens, the system does a memory context switch |
riel | into the 4GB large kernel memory space |
riel | this way the kernel has enough memory to manage 64GB of physical memory and the programs running in it |
riel | however, it does come at quite a cost |
riel | it commonly costs 10% performance |
riel | because the CPU needs to switch memory address spaces all the time |
riel | on some benchmarks the cost is as high as 30% ... |
riel | also, this is the last big change that can be done on 32 bit systems |
riel | if Intel ever comes out with a 32 bit chip that can address more than 64GB of physical memory, there is no next trick we can use |
riel | that is why I think that the only real solution is to use a 64 bit chip |
riel | if you need lots of memory |
riel | <jamesm> riel: how long do you think people will keep using ia32 for large systems? |
riel | I think they will keep using ia32 until Intel has a cheap 64 bit CPU |
riel | or until they need more than 128GB of memory |
riel | I am afraid that IA64 will never really become cheap |
riel | because it is designed as a very high-end chip |
riel | however, with AMD marketing their cheap 64 bit chip, I think Intel will have to come up with something |
riel | I really hope they do ... ;) |
riel | any other questions about the 4/4 split, or memory management issues ? |
riel | ok, I'll hold a 1 minute break to give the translators a chance to catch up |
riel | then I'll continue with CKRM, the class-based kernel resource manager |
riel | ... |
riel | CKRM, class-based kernel resource manager |
riel | this is the kind of project I have been dreaming about since the 2.0 kernel ;) |
riel | and some small aspects of it are in the kernel |
riel | basically, CKRM consists of two parts: |
riel | 1) a classifier, to group tasks into resource classes based on |
riel | - pid |
riel | - gid |
riel | - uid |
riel | - name |
riel | - resource class id |
riel | - ... |
riel | 2) resource control modules, that plug into the CKRM core and |
riel | - divide the CPU fairly between resource classes |
riel | - enforce memory limits between resource classes |
riel | - ... |
riel | basically, with CKRM you will be able to do things like: |
riel | "I want sendmail and all processes started by sendmail to consume no more than 10% of memory or 20% of the CPU" |
riel | so no matter how overloaded your mail queue is, your system as a whole will not be overloaded |
riel | or at a university, you could specify "the students get between 10% and 50% of memory, the staff get between 30% and 80% of memory, the system administrator gets as much as he wants" |
riel | the possibilities of what you can do with CKRM are nearly endless |
riel | I am sure those of you with BOFH inspiration can come up with some creative ideas ... |
riel | [again, if you have questions ask them in #qc] |
riel | you can find information on CKRM on http://ckrm.sourceforge.net/ |
riel | of course, CKRM has some serious downsides too |
riel | it is very cool and very flexible, but also very complex |
riel | I would not be surprised if CKRM was too complex for Linus |
riel | and things need to be made simpler before it can be merged into the 2.7 kernel |
riel | <BigSam72> ok, once CKRM is implemented and limits are set, for example for sendmail, what happens when sendmail reaches a limit? do memory allocations fail? |
riel | in the most common case, sendmail would get swapped out |
riel | it would get virtual memory, just not physical memory |
riel | also, if the system has free memory that is not being used at the time, a resource class can just borrow that memory |
riel | <jamesm> riel: what is the performance overhead? |
riel | I cannot answer the performance overhead question yet, since CKRM is in very early stages |
riel | the code is not quite ready yet and needs a lot of work |
riel | I suspect the performance overhead will be small for most resource schedulers |
riel | <franl> Can CKRM control only CPU and memory usage, or can it control other things, like fork()s and send()s per second? |
riel | franl: currently CKRM can control CPU, memory and IO use only |
riel | but people are planning more resource modules |
riel | for CPU and IO, the CKRM module is a scheduler |
riel | so you can give certain bandwidth guarantees and maximum limits to resource groups |
riel | memory is fairly similar, except for one big difference |
riel | you have a new second of CPU time every second, but memory doesn't grow ;) |
riel | in computer science terms, memory is a non-renewable resource |
riel | so if a resource group uses more memory than its limit but something else needs it, the system needs to do work to take it away (swapping out) |
riel | for CPU, IO bandwidth or network scheduling the system does not need to do such work |
riel | for system administrators there is another issue to keep in mind |
riel | if you give every resource group in your system a guaranteed minimum of 10%, make sure you don't have more than 10 resource groups ;)) |
riel | <franl> What does the system call interface to CKRM look like? Is it just a bunch of ioctl()s? |
riel | franl: the interface to userspace is still in flux |
riel | CKRM A0* used system calls, but CKRM B0* seems to be using a /proc interface |
riel | this could change again in the future, until Linus is happy ;) |
riel | <franl> Does Linus support CKRM in principle for 2.7 development? |
riel | I don't think he has been asked yet ;)) |
riel | it may be difficult to convince him that CKRM is cool |
riel | he never likes server-only things |
riel | Linus wants functionality to be useful for everybody |
riel | ... and he is right |
riel | however, CKRM may be useful for desktop systems |
riel | for example, the desktop user could get a guaranteed minimum amount of the system resources so updatedb cannot make the desktop slow |
riel | yes it's a hack, but if it helps make the desktop better ... ;) |
riel | any other questions about CKRM, before I move on to "memory hotplug" ? |
riel | ... |
riel | ok, memory hotplug ;) |
riel | big server manufacturers are working on a new piece of functionality |
riel | the idea is that system administrators can plug new memory (DIMMs) into the system, while the system is running |
riel | some even want the system administrator to be able to remove DIMMs |
riel | now, adding memory should be doable during the 2.6 kernel series |
riel | we already have NUMA support in the kernel, to support different areas of memory in a system |
riel | when the system administrator adds new memory, we could create a new memory zone for that memory |
riel | and then hook up the new zone in the list of other memory zones |
riel | after that we can start using the memory |
riel | "simple" ... except for some details I will not bother you with now ;) |
riel | <franl> How do you remove DIMMs that have dirty pages in them? |
riel | ok ... memory removal is a BIG PROBLEM ;) |
riel | I don't think Linux is going to support that any time soon |
riel | if all the memory in a DIMM belongs to user programs, we could just swap them out when the administrator says he wants to remove a DIMM from the system |
riel | but what if the memory is mlocked and we're forbidden from swapping it out ? |
riel | or worse, what if the DIMM contains kernel data structures that are referenced by physical memory address ?! |
riel | I don't see any good way to deal with that |
riel | I can think of a few BAD ways, but we don't want that ;) |
riel | <franl> Even if a DIMM can be purged of kernel pages and dirty user pages, you still have to hope the sysadmin pulls the right DIMM. :) |
riel | I guess that's what the little green and red lights are for ;)) |
riel | memory hotplug cards tend to have all kinds of status lights on them, luckily |
riel | also, why would you ever want to remove memory from a system ? |
riel | I can think of 2 things a system administrator needs to do: |
riel | 1) add more memory, because the programs need more |
riel | 2) replace a piece of bad memory with good memory ... but that could be done in hardware, with the hardware mirroring the bad memory to a piece of good memory and then letting the sysadmin pull the old DIMM |
riel | in this case "bad memory" would be memory that gets correctable ECC errors |
riel | so the data is still good |
riel | <hans> <riel> also, why would you ever want to remove memory from a system ? |
riel | <hans> maybe to exchange it with faster ram? |
riel | <franl> Or to upgrade to higher capacity DIMMs. |
riel | ok, two good points ;) |
riel | especially the higher capacity DIMM argument is a valid one |
riel | I forgot all about that |
riel | somebody from VALinux Japan is working on memory hot remove, btw |
riel | but he is running into the fundamental problems I just described |
riel | so his patch only works most of the time |
riel | also, he can only remove memory that has no kernel data in it |
riel | in short, for the 2.6 kernel you should only expect memory hot-add |
riel | hot-remove is very complex ... |
riel | ... |
riel | are there any other questions about the memory hotplug support ? |
riel | ok, then I guess this presentation is done ;) |
riel | thanks to the Umeet organisers for putting this event together |
riel | I know how much work it is and am thankful they organised another Umeet |
riel | I would also like to thank the translators, who are working hard to get talks translated (live!) into other languages |
riel | if you are still awake, I'd also like to thank the audience |
riel | it just wouldn't be the same if I was talking to myself ;) |
riel | thanks everyone, this Umeet was great again |
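
The CKRM remark about minimum guarantees (ten groups at 10% each is the most that can fit) can be sketched as a simple check. This is only an illustration in Python, not the CKRM interface: the `ResourceClass` type, the `guarantees_fit` helper and the 0-100 range given to the sysadmin class are assumptions made for the example, while the student and staff percentages come from the university example in the talk.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ResourceClass:
    """Hypothetical description of one resource class (not a CKRM structure)."""
    name: str
    min_share: int  # guaranteed minimum, in percent of the resource
    max_share: int  # upper limit, in percent of the resource


def guarantees_fit(classes):
    """Return True if every class can receive its minimum at the same time."""
    for c in classes:
        if not 0 <= c.min_share <= c.max_share <= 100:
            raise ValueError(f"invalid shares for {c.name}")
    # The guarantees can only all be honoured if they add up to at most 100%.
    return sum(c.min_share for c in classes) <= 100


# University example from the talk; the sysadmin range is an assumption.
university = [
    ResourceClass("students", 10, 50),
    ResourceClass("staff", 30, 80),
    ResourceClass("sysadmin", 0, 100),
]

# Ten groups with a 10% minimum each, then one group too many.
ten_groups = [ResourceClass(f"group{i}", 10, 100) for i in range(10)]
eleven_groups = ten_groups + [ResourceClass("group10", 10, 100)]

print(guarantees_fit(university))     # minimums add up to 40
print(guarantees_fit(ten_groups))     # minimums add up to exactly 100
print(guarantees_fit(eleven_groups))  # minimums add up to 110
```

The check only covers the guaranteed minimums. The maximum limits may add up to more than 100%, which matches the point in the talk that a class can borrow resources nobody else is using at the time.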