*** riel sets mode: +m |
riel | I guess it's better if we let Alan have a few minutes of rest before the talk |
riel | OK, welcome to this UMEET lecture |
riel | today Alan Cox will hold a talk titled |
riel | "Optimising for modern processors" |
riel | in order to keep this channel readable, we have set it +m |
riel | so only the ops can talk |
riel | if you have a question during the lecture, you can always ask it in #qc |
riel | you probably already know Alan Cox, who is one of the driving forces behind Linux kernel development |
riel | he is a man of many talents though |
riel | in fact, he even prepared slides for this talk |
riel | you can find those on: |
riel | http://www.linux.org.uk/~alan/Slides/ |
riel | Alan, go ahead when you're ready |
Alan | Ok |
Alan | This talk is partly about how modern processors work |
Alan | Mostly however it's about why this changes the way you need to program to get the best performance
Alan | By modern processors we really mean anything from the pentium onwards - in some ways from the 486 onwards |
Alan | Ten years ago a 40MHz processor was pretty fast. Today the same position is occupied by a 3GHz processor |
Alan | Memory has not increased speed to cope with this, and more importantly it has not improved in latency (the time from asking for a piece of memory to getting it) much at all |
Alan | The new processors can also execute multiple instructions each clock cycle, so in fact the processor might want to be accessing memory not 100 times faster, as you might think from the clock rate change, but nearer 500 times faster
Alan | To deal with this processors added cache memory. It is possible to build systems which just have very fast memory but its incredibly expensive |
Alan | The sort of computer you find on your desktop today has a very slow memory subsystem - things like 133MHz SDRAM and DDR RAM have improved the data rate but not enough, and have done little to improve the access time for a given piece of data
Alan | To give you an idea how slow main memory is compared with the cache I measured the copying speed of data in the on processor cache (called the L1 or level 1 cache) and the larger slower cache (The L2 or level 2 cache) |
Alan | On an Athlon the L2 cache was six times slower for copying than the L1 cache |
Alan | Main memory is eight times slower than the L2 cache |
Alan | So every piece of data you have to fetch from main memory you could have fetched fifty from the cache |
Alan | This makes keeping the right things in the cache extremely important, as well as knowing how the cache works so that you can understand what is needed to get the best use from it |
Alan | In 'real world' terms if your L1 cache was your desk and took 1 second to access your main memory (your filing cabinet say) would take one minute for each item you had to find |
Alan | The same things are true for pretty much all modern processors. The newer the processor quite often the larger the gap because the processor is getting faster more rapidly than the memory |
Alan | Worse still there are physical limitations on how fast the memory can go, and how quickly signals can travel across the motherboard - the speed of light really is too slow nowadays
Alan | The obvious question then is what is in the cache. If we know what is in our cache and what data it will keep we have some idea how we want to write our programs |
Alan | d33p] can the cache be directly manipulated by a programmer.. I would have thought it wasn't?
Alan | d33p: in the normal cases you can't directly control the cache.. but you can understand how the cache will behave |
Alan | d33p: there are instructions on newer processors where you can help the cache along - but that's the last slide of the talk 8)
Alan | The cache holds the most recently used code and data. So if you execute a loop the loop will end up in the cache |
Alan | similarly if you are looking at a list regularly the list contents will end up in your cache
Alan | Because the cache is quite primitive in some ways the actual data it can store in each piece of the cache (each cache line) is quite restricted. |
Alan | The processor doesn't have time to look at all of the cache to see if a piece of data is already in the cache. Instead it breaks the address up into several pieces
Alan | The upper bits of the address (address & ~4095) go to the memory management hardware to turn a virtual address into a physical address
Alan | At the same time the lower bits are passed to the cache. The cache looks at the remaining bits (ignoring the lowest 4-6 depending on the processor) |
Alan | and it looks in two or four places to see if the data it needs is present. If it is then it uses this data, and cancels the work the memory management hardware is doing
Alan | That limitation has some fun effects which I'll demonstrate later on |
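The address split Alan describes can be sketched in C. The line size, set count and function names below are purely illustrative (a 64-byte line, 512-set cache), not any particular CPU's:

```c
#include <stdint.h>

/* Illustrative sketch of how a CPU splits an address to index a cache:
 * low bits = byte within the cache line, middle bits = which set to
 * search, upper bits = the tag compared against what is stored there. */
#define LINE_BITS 6   /* 64-byte cache line */
#define SET_BITS  9   /* 512 sets */

static inline uint32_t cache_offset(uint32_t addr) {
    return addr & ((1u << LINE_BITS) - 1);               /* byte within line */
}
static inline uint32_t cache_set(uint32_t addr) {
    return (addr >> LINE_BITS) & ((1u << SET_BITS) - 1); /* set to look in */
}
static inline uint32_t cache_tag(uint32_t addr) {
    return addr >> (LINE_BITS + SET_BITS);               /* tag to compare */
}
```

Note that any two addresses exactly 32K apart (with these made-up sizes) land in the same set - which is the restriction the next demonstrations exploit.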
Alan | Not all of memory is cached of course - it would be a bad idea if data for your display was cached and you didn't see the text because the processor was holding it
Alan | init64] Alan_Q: some caches have another way to find lines. They use 1 "comparator" per cache line
Alan | init64] but I guess it's too expensive for big amounts of cache
Alan | init64: basically yes - the more complex an algorithm the longer it takes to run - even in hardware |
Alan | With a 3GHz processor you don't have very long to decide if something is in cache or not |
Alan | That is also one reason it is common to have a small fast L1 cache, and a larger slower (but smarter) L2 cache |
Alan | The processors normally deal with memory in chunks of 16, 32 or 64 bytes. |
Alan | So each piece of cache holds chunks of those sizes and aligned to that size. The chunks get bigger as the L1 caches get bigger generally |
Alan | All of that is loaded at the same time. When you ask an Athlon for a byte of data it will load the entire 64-byte chunk that contains it.
Alan | This means several things - one of which is that if you are going to use data, put the data you use together |
Alan | The kernel goes to great pains to put structures in an order where data that is used together is close together |
Alan | because if you loaded one bit of that data you have the rest anyway |
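As a sketch of that kernel habit, here are two hypothetical struct layouts in C; the field names are invented for illustration, not taken from the kernel. The second layout puts the fields touched on every access next to each other, so one cache line fetch covers both:

```c
#include <stddef.h>

/* Hypothetical layout: hot fields separated by cold ones, so using
 * both hot fields may pull in two cache lines. */
struct conn_bad {
    char name[60];        /* cold: only read for reports          */
    int  hits;            /* hot: touched on every lookup         */
    char description[60]; /* cold                                 */
    int  last_jiffies;    /* hot: touched on every lookup         */
};

/* Same fields, hot ones grouped: both arrive in one line fetch. */
struct conn_good {
    int  hits;            /* hot */
    int  last_jiffies;    /* hot, adjacent to hits */
    char name[60];        /* cold */
    char description[60]; /* cold */
};
```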
Alan | Ok now a first demonstration of why knowing about caches matter is demo1 (http://www.linux.org.uk/~alan/Slides/slide5.html) |
Alan | This is a very simple program that writes 4096 values into memory |
Alan | (we run it lots of times to get some numbers) |
Alan | We run this with the data spaced out on 1,2,4,8,16,32 and 64 byte boundaries |
Alan | just like updating an array of different sized structures |
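A minimal sketch of what demo1 does (the real source is linked from the slides; this is a reconstruction of the idea, not Alan's code). With stride 1 many writes share each cache line; by stride 64 every write touches a different line:

```c
#include <stddef.h>

/* Reconstruction of demo1's access pattern: write 4096 values spaced
 * 'stride' bytes apart, repeated many times. Larger strides touch
 * more cache lines for the same amount of useful data. */
static char buf[4096 * 64];

void touch(size_t stride, int repeat) {
    for (int r = 0; r < repeat; r++)
        for (size_t i = 0; i < 4096; i++)
            buf[i * stride] = (char)i;   /* one write per element */
}
```

Timing `touch(1, N)` against `touch(64, N)` on real hardware shows the drop the demo measures.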
Alan | tarzeau] how do you run demo1 in single user mode? |
Alan | tarzeau: it'll give you reasonably reliable answers if you just run it without too much else going on |
Alan | even multiuser |
Alan | So what demo1 does is much like a large number of perfectly normal applications. |
Alan | You'll notice that even on a Pentium IV with very good memory and large caches, performance drops considerably
Alan | ifvoid] how large are the P4 L1 and L2 cache? |
Alan | That varies depending upon whether it is a Xeon or not |
Alan | Those numbers are from a Xeon with 512K of L2 cache and I think 64K of L1 cache |
Alan | If you run the program on something like a Celeron then you would see a much more rapid reduction in performance |
Alan | you'd probably also want to change the for loop to do 100000 not 1000000 or you'll be waiting for it all night |
Alan | You can actually use techniques like this to find the properties of the cache on a processor |
Alan | We don't do that in Linux because the kernel knows how to ask the processor properly for the data (and puts much of it into /proc/cpuinfo) |
Alan | What this tells us is that if you are going to scan large blocks of data, you want the values you are scanning to be together in memory |
Alan | If we do 4096 comparisons of values close to each other we could be several times faster than if we looked at one field in each element of an array |
Alan | That demonstrates how important careful planning is. |
Alan | Of course in many cases you can use trees, hashes or other much more intelligent data structures to achieve the same results or better |
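One hedged way to picture that in C (names and sizes invented for illustration): counting matches against keys kept inside 64-byte structs versus in a packed array. Both loops do the same comparisons, but the packed array gets 16 keys per 64-byte cache line:

```c
/* Illustrative: same search done over a struct array (one cache line
 * fetched per key examined) and over a packed key array (16 keys per
 * line). The results are identical; the memory traffic is not. */
struct record { int key; char payload[60]; };  /* 64 bytes each */

int count_matches_aos(const struct record *r, int n, int k) {
    int c = 0;
    for (int i = 0; i < n; i++)
        if (r[i].key == k) c++;
    return c;
}

int count_matches_soa(const int *keys, int n, int k) {
    int c = 0;
    for (int i = 0; i < n; i++)
        if (keys[i] == k) c++;
    return c;
}

/* tiny example data */
static struct record recs[3] = { {1, {0}}, {2, {0}}, {1, {0}} };
static int keys[3] = { 1, 2, 1 };
```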
Alan | The second demonstration is designed to show something else |
Alan | On older processors it was very common to use lookup tables for things like division in 3D games. With a modern processor this isn't always so clear
Alan | Demo2 finds out how long it takes to do a lot of divisions, then compares it with using a lookup table to do the same thing |
Alan | On the Pentium IV, once the lookup table exceeds 128K the performance is actually better doing the maths - even though divide is an extremely costly operation
Alan | This is because looking data up in main memory is actually more expensive than doing division |
Alan | (again on a slower box you may well want to make the loop somewhat smaller) |
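A rough sketch of demo2's comparison (a reconstruction, not the actual demo source): the same quotient computed by real division and by a precomputed table. Timing many random accesses to each shows the table only wins while it stays cached:

```c
/* Reconstruction of demo2's idea: divide versus table lookup.
 * The table size here is small and illustrative; the demo grows it
 * to 1MB and beyond until cache misses make division cheaper. */
#define TABLE_SIZE 4096
static unsigned table[TABLE_SIZE];

void build_table(unsigned divisor) {
    for (unsigned i = 0; i < TABLE_SIZE; i++)
        table[i] = i / divisor;          /* precompute every quotient */
}

unsigned div_math(unsigned x, unsigned divisor) {
    return x / divisor;                  /* costly ALU op, no memory */
}

unsigned div_table(unsigned x) {
    return table[x];                     /* cheap op, memory access (x < TABLE_SIZE) */
}
```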
Alan | [14-Dec:18:37 smesjz] Step of 1 across 4K took 50 seconds. (p133/2.5.49) :) |
Alan | smesjz: Gives you an idea how much processor performance has changed |
Alan | The division one is quite interesting because it depends heavily on the processor |
Alan | on something like a pentium the lookup table will be way way cheaper |
Alan | but by the time you reach the athlons and PIII/PIV it becomes a lot less clear |
Alan | smesjz] Alan_Q: it surprises me that 5 out of demo1 tests return 50 seconds runtime (100k iterations)
Alan | smesjz: To find out why you'd really have to look at what was going on more deeply - I don't know either, unless your memory is the real limit |
Alan | [14-Dec:18:41 rp] can someone please explain exact meaning of 1024 tablesize? |
Alan | rp: the different runs access data from lookup tables that are 64K, 128K and so on |
Alan | so 1024K lookup table means the program is simulating random accesses to 1Mbyte of lookup data |
Alan | if you run this with bigger and bigger sizes eventually the performance becomes about constant. That gives you a good idea that the cache is no longer helping out |
Alan | This particular test is very important for things like image processing, JPEG compression and the like
Alan | one of the reasons that MMX is such a help for video processing is that it lets you do a lot of processing at the same time, so you can avoid lookup tables when doing things like colour conversion |
Alan | ridiculum] is there in linux any tool like VTune (intel) to debug things like Alan is explaining?
Alan | ridiculum: There are two - there is an open source thing called "oprofile", and there is Intel VTune which is expensive and requires a second windows PC and other things
Alan | ridiculum: you can look at a lot of the statistics because the newer processors have debugging registers |
Alan | they allow tools like oprofile to ask the processor "how many cache misses", "how often did you have to wait for data" |
Alan | and other similar questions. |
Alan | http://oprofile.sourceforge.net/ is the OProfile profiler |
Alan | So we've got some simple demonstrations of how important the cache is |
Alan | [14-Dec:18:48 avoozl] valgrind also might be interesting to look at, the latest version also can do cache simulation and show where in a program cache misses are occurring
Alan | avoozl: yes I had forgotten valgrind can do cache simulation too
Alan | tarzeau] were there CPUs without cache? Intel's 80286?
Alan | tarzeau: there were a lot. The 286 was almost never faster than the RAM it was attached to - similarly on the Amiga the RAM is almost twice as fast as the processor
Alan | The amiga actually used that trick to give the processor and support chips shared access
Alan | processor and chipset having alternate access |
Alan | So what have we learned about the cache and making good use of it?
Alan | Well - we know that when we get data we get it in chunks so we can put things we use in the same place. |
Alan | That helps the processor and also happens to help virtual memory (when you swap data to disk you do so in 4K chunks so you may as well keep data together for that too)
Alan | We've demonstrated that you want to keep your processing fitting within the cache. One reason Intel sell expensive processors with very large caches is that databases find it hard to do this |
Alan | so the Xeons and the really expensive pentium-pro with 1Mb caches were good for database work |
Alan | We also know that only a certain amount of data at a given alignment can be cached |
Alan | The kernel actually uses special memory allocators to try and scatter objects the kernel uses onto different alignments specifically because of this |
Alan | tarzeau] those edo memory sticks were 60ns and 70ns, how many ns are L1 and L2 caches?
Alan | tarzeau: for modern processors I'm not actually sure - even on the 486, L2 cache was about 16ns
Alan | If you have an array of objects that are power of two sized you are likely to be getting almost the worst possible performance from the caches
Alan | so it's a useful trick to add a little extra unused space to each block of memory to pad out the array elements so they cache better
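For instance (sizes purely illustrative): padding a power-of-two element out to a non-power-of-two size means successive elements no longer land on the same small group of cache sets:

```c
/* Illustrative: with 256-byte elements, the hot field of every element
 * sits at a power-of-two stride, so the elements compete for a small
 * group of cache sets. A 64-byte pad makes the stride 320 bytes, which
 * spreads the elements across many more sets. */
struct elem_bad {
    char data[256];              /* power-of-two size: collides */
};

struct elem_good {
    char data[256];
    char pad[64];                /* wasted space that buys set spread */
};
```

The trade-off Alan notes still applies: the pad itself consumes cache, so this is only a win when the set collisions were the real bottleneck.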
Alan | Ok time for the next demonstration |
Alan | http://www.linux.org.uk/~alan/Slides/slide7.html |
Alan | (there isn't a demo3.. I took it out to make the talk fit the time better 8))
Alan | This is designed to show how different ways of doing something can have different performance because of the caches |
Alan | what it actually does (adding numbers) is fairly trivial, but it's not that unlike real programming examples
Alan | The first run generates a large set of data, and then adds it up. Generating sets of data then processing them, then processing the results is a very common way of programming |
Alan | but it can actually give the worst possible behaviour |
Alan | The second run we add the data up as we generate it, and get much better performance. |
Alan | This is mostly because in the first run we end up emptying all the data out of the cache and then loading it back in again
Alan | In the second case because we add as we go the data only ever leaves the cache once |
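The two styles can be sketched like this (an illustrative reconstruction of the demo, not its actual source). The fused version touches each value while it is still hot in cache:

```c
/* Reconstruction of the demo's two styles. When N is bigger than the
 * cache, the two-pass version reloads every value from main memory
 * in the second pass; the fused version never loses them. */
#define N (1 << 20)
static int data[N];

long generate_then_sum(void) {
    for (int i = 0; i < N; i++)          /* pass 1: generate */
        data[i] = i & 0xff;
    long sum = 0;
    for (int i = 0; i < N; i++)          /* pass 2: data long since evicted */
        sum += data[i];
    return sum;
}

long sum_while_generating(void) {
    long sum = 0;
    for (int i = 0; i < N; i++) {        /* one pass: add while still cached */
        data[i] = i & 0xff;
        sum += data[i];
    }
    return sum;
}
```

Both produce the same answer; only the memory traffic differs.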
Alan | The final case shows how much of the operation is the actual overhead |
Alan | What this means is that for any large amount of data and computation it is important to work on it in chunks. |
Alan | Engineers and high performance computing people do this all the time - the GIMP knows about it too |
Alan | Many things the GIMP does in its filters it does using rectangles of the image rather than applying each change to the entire image one after another |
Alan | debUgo-] Alan_Q: how much does cache associativity affect general memory performance?
Alan | debugo: keeping data in cache makes a real difference to overall performance - mostly on SMP systems, which is where the next few slides go |
Alan | There are lots of algorithms for this and the same techniques are actually used for clusters and beowulfs - only they are trying to minimise messages over ethernet so it's much much more important than on a single system
Alan | All of this stuff about caching matters much much more when you have a multiprocessor PC |
Alan | Less so on the bigger alpha and sparc machines because they have memory systems designed for multiple processors |
Alan | A dual athlon or dual pentium III/IV however is two processors on the same memory bus |
Alan | The 3 demonstrations have already shown that with a single processor the memory performance is not up to the processor |
Alan | So a dual processor machine gives us twice the problem |
Alan | One of the demonstrations you can do is to run a continuous large memory copy on one processor and time performance of copies on the other - on some dual PC machines the copies being timed will perform at 1/3rd of the speed they run without the other copying loop
Alan | ifvoid] Alan_Q: won't that change for the Hammer and Itanium 3? |
Alan | ifvoid: hammer lets you attach memory to each processor, the more processors you add the more memory controllers you can add |
Alan | ifvoid: it depends what the cost of that is whether vendors will do it |
Alan | rp] so does that mean memory performance does not depend only on the processor but also on bus speed?
Alan | rp: yes |
Alan | runlevel0] <Alan> So a dual processor machine gives us twice the problem :so this explains why we do not get 2x the performance of 1 processor
Alan | runlevel0: there are two reasons you don't get twice the performance |
Alan | the first is that you are sharing a memory which is not fast enough |
Alan | the second is that there is a cost in stopping the system from doing the wrong two things at the same time |
Alan | the kernel has to do real work to stop two people allocating the same memory, using the same disk block and all the other things we don't wish to happen
Alan | The only reason a dual processor PC is usable at all is because most memory accesses are coming from the cache in normal usage |
Alan | each processor has its own cache (except some dual pentium machines which are just painful 8)) |
Alan | ridiculum] what about hyperthreading and cache coherence? |
Alan | ridiculum: hyperthreading shares the cache between the two execution units on that processor |
Alan | so you get to do two things at once but each application will suffer more cache misses |
Alan | sh0nX] so this is where spinlocks come in |
Alan | sh0nX: right - that's the main thing the kernel uses to synchronize things internally
Alan | One thing the processors have to do is to ensure that the two processors don't cache different versions of the same data or miss changes the other processors make
Alan | docelic] Id appreciate more on spinlocks too |
Alan | doc: we'll talk about that a bit after the main talk |
Alan | bzzz] Alan_Q: how pci devices may see data which in cache only? |
Alan | bzzz: the processors as well as making sure they see each others changes do the same with devices on the PCI bus |
Alan | bzzz: The standard caching technique is a thing called MESI |
Alan | That stands for the four states each piece of the cache can be in |
Alan | We have an "M" - or Modified state. That means this piece of information is something this processor has changed and that we have data the other processors don't know about yet
Alan | We have an "E" - or Exclusive state - where we know nobody else has this data but we do
Alan | We have an "S" or shared state, where we know we have a copy of the data but other people also have copies in shared state |
Alan | nobody has it modified |
Alan | and we have "I" or Invalid - where we don't have a clue what is going on but we know we don't have the data |
Alan | At any point two processors cannot have the same data except in shared state |
Alan | When we modify some data we change the state on it - if it was exclusive it becomes modified
Alan | If it was shared we have to kick each of the other processors and make them get rid of their copies |
Alan | If we dont have a copy (I) we must ask for it - this like moving from shared can be quite expensive |
Alan | If another processor had a copy in modified state we have to ask that processor to write it back to memory and then read it ourselves |
Alan | What we want to avoid at all costs is having two processors continually modifying the same data |
Alan | This turns into a sort of food fight on the memory bus |
Alan | and we spend most of our time passing data back and forth between the processors |
Alan | That doesn't get a lot of work done - and once you want to scale to big computers it becomes very important indeed to avoid it |
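The transitions just described can be captured in a toy model. Real coherence involves bus messages between all the caches and memory; this only sketches the local state changes for one line on one CPU, with invented function names:

```c
/* Toy MESI model: the state of one cache line in one CPU's cache. */
typedef enum { INVALID, SHARED, EXCLUSIVE, MODIFIED } mesi_t;

/* A local write always ends in MODIFIED. From SHARED the other CPUs
 * must first be told to drop their copies; from INVALID the line must
 * first be fetched - both are the expensive cases Alan mentions. */
mesi_t on_local_write(mesi_t s) {
    switch (s) {
    case INVALID:    /* fetch for ownership, then modify  */
    case SHARED:     /* invalidate other copies first     */
    case EXCLUSIVE:  /* nobody else has it: just modify   */
    case MODIFIED:   /* already ours and dirty            */
        return MODIFIED;
    }
    return INVALID;
}

/* Another CPU reads our line: a MODIFIED copy is written back to
 * memory and both copies drop to SHARED. */
mesi_t on_remote_read(mesi_t s) {
    return (s == INVALID) ? INVALID : SHARED;
}
```

The "food fight" case is two CPUs alternately calling `on_local_write` on the same line: each write forces the other's copy through INVALID, and the data ping-pongs across the bus.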
Alan | IBM have been doing a lot of work on kernel code where these kind of fights can occur as they have 16 processor systems |
Alan | which make it very apparent when you get this wrong |
Alan | [14-Dec:19:17 sh0nX] Alan_Q: so we should be using the cache for SMP processors to keep data that isn't going to change much and use the processors to handle data that does change often?
Alan | shonX: there are systems where it makes sense to have heavily shared data uncached. The way the PC hardware works really stops you doing this |
Alan | Even if you make it uncached it is still slow |
Alan | Most of the time this is not a problem - applications dont share a lot of things anyway |
Alan | Threaded applications tend to share very little data thankfully |
Alan | When you design threaded and SMP applications it is important to minimise the amount of time data spends bouncing between processors
Alan | So for example if you were doing JPEG encoding on a multiprocessor system it would be better to use one processor to do the top half of the image and the second processor to do the bottom half |
Alan | than to have one processor do colour conversion and the other processor do the compression pass |
Alan | when it comes to things like mpeg encoding this gets quite tricky |
Alan | In addition it is possible to get what is called "false sharing" |
Alan | sh0nX] Alan_Q: so we want to keep both processors doing OTHER things
Alan | shonx: exactly |
Alan | shonx: like people processors work best when they are not falling over each other |
Alan | False sharing occurs because the processor cache works in 32 or 64 byte chunks |
Alan | If you happen to put two unrelated pieces of data in the same 64 bytes you might accidentally have one thing used by each processor in the same cache line - and start a fight |
Alan | Thus people pad out such structures to make them bigger and avoid this |
Alan | or they keep them apart |
Alan | (padding them out avoids sharing but it means you use more cache of course - so you are doing what demo1 said not to do)
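A sketch of that padding trick, assuming 64-byte cache lines and GCC's `__attribute__((aligned))`; the per-CPU counter example is invented for illustration:

```c
/* False sharing: two counters, each only ever written by its own CPU,
 * but packed into the same 64-byte cache line - every write by one
 * CPU steals the line from the other. */
struct counters_packed {
    long cpu0_count;
    long cpu1_count;   /* same cache line as cpu0_count */
};

/* The cure: pad and align each counter to its own 64-byte line. */
struct padded_counter {
    long count;
    char pad[64 - sizeof(long)];   /* fill the rest of the line */
} __attribute__((aligned(64)));

struct counters_padded {
    struct padded_counter cpu[2];  /* one line per CPU, no fighting */
};
```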
Alan | sh0nX] Alan_Q: so when designing SMP applications, how do we tell which processor to handle which data without causing the processors to both handle the same data?
Alan | shonx: the scheduler tries to keep a given thread running on the same processor as much as possible |
Alan | so it's just a matter of avoiding accessing the same data a lot in two different threads
Alan | Similarly we try and keep a given application running on the same processor so that we don't spend a lot of time copying stuff from one processor to another
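You can also pin a thread to a processor by hand with the Linux affinity call (this sketch assumes a glibc that wraps `sched_setaffinity`; the scheduler normally makes this unnecessary):

```c
#define _GNU_SOURCE
#include <sched.h>

/* Pin the calling thread to one CPU so its cached data stays put.
 * Returns 0 on success, -1 on error (like the underlying syscall). */
int pin_to_cpu(int cpu) {
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    /* pid 0 means "the calling thread" */
    return sched_setaffinity(0, sizeof(set), &set);
}
```

Forcing affinity is only worth it when you genuinely know your access pattern better than the scheduler does.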
Alan | [14-Dec:19:25 yalu] Alan_Q: is the scheduler smart enough to keep threads who share a lot of data on the same processor?
Alan | yalu: it makes some simple guesses |
Alan | but it's actually very hard to measure the real amount of sharing efficiently
Alan | especially since read only sharing (eg of code) is fine
Alan | zwane] Alan_Q: All this must get really interesting with Hyperthreaded cpus
Alan | zwane: There are reasons Ingo is still fiddling with getting the best performance off such processors 8) |
Alan | With hyperthreading you sort of have two processors per cache |
Alan | and the cache has some other odd internal limits too |
Alan | zwane] Alan_Q: do you reckon scheduler only would suffice? How about leveraging cpu affinity for say doing bias in interrupt handling?
Alan | zwane: There are good arguments in some cases for having a process wake up on the CPU that handled an interrupt. In most cases however it isn't anything like as valuable as you would think
Alan | most of the process data is cached on the cpu that last ran it |
Alan | Most good I/O devices use DMA - so they write to memory themselves, and the memory they wrote to is removed from all the processor caches (since they modified it)
Alan | there are good reasons for sticking interrupts to specific processors
Alan | (if processor 1 has all the data for eth0 cached then why handle the interrupt on processor 2) |
Alan | sh0nX] so, if a program is written for UP, how does the kernel scheduler handle its data on two CPUs? or it can't
Alan | sh0nX: the scheduler can't split up something with only one thread of execution. It can spread different applications around - so it can run your game on one processor and the X server on the other
Alan | sarnold] Alan_Q: does linux currently have a mechanism to specify that all interrupts should be handled by a specific [set of] CPUs?
Alan | sarnold: it has some stuff that Ingo did, it's at the obscure and wondrous end of kernel tuning
Alan | sh0nX] I see, so we have to use threads in our code in order to benefit from SMP
Alan | sh0nX: or two programs - sometimes that is just as easy or easier
Alan | There is one last subject for this talk, then we can move onto most of the questions |
Alan | Someone asked early on about helping the cache out |
Alan | On a modern processor you have instructions like "prefetch" and "prefetchw" |
Alan | These allow you to tell the processor you will be needing data in the future |
Alan | So instead of getting stuck waiting for data to arrive from memory you can tell the processor in advance |
Alan | The big problem with this is you often don't know well in advance which memory you will need |
Alan | A memory copy is easy - and the Athlon memory copy in Linux actually keeps saying "and prefetch me 320 bytes ahead of this point" |
Alan | Similarly things like graphics processing benefit immensely as do programs that use large arrays of data in predictable fashions |
Alan | (fortran does very well here, strangely enough)
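A hedged sketch of such a copy loop using GCC's `__builtin_prefetch` (the real Athlon routine is hand-written assembler using the `prefetch` instruction; this only shows the idea of asking for data a few hundred bytes ahead):

```c
#include <stddef.h>

/* Copy with a prefetch hint issued once per 64-byte cache line,
 * ~320 bytes (about five lines) ahead of the current position.
 * Prefetch is only a hint: prefetching past the end of src is
 * harmless, the hardware simply ignores bad addresses. */
void copy_with_prefetch(char *dst, const char *src, size_t n) {
    for (size_t i = 0; i < n; i++) {
        if (i % 64 == 0)
            __builtin_prefetch(src + i + 320);
        dst[i] = src[i];
    }
}

/* small self-check: the copy must still be an exact copy */
int copy_selftest(void) {
    char src[256], dst[256];
    for (int i = 0; i < 256; i++) src[i] = (char)i;
    copy_with_prefetch(dst, src, 256);
    for (int i = 0; i < 256; i++)
        if (dst[i] != src[i]) return 0;
    return 1;
}
```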
Alan | We use this in the kernel for memory copies and sometimes for lists
Alan | it is hard to use for lists because memory is so slow you want to say "prefetch me about five or six items ahead" |
Alan | <translator wait> [prefetch me a translator ;)] |
* riel dcc's the crowd some virtual beers |
Alan | Ok translators fingers seem to have caught up |
Alan | What we actually need to make this sort of thing work is new data structures
Alan | one of the common approaches is to have lists which know next/previous but also know 'five items on' and 'five items back'. We don't do this in the kernel currently |
Alan | but it may be something we must look at in the future as processors get faster still |
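A sketch of such a structure (invented for illustration; not a kernel data structure): each node also remembers the node five hops ahead, so a walker can prefetch far enough in advance for the fetch to be useful by the time it arrives:

```c
/* Hypothetical list node with an extra 'five items on' pointer,
 * maintained at insert time, used purely as a prefetch target. */
struct pnode {
    int value;
    struct pnode *next;
    struct pnode *ahead5;   /* five items on, or NULL near the tail */
};

long walk_sum(struct pnode *n) {
    long sum = 0;
    while (n) {
        if (n->ahead5)
            __builtin_prefetch(n->ahead5);  /* hide memory latency */
        sum += n->value;
        n = n->next;
    }
    return sum;
}

/* tiny three-node example list: 1 -> 2 -> 3 (too short for ahead5) */
static struct pnode n3 = { 3, 0, 0 };
static struct pnode n2 = { 2, &n3, 0 };
static struct pnode n1 = { 1, &n2, 0 };
```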
Alan | The final useful thing prefetch is used for in the kernel makes use of the Athlon 'prefetchw' which says "I want this data soon, and I will write to it" |
Alan | unlike prefetch this gets an exclusive copy of the data. We use this for prefetching locks - which is something that is very expensive if it has to go to main memory |
Alan | It is very common for a lock structure to belong to another processor and we often know the lock is going to be used so can prefetch it |
Alan | sh0nX] I assume we use some sort of spinlock to prevent another processor from prefetching the same data?
Alan2 | uggh.. lag 8( |
Heimy | mmh... |
Heimy | 19:44 <Alan> sh0nX] I assume we use some sort of spinlock to prevent another |
Heimy | 19:44 <Alan> processor from prefetching the same data? |
Alan2 | We don't actually lock that |
Alan2 | very occasionally we prefetch it and it is stolen by another cpu then fetched back again |
Alan2 | it happens so rarely it is cheaper not to worry |
Alan | Ah .. back again |
Alan2 | or not, as the case may be
Alan2 | Also if you had a lock for the lock - you would want to prefetch for the prefetch |
Alan2 | and so on repeatedly |
Alan2 | So in the kernel we treat prefetch very much as a hint |
Alan2 | if it does the right thing most times then it is fine..
Alan2 | Ok that is really the end of the main part of the talk |
Alan2 | hopefully it has given people some ideas of why caches matter |
Alan2 | and a bit about programming with them in mind |
Alan2 | If we can start with on topic questions before we wander off that would be best |
riel | I guess people should ask the on topic questions in #qc |
riel | so we can leave #linux moderated for a few more minutes |
Alan2 | sarnold:#qc] Alan2: i've wondered if prefetching cuts memory bandwidth significantly.. have people tested with prefetch config'ed away?
Alan2 | sarnold: we've done a fair amount of testing. Most of the time prefetching actually helps use memory bandwidth that would otherwise be wasted |
Alan2 | The athlon one was so fine tuned that we broke some VIA chipsets due to a hardware bug though 8) |
Alan | rene:#qc] Alan2: talk seemed to be about caching alone. do things like instruction alignment make a lot of difference on modern processors?
Alan | rene: they matter a bit - it depends on the processor how much. gcc does know how to get these right when you pick a processor type. Normally however it is under 1% |
Alan | Arador:#qc] Alan2: what're the effects of preempt on caching? |
Alan | arador: the more you switch between tasks the less useful the cache gets
Alan | Pre-empt doesn't really make a lot of difference |
Alan | It is however why systems designed for a lot of simultaneous users have a lot of cache |
Alan | [14-Dec:19:55 aka_mc2:#qc] ALAN: do you think that the Crusoe processor, which Linux supports, will be considered for all these programming techniques?
Alan | aka_mc2: Crusoe is very hard to deal with - the system emulates an x86 and it adjusts its emulation according to things it learns at runtime. That means it can learn what seems to need prefetching and many other things a normal processor cannot. How much of that it actually does I don't know. |
Alan | sklav:#qc] i have noticed higher load averages after i use a kernel with -O3 and/or -O5
Alan | sklav: much of that is actually cache related - gcc -O3 and -O5 unrolls loops which makes them use a lot more memory and on modern cpus is a bad thing to do |
Alan | really it is a bug in some gcc's that it does this too much |
Alan | jmgv:#qc] Alan: don't you think a lot of the work about register usage and other questions depends on the compiler, and that makes us lose some control over those issues?
Alan2 | jmgv: true - but do you want to hand optimise one megabyte of code ? |
Alan2 | jmgv: for the kernel we actually write small critical pieces of code in assembler in some cases - things like memcpy for example
Alan2 | there are other bits where the C is written so that the compiler outputs the right code rather than the obvious way |
Alan2 | Rapiere] If GCC improves one thread's cache use, won't this spoil multi-threading interactivity?
Alan2 | Rapiere: the scheduler is dealing at a much higher level - and the decisions it makes which are designed for best cache performance are the right ones anyway fortunately |
Alan2 | sapan:#qc] Alan2: you said "we know that only memory of certain sizes at certain offsets can be cached" could you explain?
Alan2 | The processor uses parts of the address to indicate which bit of the cache to look in |
Alan2 | To the CPU an address really looks like [Page Number][Cache line][index into cache] |
Alan2 | So the cache always caches on a 64 byte boundary on an athlon |
Alan2 | In addition if we have lots of data with the same cache line number we can only cache two or four of those bits of the data
Alan2 | the cache can't store any block of data in any place |
Alan2 | Ok shall we go onto more general questions for a bit (Rik when is the next talk scheduled ?) |
riel | Next talk will be tomorrow at 1800 UTC |
Alan2 | coywolf!jack@210.83.202.168] why do you think the windows GUI is far faster than the linux GUI?
Alan2 | coywolf: because they didn't attend my lecture 8)
Alan2 | coywolf: but you should go try xfce/rox even on a 32Mb PC 8)
Alan2 | sh0nX:#qc] since we're offtopic now: Alan2: Do you have a patch for the amd76x_pm module for 2.5.xx?
Alan2 | shonx: it shouldn't be very hard to port but I don't think anyone has ported it yet
Alan2 | (sh0nX:#qc] Alan2: im trying to port it right now)
Alan2 | shonx: cool
Alan2 | sapan:#qc] Alan2: I have an iPAQ with familiar running 2.4.18-rmk - if I were to optimize things in the kernel/apps in general, what should I be looking at?
Alan2 | sapan: I'm actually not that familiar with the ARM internals. The same general things should apply
Alan2 | sapan: obviously there are other considerations on a handheld too - lack of a disk, power saving etc |
Alan2 | E0x:#qc] Alan2 what is your preferred processor?
Alan2 | E0x: this varies. I love the raw speed of the Athlon but hate the reliability and the heat problems
Alan2 | At the moment I am playing with VIA C3/VIA Eden processors - which are quite slow but are designed to be very power efficient - no fan needed |
Alan2 | this makes for very quiet and cheap systems |
Alan2 | plus small boards people can do crazy things with - like put them into old sparc boxes, or even a gas can |
Alan2 | (www.mini-itx.com) |
Alan2 | ridiculum] what's your opinion about itanium2? is it better than hammer?
Alan2 | ridiculum: right now I am betting firmly on the hammer
Alan | As to why the athlon reliability is a problem Im not sure - I've had real problems with getting reliable memory on the dual athlon, heat problems and a lot of hardware incompatibility |
Alan | but it does go awfully fast once it works |
Alan | apuigsech:#qc] Alan, in the GDT table we can find some null descriptors (not used) - is that to gain optimization on cache memory usage?
Alan | apui: actually several of those gaps are because we used to use them for things and wanted to keep some data the same, others have fixed values required by standards, or for windows bug compatibility in the bios - so not the cache this time |
Alan | rene] (so that CPUs don't trample on each other's cache lines)
Alan | rene: we have to space some things out for that |
Alan | rene: One example is that the kernel has a structure that describes each page of memory |
Heimy | rene] (so that CPUs don't trample on each other's cache lines)
Alan | Various people went to great pains to make that structure exactly 64 bytes on a PC
Heimy | ooops |
Alan | sh0nX] Alan: do you visit #kernelnewbies? :) |
Alan | shonx: not often enough - its a really important project
jmgv | <davej> folks interested in the prefetching stuff Alan talked about may find the presentation at http://208.15.46.63/events/gdc2002.htm interesting |
riel | ok, the questions seem to be slowing down |
riel | I guess it's time to wrap up the "official" part of this talk |
riel | I'd like to thank Alan for this interesting talk |
riel | and I'd like to remind everybody else of the other UMEET lectures we'll still have |
jmgv | we thank Alan Cox for his efforts
riel | you can see the full program here http://umeet.uninet.edu/umeet2002/english/prog.eng.html |
MJesus | clap clap clap clap clap clap clap clap clap clap |
riel | clap clap clap clap clap clap clap clap |
Ston | clap clap clap clap clap clap clap clap |
jmgv | clap clap clap clap clap clap |
rp | clap clap clap |
sh0nX | clap clap clap clap clap clap |
angelLuis | plas plas plas plas plas plas plas plas plas |
mulix | clap clap clap clap clap clap clap clap |
mips | hahaha |
casanegra | clap clap clap |
varocho | clap clap clap |
bit0 | clap clap clap clap |
apuigsech | x) |
NiX | clap clap clas clap clap |
jacobo | clap |
ms | clap clap |
sarnold | clap clap clap clap clap :)) |
rp | great one |
jeffpc | clap clap clap clap clap clap clap clap clap clap clap |
HPotter | plas plas plas |
mulix | *-* *-* *-* *-* *-* *-* |
_Josh_ | alan rules!!! |
Geryon | great talk :) |
Karina | clap clap clap and more clap :)
Chico | plas plas plas plas plas plas plas plas plas plas plas plas plas plas plas |
angelLuis | torero! bravo!!!! |
error27 | clap clap clap |
Baldor | clap clap clap clap clap |
sh0nX | clap clap clap clap clap clap (2 more times) |
drizzd | clap clap clap clap clap clap clap clap clap |
BorZung | plas plas plas plas plas plas plas plas plas |
_Yep_ | thanks |
mcp | oh what braindead people |
ibid | clap clap clap |
sapan | clap clap |
EleTROn | VIVA Alan ! |
NiX | congratulations!
angelLuis | :)) |
Chico | Good
pask | docelic JUAS JUAS JUAS |
EleTROn | Alan at the Forum Internacional de Software Livre in Brazil 2003
Heimy | clap clap clap clap clap clap clap clap clap clap |
EleTROn | VIVA |
Heimy | (sorry, I was translating) :-) |
mcp | sarnold: hehe |
sarnold | ... if only alan hadn't had lag problems... i guess NTL hasn't fixed all his problems. :( |
sh0nX | hehe |
Ston | errr |
rp | is alann coming back |
MJesus | ¡¡Viva Alan!! |
rp | s/alann/Alan |
Ston | where is he?
sh0nX | hey now |
casanegra | nu ce :S |
sarnold | I'd like to mention that Milton's Cisco presentation has been replaced by james morris; he will be presenting on the new 2.5 kernel cryptography support |
angelLuis | has he missed the applause???
raciel | good talk Alan! |
mips | EleTROn: I arrived and the guy had just finished talking
mips | hahahaha |
sh0nX | :-) |
mips | rubbish
EleTROn | <mips> hauhauaha |
mips | only saw the msg now.
riel | MJesus: fast action |
mips | that barbanegra sent me
sarnold | MJesus: nice :) |
sh0nX | I'd like to thank the UMEET people for getting Alan to speak today :-) |
EleTROn | <mips> it was great
angelLuis | MJesus: very good!! |
sh0nX | it was very informative, and I learned a lot more about SMP :) |
rp | clap for UMEET |
Chico | very nice, Mª Jesus |
mips | EleTROn: I'm not going to die over it =) I'm no great fan of those crazy guys
EleTROn | mips: me neither :)
angelLuis | hurra for UniNet.edu!!!!!! |
* riel knows the netmask of the real alan |
sh0nX | heh |
Ston | riel: where is Alan ? |
angelLuis | riel: :)) |
sh0nX | riel: i think it was visible before |
* rp does not know netmask of real alan |
mulix | imitation is the sincerest form of flattery |
sarnold | Ston: probably ping timeout :( |
riel | Ston: at home, probably eating something now |
sh0nX | but im not going to mention it |
sarnold | mulix: except in the case of coywolf :-/ |
jacobo | mulix: it depends on the quality of the imitation ;) |
Ston | jejeje ok =) |
freddy | Is there anyone from Mexico here?
riel | he must be hungry after two hours of presentation |
Megatron | me
jacobo | bye |
rp | how long was the *full* presentation?
freddy | Aren't you by any chance David Limon?
debUgo- | talking makes him thirsty? |
juan | has the conference finished? |
sarnold | rp: about 2.25 hours |
debUgo- | X) |
mips | EleTROn: que é? |
bit0 | juan: yes |
riel | debUgo-: dunno about Alan, but it usually works for me ;) |
sarnold | juan: alan's presentation is over, but there is still one more week of uninet presentations. :) |
debUgo- | heheh |
sh0nX | :) |
*** Zeno (fltak@zeno.student.utwente.nl) Quit (Lost terminal) |
Ston | the average number of people in the channel during the talk was 260 hehehe xD
Ston | the highest number I saw was 280 xD
debUgo- | riel: at least that you speak as you type (too) heh |
Megatron | freddy sip |
jmgv | really good |
Heimy | Well. |
debUgo- | that would be funny |
Heimy | I dunno if he's thirsty |
MJesus | and in #redes more than 100 additional people
Heimy | But his wrists should be in pain right now :P
mips | huh |
Ston | MJesus: 123 ;-) |
rp | who got Alan to give this talk? |
Ston | uh 132 :) |
riel | Heimy: that happened to me, after my presentation |
drizzd | Heimy: you can tell, hmm? |
Heimy | :-)) |
riel | Heimy: I just had to go away from the keyboard for a while ;) |
Heimy | drizzd: Me? Why? |
sh0nX | riel ;-) |
MJesus | for the translators:
drizzd | Heimy: because you had to type as much as did |
MJesus | clap clap clap clap clap clap clap clap clap clap |
Heimy | I only translated half of his presentation :-) |
drizzd | s/as/he |
pask | its enough? |
jmgv | rp: umeet got Alan. at umeet there are no individuals, umeet is a group
MJesus | translators to Spanish: arador, jacobo and heimy (with vizard)
pask | clap clop clup |
MJesus | clap clap clap clap clap clap clap clap clap clap |