tarzeauwhy did alan remove the "printer is on fire" joke from the kernel?
zwaneit might be worth people also looking at stuff like http://www.amd.com/us-en/assets/content_type/white_papers_and_tech_docs/22007.pdf
viZardobiwan, yes, is ok
sh0nXAlan_Q: How do we compare Athlon XP's to MP's with optimization? Since they are very close to being identical processors?
muli          speed of light really is too slow nowadays --- snarfed by the sigmonster
d33pcan the cache be directly manipulated by a programmer? I would have thought it couldn't be.
zwaned33p: yes, we can prefetch, forced invalidate etc
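To make "we can prefetch" concrete, here is a minimal C sketch. It assumes GCC or Clang, whose `__builtin_prefetch` builtin emits a prefetch hint where the target has one. The function name `sum_with_prefetch` and the distance `PREFETCH_AHEAD` are made up for illustration and are not taken from the lecture demos.

```c
#include <stddef.h>

/* How many elements ahead to hint. This value is a guess for illustration;
 * a useful distance has to be measured on the machine in question. */
#define PREFETCH_AHEAD 16

/* Sum an array while hinting to the cache which element is needed soon. */
long sum_with_prefetch(const long *data, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++) {
        if (i + PREFETCH_AHEAD < n) {
            /* GCC/Clang builtin: address, rw (0 = read), locality (0..3). */
            __builtin_prefetch(&data[i + PREFETCH_AHEAD], 0, 1);
        }
        total += data[i];
    }
    return total;
}
```

The hint never changes the result, only (possibly) the timing. A plain sequential walk like this one is also the pattern hardware prefetchers tend to handle well on their own, so any gain here may be small.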
d33pAlan_Q: okay
obiwanAlan_Q : When we compile a program to be optimized for a specific processor, it won't work on lower-end processors (if I'm not mistaken). So if we prepare a binary to be used on many computers (say when making a distro CD), is it always necessary to take the lowest common denominator (i386 perhaps)?
Jimzybut since caching technology undergoes changes, is it really smart to program based on how the cache works? since it could change?
muliobiwan, yes... here the problem is really asm instruction families
rielJimzy: it will always remain "too small", some things never change
SnakeFoocAlan_Q: some recommendation for the novices of linux?
muliif the old processor does not support some of the instructions in the binary, because you compiled for a newer cpu, you're out of luck
muliso you need to compile for the lowest common denominator - i386 in the x86 case.
obiwanbut if we want to run processor-intensive stuff, it would be wise to squeeze every ounce of performance from the latest and greatest processor right? that is the point of optimization, if I'm not mistaken
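One possible middle ground between "compile for i386" and "compile for the newest CPU", not something proposed in the channel, is to ship a baseline routine plus a tuned one and pick between them at run time. The sketch below assumes GCC or Clang on x86, where `__builtin_cpu_supports` exists. The names `cpu_has_sse2`, `sum_baseline`, `sum_tuned` and `pick_sum` are hypothetical, and `sum_tuned` is only a stand-in for code that would really be built with newer instruction-set flags.

```c
#include <stddef.h>

/* Returns 1 if the running CPU reports SSE2, 0 otherwise.
 * The builtins exist only for x86 targets in GCC/Clang, hence the guard. */
int cpu_has_sse2(void)
{
#if defined(__GNUC__) && (defined(__i386__) || defined(__x86_64__))
    __builtin_cpu_init();
    return __builtin_cpu_supports("sse2") ? 1 : 0;
#else
    return 0;
#endif
}

/* Baseline version: safe on the lowest common denominator. */
long sum_baseline(const int *v, size_t n)
{
    long total = 0;
    for (size_t i = 0; i < n; i++)
        total += v[i];
    return total;
}

/* Stand-in for a variant compiled with newer instructions (assumption:
 * in a real build this would live in a separately compiled file). */
long sum_tuned(const int *v, size_t n)
{
    return sum_baseline(v, n);
}

typedef long (*sum_fn)(const int *, size_t);

/* Choose the implementation once, based on what the CPU reports. */
sum_fn pick_sum(void)
{
    return cpu_has_sse2() ? sum_tuned : sum_baseline;
}
```

A caller would do `sum_fn f = pick_sum();` once at startup and then call `f(v, n)` in the hot path, so the binary still runs on the older processor.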
rielSnakeFooc: not related to the topic of the lecture, please ask afterwards
tarzeauhow are these blocks for sparc or powerpc comparing to x86?
tarzeauwhat about dos? that runs not in protected mode, but linux is in pm
tarzeaudoes it matter?
rpi am running demo1 with ./a.out and cannot see anything printed
d33prp: same here
ifvoidhow large are the P4 L1 and L2 cache?
rpd33p wait i got it
Alan_Qrp: you may need to wait a while or make the loop smaller for slower processors
tarzeauifvoid: i wonder how's it for sparc/ultrasparc and powerpc
runlevel0[ Alan ] So what demo1 does is much like a large number of perfectly normal applications... but I had better not run it right now, I'm compiling X ;)
tarzeauifvoid: g3 and g4 that is
rpStep of 1 across 4K took 80 seconds.
rpthis is what I got
sh0nXAlan_Q: What is the best/average number of hits in (%) the kernel can get with the processor caches?
davejifvoid: http://www.codemonkey.org.uk/x86info/results/Intel/pentium4-northwood-HT.txt
Alan_Qrp: make it run 1/0th of the number of times 8)
Alan_Q1/0 -> 1/10
rpAlan_Q How???
sarnoldrp: remove one of the zeros from the for loop
d33pAlan_Q: which loop do we make smaller?
sarnoldd33p: (the inner for loop)
rpyes I have removed a zero and recompiled
rpStep of 1 across 4K took 8 seconds.
fetsStep of 32 across 128K took 4 seconds.
fetsStep of 64 across 256K took 10 seconds.
fetsStep of 128 across 512K took 36 seconds.
fetsthis is a xeon ;-)
rpI have AMD k6-II
sh0nXStep of 1 across 4K took 29 seconds.
sh0nX(in X with KDE) so its gonna take longer
Aradorsh0nX: what machine?
sh0nXAthlon MP 2000+
docelichere, all up to 64k takes 5 secs, then 128: 7 sec, 256: 19 sec
sh0nXUP right now
sh0nXStep of 2 across 8K took 26 seconds.
sh0nXStep of 4 across 16K took 25 seconds.
smesjzStep of 1 across 4K took 50 seconds. (p133/2.5.49) :)
sh0nXStep of 8 across 32K took 28 seconds.
sh0nXStep of 16 across 64K took 26 seconds.
sh0nXStep of 32 across 128K took 32 seconds.
sh0nXim sure it would be lower in single user mode
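The source of demo1 is not shown in this log, so the following is only a guess at its general shape, inferred from the "Step of N across MK" output: the number of bytes touched stays constant while the step and the area they are spread over both double. The function name `time_strided_walk` and its parameters are assumptions for illustration.

```c
#include <stddef.h>
#include <stdlib.h>
#include <time.h>

/* Touch one byte every `step` bytes across a buffer of `size` bytes,
 * repeating `passes` times. Returns elapsed CPU seconds, or -1.0 on a
 * bad argument or allocation failure. `checksum` may be NULL. */
double time_strided_walk(size_t size, size_t step, long passes,
                         unsigned long *checksum)
{
    if (size == 0 || step == 0)
        return -1.0;

    /* volatile keeps the compiler from optimising the accesses away */
    volatile unsigned char *buf = calloc(size, 1);
    if (buf == NULL)
        return -1.0;

    unsigned long sum = 0;
    clock_t start = clock();
    for (long p = 0; p < passes; p++) {
        for (size_t i = 0; i < size; i += step) {
            buf[i]++;
            sum += buf[i];
        }
    }
    clock_t end = clock();

    free((void *)buf);
    if (checksum != NULL)
        *checksum = sum;
    return (double)(end - start) / CLOCKS_PER_SEC;
}
```

Calling it as `time_strided_walk(4096 * k, k, passes, NULL)` for k = 1, 2, 4, ... 128 would give the same series of sizes as the results pasted above. As in the channel, the pass count has to be tuned to the machine by removing or adding a zero.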
coredsh0nX do you test the demo2.c program?
tarzeaucan someone run these tests on powerpc/sparc/ultrasparc?
rpReference run one took 29 seconds.
sh0nXReference run one took 4 seconds.
sh0nX64K table size took 5 seconds.
sh0nX128K table size took 7 seconds.
paranoiddis it possible to use the Linux SMP implementation of Intel multiprocessor as base for an implementation of asynchronous MP? (e.g. using the sound card's DSP)
sh0nX256K table size took 15 seconds.
coredmy pentium mmx is slow :(
sh0nX512K table size took 19 seconds.
coredi have to change the loop to 100 i think
sh0nXif I run this on my Pentium 233MMX
sh0nXit'll be much slower
smesjzAlan_Q: it surprises me that 5 out of demo1 tests returns 50 seconds runtime (100k iterations)
sh0nX1024K table size took 22 seconds.
sh0nX2048K table size took 22 seconds.
sh0nX4096K table size took 21 seconds.
zwaneparanoidd: nope
rpcan someone please explain exact meaning of 1024 tablesize?
Heimytarzeau: maybe we could use vore for the sparc test. It's doing nothing right now }:)
rpi mean tablesize
sh0nX4096 took LESS than 2048?!
tarzeauHeimy: vore! heimy! that's debian :)
Heimytarzeau: yep O:)
SnakeFoocI live in Argentina, so it is rather difficult for me to buy a Pentium IV, the price is sky-high
rpshall I assume that the lookup tables are in main memory
SnakeFoocAlan_Q: I live in Argentina, so it is rather difficult for me to buy a Pentium IV, the price is sky-high
HeimySnakeFooc: erm...
Alan_Qsnake: they arent cheap here either
ifvoidtarzeau: this is demo1 on alpha:
ifvoidStep of 1 across 4K took 4 seconds.
ifvoidStep of 2 across 8K took 4 seconds.
ifvoidStep of 4 across 16K took 4 seconds.
ifvoidStep of 8 across 32K took 1 seconds.
ifvoidStep of 16 across 64K took 2 seconds.
ifvoidStep of 32 across 128K took 8 seconds.
Alan_Qits not my PIV 8)
ifvoidStep of 64 across 256K took 11 seconds.
ifvoidStep of 128 across 512K took 10 seconds.
sh0nXifvoid: in single user?
ifvoidsh0nX: no
sh0nXin X?
psyifvoid, whats your processor?
BorZungStep of 16 across 64K took 19 seconds.
BorZungStep of 32 across 128K took 151 seconds.
ifvoidpsy: ev6 I think
SnakeFoocit was to me my keyboard
ifvoidbut it's a 4-proc machine, with a load of about 2 atm
psymine took 27 secs at all steps
ifvoid(I removed one 0 btw)
sh0nXAlan_Q: that explains why higher buffers took the same amount of time on my machine.
ridiculumis there any tool in linux like VTune (intel) to debug things like alan is explain?
psyifvoid, how?
ifvoidpsy: ?
psyyou removed one 0
ifvoidfrom the inner loop
psyhow did u?
sh0nXoh :)
ridiculummaybe explaining ? (my english is not good)
ifvoidso, multiply the run times by 10
psyi see heh
SnakeFoocridiculum, what's up
SnakeFoocAlan_Q: I live in Argentina, so it is rather difficult for me to buy a Pentium IV, the price is sky-high
ifvoidhmm, fiddling with the compiler options makes a lot of difference for demo1
sh0nXAlan_Q: I always thought having data aligned on 512/1024, etc was a good thing, since those are common alignments and should be easier for the processor to split things into?
ifvoidI wonder why
sarnoldSnakeFooc: alan responded .. he said they are expensive for him too
Alan_Qifvoid: always use -O2 or -O3 for those tests, you want to measure the cpu not the compiler 8)
rpi still don't get what *exactly* are lookup tables...
sapanAlan_Q: is demo1.c also a statement in favour of implementing a limited number of services, like httpd, in the kernel? since then send loops etc. would be much tighter?
rpcan someone point me to an URL explaining those
avoozlvalgrind also might be interesting to look at, the latest version also can do cache simulation and show where in a program cache misses are occurring
psywell, i gotta go
tarzeauwere there cpu's without cache? intels 80286?
Heimyrp: a table with precalculated data
d33pwhat about gprof that comes with the GNU binutils?
psygonna check the log later, see you
sarnoldtarzeau: 486s were frequently sold without cache to make them cheaper :)
tarzeaui wonder what's that altivec stuff on powerpc's
rene386 also had no onboard cache.
zwanetarzeau: you have a cache
Heimyrp: Sometimes, if you want to speed up some expensive repetitive calculations, you can make a table of precalculated data
tarzeaurene: heck some didn't have a fpu inside
Heimyrp: And just look at that table for data
rpHeimy: ok
rpHeimy: ok...getting it
tarzeauzwane: i have a cache?
runlevel0sarnold: I have an old Pentium board with this external 256 k caches
Heimyrp: But looking at that table is sometimes even more expensive than the calculation itself
zwanetarzeau: you're thinking of no L2
sh0nXthe SX is without co-pro
Heimyrp: except if it fits in cache
tarzeaubut why does the cpu access 16 bytes of l1/l2 cache? the registers are 32bit since 386.. and still they are only 32bit on intel
rpHeimy: and usually the tables are in main memory, right?
aka_mc2QUESTION:  but what is the alternative to the current slow cache?? ;)
Heimyrp: exactly. If they're in cache memory, then it will be very fast to look up those tables
rpHeimy: oh! got it now...thanks
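As a small illustration of what Heimy describes, here is a C sketch of a precalculated table next to the calculation it replaces. Counting set bits is an arbitrary choice of example, and all the names (`bit_count_table`, `init_bit_count_table`, `count_bits_calculated`, `count_bits_lookup`) are made up for this sketch.

```c
#include <stdint.h>

/* 256 one-byte entries: small enough that it should stay in L1 cache
 * on most processors when it is used often. */
static uint8_t bit_count_table[256];

/* Fill the table once: entry v holds the number of set bits in v. */
void init_bit_count_table(void)
{
    for (int v = 0; v < 256; v++) {
        int bits = 0;
        for (int x = v; x != 0; x >>= 1)
            bits += x & 1;
        bit_count_table[v] = (uint8_t)bits;
    }
}

/* The "expensive repetitive calculation": loop over every bit. */
int count_bits_calculated(uint32_t word)
{
    int bits = 0;
    for (; word != 0; word >>= 1)
        bits += (int)(word & 1u);
    return bits;
}

/* The lookup version: four table reads instead of up to 32 iterations. */
int count_bits_lookup(uint32_t word)
{
    return bit_count_table[word & 0xff]
         + bit_count_table[(word >> 8) & 0xff]
         + bit_count_table[(word >> 16) & 0xff]
         + bit_count_table[(word >> 24) & 0xff];
}
```

Whether the lookup wins depends on the point made above: a table that fits in cache is cheap to read, while a much larger table that has to come from main memory on each access can cost more than simply redoing the calculation.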
SnakeFoocAlan_Q: are the new kernels going to keep support for old machines? the new ones are very big and heavy, and that makes compiling much harder
renetarzeau: on Pentium (I, MMX, Pro, II, III) the cache line is 32 bytes, not 16. on p4 and athlon it's 64 bytes
IkarusPPRO actually came in 2 MB as well
tarzeauthose edo memory sticks were 60ns and 70ns, how many ns are l1 and l2 caches?
tarzeau(72pin thingies)
Ikarustarzeau: iirc, about 12 ns on Pentiums for L2
sh0nXi have 60ns EDO on this P233MMX
Ikarus(or at least on the ones I pulled apart)
sh0nXit DOES make a difference
tarzeaush0nX: i do in my sparc classic :)
|Seifer¿ Es posible que el GCC 3.2 tenga problemas al compilar ciertas cosas ? [Is it possible that GCC 3.2 has problems compiling certain things?]
aka_mc2DDR PC2100 can be 2.0 CAS latency
ifvoid|Seifer: could you please repeat the question in english?
debUgo-tarzeau: L1 and L2 caches run at full CPU speed in current processors, so, do the math ;)
sarnoldSnakeFooc: no compiler is perfect
sh0nXRun 1 (silly way) took 16 seconds.
tarzeaudebUgo-: so we lose a lot of time between cpu and memory and memory and harddisk! :( umf
sh0nXRun 2 (smart way) took 10 seconds.
sh0nXRun 3 (live data only) took 5 seconds.
ifvoidRun 1 (silly way) took 5 seconds.
ifvoidRun 2 (smart way) took 2 seconds.
ifvoidRun 3 (live data only) took 0 seconds.
SnakeFoocsarnold for ?
tarzeaui lose most time turning on/off my computer
Ikarusifvoid: it optimised the third run into oblivion ?
sh0nXifvoid: what processor do you have?
sh0nXi used no optimization
debUgo-tarzeau: harddisk speed really sux =P
tarzeaudoes rms and linus also give talks on irc?
sh0nXi could go higher
debUgo-Alan_Q: how much does cache associativity affect memory performance in general?
ridiculumSnakeFooc not all kernels compile with gcc 3.2. i think the official compiler is still 2.95
rielrp: so your CPU spent 22 seconds playing with memory
d33pheh right ojn, compiling with optimisations is seriously skewing those results and the relative difference factor  also decreased
rielrp: and only 18 seconds on the actual calculation
rpriel: 22  sec??? how
rielrp: yes, it spends more time waiting for memory than doing something useful
rielonki: in your case it spent 7 times as much time waiting for memory as it spent doing real work
onkiyeah, I see
rpwas it because I have lot of cache (512)
onkiI only have 128
ifvoidAlan_Q: won't that change for the Hammer and Itanium 3?
rpso does that mean memory performance does not depend only on processor but also on bus speed?
runlevel0<Alan>So a dual processor machine gives us twice the problem  :so this explains why we do not get 2x the performance of 1 processor
bzzzAlan_Q: could you describe the coherency-related problems x86 has with cache?
debUgo-AFAIK, dual Athlon have dual bus (a bus for each CPU)
ridiculumwhat about hyperthreading and cache coherence?
sh0nXso this is where spinlocks come in
sh0nXahh ok :)
sh0nXthanks Alan :)
ifvoidsh0nX: what's a spinlock?
E0xsi es asi cual seria la ventaja de al final de los servers duales ? [if that is so, what would be the advantage of dual servers in the end?]
docelicI'd appreciate more on spinlocks too
sarnoldifvoid: a processor just spins waiting for a resource to become available in a "busywait" loop
bzzzAlan_Q: how can pci devices see data which is in cache only?
sh0nXifvoid: what sarnold said
AradorE0x: what's the advantage of dual servers then?
sh0nXspinlocks let the kernel use SMP
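For readers wondering what the "busywait" loop sarnold describes looks like, here is a minimal userland sketch using C11 atomics. It is not the kernel's spinlock implementation, and the names `demo_spinlock`, `demo_spin_lock`, `demo_spin_trylock` and `demo_spin_unlock` are invented for this example.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* A lock is just one flag: set means "held", clear means "free". */
typedef struct {
    atomic_flag locked;
} demo_spinlock;

#define DEMO_SPINLOCK_INIT { ATOMIC_FLAG_INIT }

/* Spin until we are the one who changed the flag from clear to set. */
void demo_spin_lock(demo_spinlock *l)
{
    while (atomic_flag_test_and_set_explicit(&l->locked,
                                             memory_order_acquire)) {
        /* busy-wait: another CPU holds the lock */
    }
}

/* Single attempt: returns true if the lock was taken, false if it was held. */
bool demo_spin_trylock(demo_spinlock *l)
{
    return !atomic_flag_test_and_set_explicit(&l->locked,
                                              memory_order_acquire);
}

/* Release the lock so a spinning CPU can take it. */
void demo_spin_unlock(demo_spinlock *l)
{
    atomic_flag_clear_explicit(&l->locked, memory_order_release);
}
```

Every failed test-and-set in that loop is a write attempt on the cache line holding the flag, which has to move between processors. That ties in with the lecture's point about two CPUs working on the same data.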
rpAlan says :  One thing the processors have to do is.......
rpisn't it the job of OS?
sarnoldrp: on some systems, yes
ridiculumcache coherence is a hardware problem
rpoh! ok
Aradorwhat about preempt, can preempt cause more cache misses?
tarzeaui noticed linux on my sparc classic (50mhz) is horribly slow compared to netbsd (1.6.x)
rene"we" have to kick? we as in the OS, or is that hardware-automatic (on x86)
sh0nXAlan_Q: so we should be using the cache for SMP processors to keep data that isnt going to change much and use the processors to handle data that does change often?
fetsalan: (ibm) this is in the -summit kernels ?
rpsh0nX: SMP?
sh0nXrp: multiprocessor
sarnoldrp: Symmetric MultiProcessor
sh0nXAlan_Q: thats terrible :/
acmeAlan_Q:  does any of the standard libjpeg, libtiff, libpng, etc take advantage of SMP in the fashion you described?
tarzeauridiculum: yyes there's sparc v7,8,9
Ikarusridiculum: hold on, sparc is different, it doesn't have FPU emu in the kernel
tarzeauapropos .. what about that mmu-less stuff? can i run linux on my amiga 1200 (standard) one soon?
sh0nXAlan_Q: so when designing SMP applications, how do we tell which processor to handle which data without causing the processors to both handle the same data?
sh0nXin the kernel we use spinlocks
sh0nXbut in userland i dont know how that works
yaluAlan_Q: is the scheduler smart enough to keep threads who share a lot of data on the same processor?
ridiculumsh0nX semaphores
sh0nXridiculum: so threads basically
ridiculumsh0nX or processes. you can have 2 processes (or more) with shared memory
zwaneAlan_Q: All this must get really interesting with Hyperthreaded cpus
bvcdoes it sound right that i am unable to use protection map from a module?
sh0nXwe create threads in userland, and then the kernel handles this with its scheduling
sh0nXI see now how it fits together
Alan_Qsh0nx: yep
sh0nXah :)
zwaneAlan_Q: do you reckon scheduler only would suffice? How about leveraging cpu affinity for say doing bias in interrupt handling?
sh0nXso, if a program is written for UP, how does the kernel scheduler handle its data on two CPUs? or it can't
zwaneAlan_Q: so you wouldn't go the RR ioapic interrupt distribution way as in 2.5? Or has this changed
mulish0nX, what does "written for UP" mean? single process?
sarnoldAlan_Q: does linux currently have a mechanism to specify that all interrupts should be handled by a specific [set of] CPUs?
sh0nXI see, so we have to use threads in our code in order to benifit SMP
sh0nXbenefit even
mulish0nX, threads or multiple processes
muliif you have just one thread of execution, nothing can split it up to multiple cpus for you
sarnoldsh0nX: or just run enough applications on the machine....
sh0nXmuli: forked apps?
sh0nXI see
sarnoldAlan_Q: is it worth prefetching the next pointer one is going to follow?
rielthat list structure will make for an interesting list_add() ;)
Alan_Qbut gcc doesnt always do a good job 8(
grifferzAlan_Q: how applicable are things like prefetch to userland programming without knowing about the hardware?  e.g. is it possible that doing some prefetch that speeds up something for a 2 CPU x86 system will actually cause worse performance on a 4 CPU sparc?
rielmuli: often it cannot do much
mulione more question, is there a "lowest common denominator" cache behaviour, or is it possible that optimizing for one cpu will be a pessimization on another?
rielmuli: because you tell the computer what to do
mulidid anyone measure / think about the effects of kernel preemption on cache usage?
renewouldn't turning lists into "lists of little arrays" (dynamically adding a block when required) help with the prefetching for lists?
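rene's "lists of little arrays" idea can be sketched in C as follows. This is only an illustration of the data structure being asked about, not code from the lecture or the kernel. The names (`struct chunk`, `chunk_list_append`, `chunk_list_sum`, `chunk_list_free`) and the chunk size are assumptions, and `__builtin_prefetch` is a GCC/Clang builtin.

```c
#include <stddef.h>
#include <stdlib.h>

/* Items per block: an arbitrary size picked for illustration. */
#define CHUNK_ITEMS 14

/* One list node holds a small array instead of a single item. */
struct chunk {
    struct chunk *next;
    size_t used;
    long items[CHUNK_ITEMS];
};

struct chunk_list {
    struct chunk *head;
    struct chunk *tail;
};

/* Append a value, adding a new block only when the last one is full.
 * Returns 0 on success, -1 if allocation fails. */
int chunk_list_append(struct chunk_list *list, long value)
{
    struct chunk *c = list->tail;
    if (c == NULL || c->used == CHUNK_ITEMS) {
        c = calloc(1, sizeof *c);
        if (c == NULL)
            return -1;
        if (list->tail != NULL)
            list->tail->next = c;
        else
            list->head = c;
        list->tail = c;
    }
    c->items[c->used++] = value;
    return 0;
}

/* Walk the list, hinting the next block while working on this one. */
long chunk_list_sum(const struct chunk_list *list)
{
    long total = 0;
    for (const struct chunk *c = list->head; c != NULL; c = c->next) {
        if (c->next != NULL)
            __builtin_prefetch(c->next, 0, 1);
        for (size_t i = 0; i < c->used; i++)
            total += c->items[i];
    }
    return total;
}

/* Free every block and reset the list to empty. */
void chunk_list_free(struct chunk_list *list)
{
    struct chunk *c = list->head;
    while (c != NULL) {
        struct chunk *next = c->next;
        free(c);
        c = next;
    }
    list->head = NULL;
    list->tail = NULL;
}
```

The intent is that there is some real work (the inner loop over `items`) between issuing the prefetch and following the pointer, which a one-item-per-node list does not offer. Whether that helps in practice would need measuring.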
sh0nXI assume we use some sort of spinlock to prevent another processor from prefetching the same data?
Alan_Quggh lagging
* sh0nX is beginning to understand how this all works slowly
sh0nXnow if i can figure out the kernel API ;)
Ikarusrecursivity, fun
sh0nXthats bad
sarnoldsh0nX: a good resource for more details is a book by Curt Schimmel, UNIX Systems for Modern Architectures: Symmetric Multiprocessing and Caching for Kernel Programmers
sh0nXsarnold :)
rielsarnold: a very nice book, indeed
sh0nXsarnold: once i get to understand how 2.5/2.6 works, then I can dive in more
sarnoldAlan2: i've wondered if prefetching cuts memory bandwidth significantly.. have people tested with prefetch config'ed away?
AradorAlan2: what're the effects of preempt on caching?
docelicso umeet is the place to look for similar events? some other places maybe ?
bitlandcould implementation of UMA (Unified Memory Architecture, in some Silicon Graphics machines) in mainboard chipsets be a better solution for most problems like these? (excuse my bad english) :)
sh0nXI always thought we wanted to flag prefetched data as being prefetched already, im surprised another processor will fetch it again.
sh0nXeven if it doesnt seem like much of a performance penalty doing so
tarzeaudocelic: i can't wait to see rms or linus talk on irc
sklavi was wondering what effect using optimization -O3 or -O5 have on the kernel?
tarzeaualan what irc client did you use?
sh0nXsince Alan mentioned we dont want to work on the same data on both processors prefetching the (same) data seems to contradict this?
aka_mc2ALAN: do you think the Crusoe processor, which Linux supports, will be considered for all these programming techniques??
Aradorsklav: AFAIK, -ON where N>3 means -O3....(don't know if it happens nowadays)
sklavArador: by default the kernel itself uses -O2
sklavbut it can be changed in the Makefile to -O3 and so on
sklavBut I'm not sure if this causes other problems
sklavLike a performance hit
rieldocelic: maybe #kernelnewbies could organise some isolated lectures throughout the year
sklavi have noticed higher load averages after i use a kernel with -O3 and or -O5
rieldocelic: but as far as I'm concerned, UMEET is the place to cluster a bunch of lectures ;)
mulixriel, that would be a great idea
mulixlike a biweekly or monthly lecture from the kernel guru of your choice :-)
sh0nXriel: yes
tarzeaumulix: yeah i'd like that too
sh0nXriel: I'd especially like to learn how to use PnP on 2.5 ;-)
aka_mc2ALAN: is there a future alternative to cache memory? (some other system of fast data access, perhaps...)
E0xdoes HT technology represent an advance on this problem .... ?
mattamAlan2: prediction's better than having a loop unrolled ?
sh0nXaka_mc2: parallel data cache? ;)
aka_mc2ok, thanks shnox
IkarusAlan: do you think a L3 cache shared between SMP processors would give a significant performance benefit
sh0nXhow about multiprocessed cache?
sh0nXaka_mc2: im just guessing
sh0nXa cache that is smart enough to decide whats needed for the processors
jacoboI wouldn't expect that for one of mine ;)
tarzeauoh my god i've missed half of the talks
debUgo-Alan2: did you assign reliability problems in Athlon systems to the CPU itself, the supporting chipsets, or any other stuff?
Alan2(getting lots of lag problems)
Alan2I need to vanish very shortly too 8(
