Muli Ben-Yehuda's journal

July 31, 2003

Kai on the 2.5 kernel’s build system

Filed under: Uncategorized — Muli Ben-Yehuda @ 4:17 PM

Sat 13:30

what we did on Friday night:

After rik’s talk, I sat and hacked on shared page tables for a few hours, since there were no interesting talks/BOFs. Talked with Shawn Starr about adding sysfs support to ISA sound drivers, and with Orna and some other people about various other things. Eventually, people drifted off to the BOFs (Orna went to the Women in Linux BOF) and I went back to the room, finished my book (Greg Egan’s Diaspora; the ending was kind of a letdown) and went to sleep.

This morning I continued hacking on shared page tables, and talked with people, some in real life, most on IRC. I had a vivid demonstration of how much I learned at OLS when someone asked a question and I was able to answer, because this exact issue was discussed in the NUMA scheduler BOF. Hooray!

Around noon Orna wanted to get something to eat, and we went to the mall. Stopped in the music shop and bought tons of DVDs and some CDs, and then had fast food sushi. What a great invention! I hope someone imports it to .il.

Now listening to Kai talk about kbuild 2.5. Nothing new so far.

Digressing: on the shpte front, I merged almost all of the code. I have three rejects in rmap.c, where the patch doesn’t fit any of the vanilla kernels from 2.5.60 to 2.5.75. Yes, I checked:

$ export prevnum=60; for num in `seq 61 75`; do mkdir linux-2.5.$num; \
    cp -al linux-2.5.$prevnum/* linux-2.5.$num; (cd linux-2.5.$num && \
    bzcat ../patch-2.5.$num.bz2 | patch -p1); prevnum=$num; done

$ for num in `seq 60 75`; do (cd linux-2.5.$num && patch -p1 < … > ../$num.patchresults); done

$ grep -il rej *.patchresults

I need to find out what tree it’s against, or figure out from the patch snippets what the author meant. It’s code in the swap out path (try_to_unmap_one()), which I don’t know much about – yet.

Oh, Kai is talking about the interesting stuff now:

– make some/path/file.o: object file
– make some/path/file.i: preprocessed
– make some/path/file.s: assembler
– make some/path/file.lst: annotated asm (cool!)

– make SUBDIRS=drivers/isdn/hisax modules

Makefile changes from 2.4 – 2.6: say any given thing just once. “descend into subdir to build those files / descend into subdir to link those files into an object file / link the object file into the kernel image” can be said once with “descend into this subdir”.

After the talk, stepped out to email a reminder about August Penguin’s key signing party and see if Gilad put up the book crossing page – he did, great.

A couple of gpg people snagged me to do the identification ritual, including Olaf Kirch, of SuSE fame, who actually remembered me from our discussion re syscalltrack. I was too amazed to remember to ask him if SuSE might be interested in it. Damn!

Now sitting in the bugzilla BOF, where people are discussing the scary query page (a usability nightmare) and a mail interface to bugzilla. Neat.

Rik van Riel O(1) VM

Filed under: Uncategorized — Muli Ben-Yehuda @ 4:10 PM

Fri, 15:00

Rik van Riel, previously of Conectiva, now working for Red Hat, is talking about moving toward an O(1) VM.

Machines get faster, but many operations get slower

page_launder: cleaning up pages (writing them to disk) in order to evict them from memory.

Split the inactive list in order to avoid scanning the entire list:
– inactive dirty: pages might be clean or dirty
– inactive laundry: ?
– inactive clean: clean, just get rid of it

page aging: which pages from the active list to remove? 2.4-rmap uses LFRU approximation.

page aging: sort the active list based on the level of activity of pages. There are many lists of pages; pageout moves each list one level down (toward inactivity), and the page_referenced bit moves a page up one list.
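
To convince myself how the multi-list aging scheme behaves, here is a toy model in Python. It is only a sketch of the idea as I noted it, not the 2.4-rmap code: the names (AgingLists, touch, pageout_pass) and the choice of four levels are made up for illustration, and a real implementation would presumably shift list heads instead of walking every page.

```python
# Toy model of multi-list page aging. All names and NUM_LEVELS are
# hypothetical; this is an illustration, not kernel code.
NUM_LEVELS = 4


class AgingLists:
    def __init__(self):
        # levels[0] is the coldest list, levels[-1] the hottest.
        self.levels = [set() for _ in range(NUM_LEVELS)]
        self.referenced = set()

    def add(self, page, level=NUM_LEVELS - 1):
        """Put a page on one of the lists (hottest by default)."""
        self.levels[level].add(page)

    def touch(self, page):
        """Stand-in for the hardware setting the page's referenced bit."""
        self.referenced.add(page)

    def level_of(self, page):
        """Return the index of the list holding the page, or None."""
        for i, pages in enumerate(self.levels):
            if page in pages:
                return i
        return None

    def pageout_pass(self):
        """One pageout pass: referenced pages move up one list, all other
        pages move down one list. Returns the pages that fell off the
        coldest list, i.e. the eviction candidates."""
        new_levels = [set() for _ in range(NUM_LEVELS)]
        evict = set()
        for i, pages in enumerate(self.levels):
            for page in pages:
                if page in self.referenced:
                    new_levels[min(i + 1, NUM_LEVELS - 1)].add(page)
                elif i == 0:
                    evict.add(page)
                else:
                    new_levels[i - 1].add(page)
        self.levels = new_levels
        self.referenced.clear()
        return evict
```

A page that keeps getting touched stays near the top, while an idle page needs NUM_LEVELS passes to drift down and become an eviction candidate.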

reverse mapping fundamentals:

regular reverse mapping (per page):
– simple, no corner cases
– set it up on fork, tear it down on exec – overhead!

object based rmap:
– rmaps non-existing ptes as well (because it’s per vma)
– needs to be searched in pageout path
– only works for linearly mapped file backed objects
– nasty interactions with truncate, remap_file_pages

conclusions:
– bottlenecks keep moving
– access patterns keep changing
– computers, processes keep growing

VM needs to be adjusted:
– more intelligent writeout of dirty pages
– better replacement algorithms
– better search algorithms
– more scalable locking

O(1) VM is probably impossible, due to the problem space

Research required for a VM design that fits modern machines and workloads [maybe something I could work on…!]


Filed under: Uncategorized — Muli Ben-Yehuda @ 3:59 PM

Fri 10:05 AM

Greg KH is talking about “udev – devfs done right” (I wonder if rgooch [devfs maintainer] is here… No screams of rage so far, so I guess not)

problems with devfs:
– implementation (although hch cleaned it up in 2.5)
– no dynamic allocation

udev – combine /sbin/hotplug (notification whenever a device is added) and sysfs (information about devices) to do device addition/deletion to /dev and naming in user space.

very active crowd participation on this talk… I’m mostly tuned out, writing my notes. After the talk, sat outside, did the gpg identification ritual with a few more people, talked with Behdad, Orna and Pat Gaughen of IBM LTC in real life, and Con Kolivas on IRC. Freaky, the way IRC conversations and real life conversations blend in together.


Filed under: Uncategorized — Muli Ben-Yehuda @ 3:58 PM

Thu 20:15

LSE project.

Hanna Linder gave a short introduction to LSE, and then mjb talked about his -mjb tree.

mjb talks about the NUMA scheduler:
– basically, run the usual SMP scheduler inside each NUMA node
– bouncing a process happens on exec (moving procs between nodes is expensive if they end up with all of their memory allocated from the wrong node)

the discussion moved on to how to recognize if the process is forking and execing, or just forking (in order to know if we should migrate it or not). Suggestions include

– saving previous runs information (just like my persistent scheduler idea!) in various places, like /etc/database, or the inode, or in memory, or in the elf header.

– giving user space an API for telling the OS what it’s about to do (which strikes me as the right thing to do, for psched as well… does such an API exist for applications to say whether they are interactive or CPU bound? nice(1), I guess, as well as the real time priorities).

I wonder, is there a point in a general utility/library that profiles a process, to take into account system calls, page faults, any other events? a “histogram” of the process, to anticipate its future behavior.

I’m glad I went to this BOF. The scheduler discussion showed that my psched idea is flawed to begin with, before I spent too much time researching and implementing it. Oh well, there will be other ideas.

Asynch IO for 2.5 and IBM banquet dinner

Filed under: Uncategorized — Muli Ben-Yehuda @ 3:58 PM

Thu 16:30

Async IO for 2.5.

Walked out of this one pretty soon after it started, due to the speaker’s Indian accent and the light bulb that went up over my head – I can use xchg() to replace the syscall pointer atomically! Rushed out to finish my “how to port syscalltrack to 2.5” paper and email it to sct-hackers. Also did the identity ritual with a few more people.

Today was the banquet dinner, sponsored by IBM. Had it with sarnold, mharris@redhat and another Red Hat guy. The food was reasonable, the Alan Cox stories were superb.

dinner speech was by Ian Stewart, of grid fame, and I spent most of it thinking about projects I could do independent research on:

– continue developing syscalltrack.
– persistent scheduler.
– storage intrusion detection.

Need to think more about all of them.

Dave McCracken’s Shared Page Tables talk

Filed under: Uncategorized — Muli Ben-Yehuda @ 3:55 PM

Thu 15:00

Listening to Dave McCracken’s Shared Page Tables talk. This is the most interesting talk I’ve heard so far, not least because it’s something that I want to work on. [Later in the day, I did start working on it].

– shared memory areas mapped in many address spaces can take up more space in page table space than in data space.

– mm_struct: one per address space

– vma: one vma per mapped area per address space
  – linked list and tree anchored in mm_struct
  – describes a virtual address range and protection
  – reference to the backing file
  – anonymous vmas have no backing file

– page table: one page table for each address space
  – pointed to from mm_struct
  – three levels: pgd, pmd, pte
  – doubles as hardware page table for most archs

– one address_space structure per open file. struct address_space does not describe an address space! it describes a file…
  – anchors list of all vmas that map a region of the file
  – contains a page cache of all physical pages containing data from the file

– struct page: one per physical page
  – describes how the page is used
  – has a pointer to address_space if it’s mapping data from a file
  – all page structs live in mem_map
  – with rmap, has a back pointer (or array of back pointers) to all of the ptes that map the page

– to create a new memory area: either mmap or shmemap
  – all shmem is file backed, either explicitly or implicitly via shmfs (internal file system)
  – if a page is marked private and read_write, modified pages are converted to anonymous and backed by swap

– a page is only mapped when a task faults trying to access it – fault code finds the correct vma and pte entry, then finds and maps the page. if necessary, the pte page is allocated on the fly.

– mm subsystem has three primary locks:
  – read/write semaphore, mmap_sem in mm_struct, protects the vma chain. taken for read during a page fault, taken for write for mmap, for example
  – spinlock page_table_lock protects the page table
  – i_shared_sem in address_space protects a file’s vma chain. used to be a spinlock in 2.4, turned into a semaphore in 2.5

– sharing pte pages:
  – overhead for singly mapped area is small
  – overhead for each area grows linearly with number of mappings
  – massively mapped areas could use more physical memory for page tables than for data pages
  – pte pages for large shared areas are identical in each address space
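
The "more memory for page tables than for data" claim is easy to check with some back-of-the-envelope arithmetic. The sketch below assumes 32-bit x86 without PAE (4 KB pages, 4-byte ptes, so one pte page maps 4 MB) and that every mapper has touched the whole area; those numbers and the function names are my assumptions, not something from the talk.

```python
# Rough arithmetic only. Assumes 32-bit x86 without PAE: 4 KB pages and
# 4-byte ptes, so one pte page holds 1024 ptes and maps 4 MB.
PAGE_SIZE = 4096
PTE_SIZE = 4
PTES_PER_PAGE = PAGE_SIZE // PTE_SIZE


def pte_pages_needed(area_bytes):
    """pte pages one address space needs to map the whole area."""
    data_pages = -(-area_bytes // PAGE_SIZE)    # ceiling division
    return -(-data_pages // PTES_PER_PAGE)


def pte_overhead_bytes(area_bytes, mappers, shared=False):
    """Memory spent on pte pages when `mappers` address spaces map the area.

    With shared=True a single set of pte pages serves every mapper.
    """
    copies = 1 if shared else mappers
    return copies * pte_pages_needed(area_bytes) * PAGE_SIZE


GIB = 1 << 30
per_mapper = pte_overhead_bytes(GIB, 1)             # 1 MiB of pte pages
break_even = pte_overhead_bytes(GIB, 1024)          # equals the 1 GiB of data
with_sharing = pte_overhead_bytes(GIB, 2000, shared=True)
```

Under these assumptions a 1 GiB area costs 1 MiB of pte pages per mapper, so past 1024 mappers the page tables outweigh the data, while sharing keeps the cost at 1 MiB no matter how many tasks map it.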

[shared segments which aren’t mapped in the same virtual addresses aren’t currently considered shared – TODO ;-)]

– finding shareable pages:
  – vma must be shareable, must span entire pte page
  – walk address_space chain of vmas looking for one mapping the range
  – check the pte page for each mapping vma to see if it can be shared
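
The first criterion ("must span entire pte page") boils down to alignment arithmetic. Here is a minimal sketch, again assuming a pte page maps 4 MB as on 32-bit x86 without PAE; shareable_spans is a hypothetical helper of mine, and it does not model the walk over the address_space vma chain or the per-vma pte page check.

```python
# Sketch of the "vma must span the entire pte page" test only.
# Assumes one pte page maps 4 MB; the function name is hypothetical.
PAGE_SIZE = 4096
PTE_PAGE_SPAN = 1024 * PAGE_SIZE   # address range covered by one pte page


def shareable_spans(vma_start, vma_end, vma_shared=True):
    """Return the (start, end) ranges, each one pte page wide, that lie
    completely inside the vma [vma_start, vma_end)."""
    if not vma_shared:
        return []
    # Round the start up and the end down to pte page boundaries.
    first = -(-vma_start // PTE_PAGE_SPAN) * PTE_PAGE_SPAN
    last = (vma_end // PTE_PAGE_SPAN) * PTE_PAGE_SPAN
    return [(addr, addr + PTE_PAGE_SPAN)
            for addr in range(first, last, PTE_PAGE_SPAN)]
```

A vma that starts or ends in the middle of a pte page loses the partial pages at its edges, which is why only large, well aligned shared areas benefit.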

– setting the pmd entry read-only allows you to do copy-on-write of pte pages?

[forks slowed down significantly in 2.5, due to rmap pte chains, and then shared pte sped that up again]

– locking changes:
  – page_table_lock breaks when pte pages are shared
  – new lock, pte_page_lock, protects the pte page

– complications:
  – reverse mapping includes pointer to mm_struct
  – shared page table pages may need pointers to multiple mm_structs
  – pointer had to be converted to a chain
  – several system calls may modify mappings and require unsharing pte pages

[philosophy: better safe than sorry, if not 100% sure that the sharing is correct, unshare it]

– primary motivation of the project is reduction of memory overhead [page tables live in lowmem]

– COW improves fork performance by a factor of 10
– unsharing costs as much as fork without COW, plus a little extra
– all programs unshare at least 3 pte pages
– small programs only have 3 pte pages
– simple hack is to not do COW for such programs (with only 3 pte pages)

– kernel compile showed no change when sharing pte pages
– applications with massively shared areas benefited indirectly from the extra available memory

– status: patch was stable in about mid-November last year
  – the patch is still there and dmc is still maintaining it
  – talk to dmc for his copy of the patch

During the break, met Jeff Dike in person, who told me that shared page tables should go into UML rather effortlessly, since the code is very similar in its organization, and also talked to dmc, who said that the patch in -mjb is pretty much up to date.

Bill Irwin’s PGCL (Page Clustering)

Filed under: Uncategorized — Muli Ben-Yehuda @ 3:41 PM

Thu, 13:30

Listening to Bill Irwin‘s PGCL (Page Clustering) talk now.

– teach the VM to handle “partial pages”
– why do you want to do this? structures sized based on memory take up less space.
– shrinking search structures containing pages (radix tree, LRU page replacement lists)
– ABI preserving variant is backward compatible
– Kernel Summit news: Linus is actually interested in something like this, but generic

Random thought, while my brain takes a break from trying to decipher what Bill is saying: going to OLS with Orna wasn’t the best idea – I’m interested in the talks and meeting people (which I’m not doing well at all, granted), and Orna is more interested in touring Ottawa, etc.

– early boot issues stemming from lots of places assuming that virtual pages are MMUPAGE_SIZE in size, where PGCL does MMUPAGE_SIZE != PAGE_SIZE.

Another random thought: this is one of the most interesting talks / projects presented here, but I’m having a hell of a time understanding Bill. Shame.

– MMUPAGE_SIZE is the physical size of pages
– PAGE_SIZE is the virtual size of pages
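
To see why "structures sized based on memory take up less space", here is a rough calculation of the mem_map size when PAGE_SIZE is a multiple of MMUPAGE_SIZE. The 4 KB MMUPAGE_SIZE, the 40-byte struct page and the function name are assumptions I picked for illustration, not figures from the talk or the paper.

```python
# Rough arithmetic only. MMUPAGE_SIZE, STRUCT_PAGE_BYTES and the function
# name are assumptions chosen for illustration.
MMUPAGE_SIZE = 4096        # assumed hardware (MMU) page size
STRUCT_PAGE_BYTES = 40     # assumed size of one struct page


def mem_map_bytes(ram_bytes, mmupages_per_page):
    """Size of mem_map when PAGE_SIZE = mmupages_per_page * MMUPAGE_SIZE."""
    page_size = MMUPAGE_SIZE * mmupages_per_page
    return (ram_bytes // page_size) * STRUCT_PAGE_BYTES


GIB = 1 << 30
MIB = 1 << 20
plain = mem_map_bytes(64 * GIB, 1)       # 640 MiB with PAGE_SIZE == MMUPAGE_SIZE
clustered = mem_map_bytes(64 * GIB, 8)   # 80 MiB with PAGE_SIZE == 8 * MMUPAGE_SIZE
```

With these made-up numbers, a 64 GiB machine spends 640 MiB on mem_map with plain 4 KB pages and 80 MiB with a clustering factor of 8, which matters a lot when mem_map has to live in lowmem.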

– Bill is talking about various bugs he had and how he solved them – combination of luck and looking around for hints, auditing code (always necessary when changing fundamental design assumptions…)

Another talk where reading the paper makes more sense than listening to the talk. Seems to be quite a lot of them, unfortunately. Good thing the corridor and pub discussions make up for it.


Filed under: Uncategorized — Muli Ben-Yehuda @ 3:39 PM

Thu. 12:30 AM

During the break between this talk and Dave Jones’s talk, helped Behdad Esfahbod with compiling the cipe module. Since it was b0rking with stuff related to module symbols, I just turned off module versions in the .config and that did the trick. Also talked to zwane about system call hijacking in

Afterward, I was walking around forlornly, looking for people to talk to and not knowing how to start, when Orna hit upon a brilliant idea. The symposium has a gpg key signing party, and participants are supposed to verify each other’s identity between the talks. Participants also wear a red dot on their badges, so that they can identify each other. Orna just looked around, found the first guy that had a red dot on his badge (Ryan, one of the debian developers), and started talking to him, using the gpg key signing “verification ritual” as a start. Brilliant! That’s what we did for the rest of the break, too. Right now, I have 20 or so of 130 identifications.

Listening to a talk on the Lustre distributed file system now:

– concentrating on integrations with the Linux VFS. Eventually, eschewed using the dcache completely since the VFS wants to lock directories, which is very bad for lustre.

– Uses object protocols (rather than block based protocols)

– gigabytes of debug information for simple I/O operations. How to make sense of so much data?

– use debug tools extensively – UML, gdb, mcore, netdump, crash, kgdb.

– debugging distributed systems is hard, but what else is new?

– how do you handle disk failures? we emphatically say: “that’s not our problem!”

– “Linux machines will beat out all of the EMC machines, netapps, etc for data storage”

– they have a single MDS (Meta Data Server). Obvious scalability bottleneck? yes, but a pretty far out bottleneck. MDS is a really quick, fast machine. Currently, the limit is about 5,000 file creations per second.

– “TCP/IP offload cards suck”

– does not scale down as well as it scales up, but they plan to work on it. The architecture should support it – would consider it a personal failure otherwise.

Very interesting talk, all in all.

Dave Jones Resurrecting Unmaintained Code

Filed under: Uncategorized — Muli Ben-Yehuda @ 3:36 PM

Thu, 10:00 AM

Listening to Dave Jones talk about resurrecting unmaintained code. This is one of the “fluffier” talks, which seems to have been accepted to OLS more because of who the speaker is than because of technical interest. Regardless, Dave is an entertaining speaker. Despite some technical difficulties at the beginning of the talk, it is going rather well.

So far, lots of common sense on how to write maintainable code. For example, write small functions that do one thing well (some of my coworkers should have this tattoo’ed on their foreheads) and “put different functionality in different .c files”. TODO: Code snippets to the contrary of these dictums would make good All Code Sucks posts.

The talk’s examples include the MTRR[0] driver and agpgart code.

I spent most of this talk thinking about and writing a “how to port syscalltrack’s system call hijacking code to 2.5” paper.

[0] Documentation/mtrr.txt: “On Intel P6 family processors (Pentium Pro, Pentium II and later) the Memory Type Range Registers (MTRRs) may be used to control processor access to memory ranges. This is most useful when you have a video (VGA) card on a PCI or AGP bus. Enabling write-combining allows bus write transfers to be combined into a larger transfer before bursting over the PCI/AGP bus. This can increase performance of image write operations 2.5 times or more.”

July 26, 2003

be back after the commercials

Filed under: Uncategorized — Muli Ben-Yehuda @ 11:09 PM

Network going down now. OLS great. Be back on Wed, rest of OLS updates then.

Muli, signing off from the OLS soon to no longer be network room. Cheers!
