Muli Ben-Yehuda's journal

January 16, 2012

Sometimes, a paper is more than just a paper

Filed under: Uncategorized — mulix @ 10:51 AM

Sometimes, a paper is more than just a paper. Around late 2005 or early 2006 I started working on direct device assignment, a useful approach for I/O virtualization where you give a virtual machine direct access to an I/O device so that it can read and write the physical machine’s memory without hypervisor involvement. The main reason to use direct device assignment is performance: since you bypass the hypervisor on the I/O path, it stands to reason that for I/O intensive workloads — the hardest workloads to virtualize — direct device assignment would provide bare-metal performance. Right?

Wrong. Since 2006, we’ve seen again and again that even with direct device assignment, virtual machine performance falls far short of bare-metal performance for the same workload. Sometime in 2009, we realized that after you solve all other problems, one particularly thorny issue remains: interrupts. The interrupt delivery and completion architectural mechanisms in contemporary x86 machines, even with the latest virtualization support, were not designed for delivering interrupts directly to untrusted virtual machines. Instead, every hypervisor programs the interrupt controllers to deliver all interrupts directly to the hypervisor, which then injects the relevant interrupts into each virtual machine. For interrupt-intensive virtualized workloads, these exits to the hypervisor can lead to a massive drop in performance.

Although it is possible to work around the interrupt issue by modifying the virtual machine’s device drivers to use polling, as we did in the Turtles paper and in the Tamarin paper that will be presented in FAST ’12, it always annoyed me that the promise of bare-metal performance for virtual machines remained unreachable for unmodified virtual machines. That is, until now.

Through the amazing work of a combined IBM and Technion team, we came up with an approach called ELI, for Exitless Interrupts, that allows interrupts to be handled directly and securely in virtual machines, without any changes to the underlying hardware. With ELI, direct device assignment can finally do what it was always meant to do: provide virtual machines with bare-metal performance. It is nice to look back at the research over the last five or six years that led us to this point; it will be even nicer, when we present this work at ASPLOS in London in a couple of months, to ponder what other breakthroughs the next few years hold.

“ELI: Bare-Metal Performance for I/O Virtualization”, by Abel Gordon, Nadav Amit, Nadav Har’El, Muli Ben-Yehuda, Alex Landau, Assaf Schuster and Dan Tsafrir. In ASPLOS ’12: Seventeenth International Conference on Architectural Support for Programming Languages and Operating Systems.

Direct device assignment enhances the performance of guest virtual machines by allowing them to communicate with I/O devices without host involvement. But even with device assignment, guests are still unable to approach bare-metal performance, because the host intercepts all interrupts, including those interrupts generated by assigned devices to signal to guests the completion of their I/O requests. The host involvement induces multiple unwarranted guest/host context switches, which significantly hamper the performance of I/O intensive workloads. To solve this problem, we present ELI (ExitLess Interrupts), a software-only approach for handling interrupts within guest virtual machines directly and securely. By removing the host from the interrupt handling path, ELI manages to improve the throughput and latency of unmodified, untrusted guests by 1.3x–1.6x, allowing them to reach 97%–100% of bare-metal performance even for the most demanding I/O-intensive workloads.

January 2, 2012

New year, same stuff

Filed under: Uncategorized — mulix @ 11:10 PM

I guess I should write something here, but I am not quite sure what. My life is a roller-coaster of the mundane; rarely do I have a chance to sit back and pontificate. So, all is well, work work work study research kids school work sleep work work work fun! Not that I am complaining, mind you.

Perhaps I’ll write more tomorrow. Or in three months. We’ll see.

August 25, 2011

New Paper: Deconstructing Amazon EC2 Spot Instance Pricing

Filed under: Uncategorized — mulix @ 2:22 PM

Ever wonder how Amazon prices its spot instances? Or, having dug deeper, perhaps wondered why the prices sometimes appear a little funny? Wonder no more: Orna Agmon Ben-Yehuda tells the gruesome story of how Amazon really prices its spot instances in our new paper Deconstructing Amazon EC2 Spot Instance Pricing. Warning: not for the faint of heart.

Cloud providers possessing large quantities of spare capacity must either incentivize clients to purchase it or suffer losses. Amazon is the first cloud provider to address this challenge, by allowing clients to bid on spare capacity and by granting resources to bidders while their bids exceed a periodically changing spot price. Amazon publicizes the spot price but does not disclose how it is determined.

By analyzing the spot price histories of Amazon’s EC2 cloud, we reverse engineer how prices are set and construct a model that generates prices consistent with existing price traces. We find that prices are usually not market-driven as sometimes previously assumed. Rather, they are typically generated at random from within a tight price interval via a dynamic hidden reserve price. Our model could help clients make informed bids, cloud providers design profitable systems, and researchers design pricing algorithms.
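
To make the bidding rule in the abstract concrete, here is a toy Python sketch. It is not the model from the paper: it simply draws each period's price uniformly from a narrow band and checks how long a bid stays above the price. The function names, the band limits, and the bids are all made up for illustration.

```python
import random

def simulate_band_prices(floor, ceiling, periods, seed=0):
    """Draw one price per period uniformly from [floor, ceiling].

    Toy assumption: this only illustrates "random prices within a tight
    interval"; the paper's reverse-engineered model is more involved.
    """
    rng = random.Random(seed)
    return [rng.uniform(floor, ceiling) for _ in range(periods)]

def periods_until_outbid(bid, prices):
    """Count the leading periods during which the bid exceeds the spot price."""
    held = 0
    for price in prices:
        if bid <= price:
            break  # the bid no longer exceeds the price: resources are lost
        held += 1
    return held

# Hypothetical band and bids, in dollars per hour.
prices = simulate_band_prices(floor=0.030, ceiling=0.034, periods=24)
held_high = periods_until_outbid(0.035, prices)  # bid above the whole band
held_low = periods_until_outbid(0.029, prices)   # bid below the whole band
print(held_high, held_low)
```

Under these assumptions, a bid just above the band keeps its resources for all 24 periods, while a bid just below the band is never granted. Bidding far above the band buys nothing extra, which is one way to read the claim that the model could help clients make informed bids.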

Academic Highs and Lows

Filed under: Uncategorized — mulix @ 12:33 PM

One of the reasons I love the academic life is the built-in highs. There’s nothing quite like the high you get when you make a discovery, or when something finally works like it should. I won’t lie: I’ve been known to do the happy happy joy joy dance in the halls on such occasions. The high when a paper is accepted lasts for a few days; winning a prestigious award is a rare pleasure and the high lasts longer. Learning that someone else cites your work is always nice, especially if it causes the all-important h-index to rise, as it did last night.

But, with the highs also come the lows: rejection never ceases to hurt, and at least statistically, most papers will be rejected before they get accepted. But you know what, that’s OK too, because hurting when your paper gets rejected just means you care. Without lows, there could not be any highs — and it’s the highs that matter.

June 14, 2011

Odds and Ends from USENIX FCW11

Filed under: Uncategorized — mulix @ 9:56 PM

This week I am at the USENIX Federated Conferences Week in lovely Portland, Oregon. Yesterday Orna and I walked back and forth through Portland; today I am at the 3rd Workshop on I/O Virtualization, and tomorrow I will be at the USENIX Annual Technical Conference.

I am wearing multiple hats this week. Earlier this morning I presented our work on “SplitX: Split Guest/Hypervisor Execution on Multi-core”, and Abel Gordon presented our work on “VAMOS: Virtualization Aware Middleware”. Both presentations went well, if I might say so myself. Nadav Amit will present our work on “vIOMMU: Efficient IOMMU Emulation” tomorrow — if his flight makes it here in time. I will also participate in a panel on challenges in cloud I/O later today, will be chairing a session on Friday, and will be summarizing several sessions for ;login today and tomorrow.

I like being here at USENIX ATC and WIOV; the topics are close to my heart and the halls are full of familiar faces. It is also good, however, to step outside of one’s comfort zone every so often. Accordingly I agreed to serve on the technical program committee for SPSN ’11: The First International Workshop on Security and Privacy in Social Networks. This should be interesting.

Last but not least, a journey that started almost two years ago reached its end — at least the end of the beginning — recently when Marcelo Tosatti applied the nested VMX patchset to the KVM tree. Kudos to Nadav Har’El who took a bunch of research code (in both the best and worst senses of the term) and continuously polished and rewrote it until it finally conformed to the highest standards of open source development. Thanks to Nadav’s tireless efforts, nested virtualization on Intel machines will soon be available on every KVM deployment near you.

May 13, 2011

Expertly running scientific tasks on grids and clouds

Filed under: Uncategorized — mulix @ 10:10 AM

My wife Orna Agmon Ben-Yehuda is a graduate student at the Technion CS department, working with Prof. Assaf Schuster. Orna recently published a paper discussing part of her PhD work, which I think is amazing. ExPERT: Pareto-Efficient Task Replication on Grids and Clouds tackles the following problem. Let’s say you are a scientist and you have a collection of computational tasks that you want to run. You also have several resources at your disposal for running these tasks. For example, you could run some, or all, of your tasks on Amazon’s EC2 cloud, which costs money but provides fairly high reliability and quick turnaround, or you could run some, or all, of your tasks on a local computational grid, which is free, but also unreliable and slow. How do you choose?

Orna’s paper tackles this question systematically and makes several contributions. First, it proposes a useful model for reasoning about this problem. Second, it presents ExPERT, a framework that helps the scientist characterize and understand the range of potential choices, assisting the user in picking the right choice. In particular, ExPERT finds those specific choices that are on the Pareto frontier: the set of choices for which no other choice is at least as good on every criterion and strictly better on at least one. Using ExPERT, a scientist who cares more about cost could pick the cheapest option for running her tasks, while a scientist who cares about response time could pick a more expensive option that provides the quickest response. Another scientist could choose how to balance the two, getting the best response time for a given budget, or the best cost for a given response time. The full abstract is below, and the paper is available here.

Many scientists perform extensive computations by executing large bags of similar tasks (BoTs) in mixtures of computational environments, such as grids and clouds. Although the reliability and cost may vary considerably across these environments, no tool exists to assist scientists in the selection of environments that can both fulfill deadlines and fit budgets. To address this situation, in this work we introduce the ExPERT BoT scheduling framework. Our framework systematically selects from a large search space the Pareto-efficient scheduling strategies, that is, the strategies that deliver the best results for both makespan and cost. ExPERT chooses from them the best strategy according to a general, user-specified utility function. Through simulations and experiments in real production environments we demonstrate that ExPERT can substantially reduce both makespan and cost, in comparison to common scheduling strategies. For bioinformatics BoTs executed in a real mixed grid+cloud environment, we show how the scheduling strategy selected by ExPERT reduces both makespan and cost by 30%-70%, in comparison to commonly-used scheduling strategies.
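
For readers who have not met Pareto efficiency before, the following Python sketch shows the idea on two criteria, cost and makespan. It is only an illustration of the concept, not ExPERT itself: the strategy names and numbers are hypothetical, and the helper names (`dominates`, `pareto_frontier`, `fastest_within_budget`) are invented here.

```python
from typing import NamedTuple

class Strategy(NamedTuple):
    name: str
    cost: float      # e.g. dollars
    makespan: float  # e.g. hours until the last task finishes

def dominates(a, b):
    """True if a is no worse than b on both criteria and strictly better on one."""
    return (a.cost <= b.cost and a.makespan <= b.makespan
            and (a.cost < b.cost or a.makespan < b.makespan))

def pareto_frontier(strategies):
    """Keep only the strategies that no other strategy dominates."""
    return [s for s in strategies
            if not any(dominates(other, s) for other in strategies)]

def fastest_within_budget(frontier, budget):
    """Best makespan among frontier strategies that fit the budget, or None."""
    affordable = [s for s in frontier if s.cost <= budget]
    return min(affordable, key=lambda s: s.makespan) if affordable else None

# Hypothetical numbers, for illustration only.
candidates = [
    Strategy("grid only", cost=0.0, makespan=40.0),
    Strategy("cloud only", cost=90.0, makespan=6.0),
    Strategy("grid, then cloud for stragglers", cost=25.0, makespan=10.0),
    Strategy("cloud with heavy replication", cost=120.0, makespan=8.0),
]
frontier = pareto_frontier(candidates)
cheapest = min(frontier, key=lambda s: s.cost)
fastest = min(frontier, key=lambda s: s.makespan)
balanced = fastest_within_budget(frontier, budget=30.0)
```

With these made-up numbers, "cloud with heavy replication" drops out because "cloud only" is both cheaper and faster. The three remaining strategies are all defensible, and which one to pick depends on what the scientist values, which is where a user-specified utility function comes in.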

May 12, 2011

Greek adventure

Filed under: Uncategorized — mulix @ 4:09 PM

Sitting inside working in my office in Haifa on a beautiful day, instead of doing something fun outside in the sun, always feels slightly wrong. Sitting in a conference room in Heraklion, Crete, and looking out of the window at the gorgeous view outside only feels slightly wronger.

Orna and I are currently in beautiful Heraklion, Crete. I am here mostly for work — the IOLanes EU research project meeting and review — and for a little vacationing; Orna is in full vacation mode. We both badly need a vacation after the pressure-cooker of the ACM Symposium on Cloud Computing deadline. During the last couple of weeks of April we worked around the clock, culminating in the submission of three different papers on April 30th. IOLanes provided a much needed opportunity to visit Greece and rest a little.

We were supposed to arrive in Heraklion yesterday morning, but ended up spending a few hours in Athens first, courtesy of striking air traffic controllers. After landing in Athens we took the bus from the airport to the city center, where we walked around and enjoyed Greek hospitality and cooking. We had planned to return to the airport the same way, but when we returned to the bus stop a few hours later, we discovered tens of thousands of demonstrators blocking the bus station and nearby roads. Our first indication of imminent trouble was the large number of uniformed police officers blocking the streets in full riot gear, including gas masks. Moving closer, we saw the more experienced onlookers wearing scarves or surgical masks. While the demonstration was peaceful at that time, tensions were clearly running high and violence was not far off. Orna was feeling adventurous and wanted to hang around and try to intercept the bus somewhere along its route, but cooler heads (mine) prevailed. Once we saw the news agent lock and barricade his shop, we backtracked out of the danger area and found a taxi back to the airport. When we arrived at the airport we discovered that the demonstration had indeed turned violent shortly after we left. Nonetheless, we did not let the unexpected adventure deter us from enjoying Greek hospitality and most excellent cooking.

Today and tomorrow I am working, and Saturday will be dedicated to playing tourist. Next week it’s back to Haifa, with a detour to Tel-Aviv University on Tuesday evening where I will be giving the Turtles talk to the local defcon chapter dc9723. Should be an interesting experience with a different crowd than usual: younger and hopefully rowdier :-)

April 10, 2011

3rd Workshop on I/O Virtualization, VAMOS and SplitX

Filed under: Uncategorized — mulix @ 1:15 PM

The program committee meeting for the 3rd Workshop on I/O Virtualization was held this Friday. I like the resulting program quite a bit, regardless of the fact that two of our submissions—VAMOS and SplitX—were accepted. WIOV is probably my favorite workshop ever, and this year it will be held again with the USENIX Annual Technical Conference, another favorite venue. The full program will be available online in a week or two.

Our two accepted papers, “SplitX: Split Guest/Hypervisor Execution on Multi-Core” (joint with Alex Landau and Abel Gordon) and “VAMOS: Virtualization Aware Middleware” (joint with Abel Gordon, Dennis Filimonov, and Maor Dahan), tackle the I/O virtualization problem from two different directions. VAMOS follows the same general line of thought as our earlier Scalable I/O and IsoStack work. Raising the level of abstraction of I/O operations (socket calls instead of sending and receiving Ethernet frames, file system operations instead of reading and writing blocks) improves I/O performance because it cuts down the number of protection-domain crossings needed. In VAMOS, we perform I/O at the level of middleware operations, with the guest passing database queries to the hypervisor instead of reading and writing disk blocks. This gives a nice boost to performance, as you might expect, and is fairly easy to do by taking advantage of the inherent modularity of middleware, which to me was a surprising result.
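
A back-of-the-envelope Python sketch of why the level of abstraction matters. It assumes, purely for illustration, that every I/O operation costs exactly one guest/hypervisor crossing and that nothing is batched; real systems coalesce work in many ways, so treat the numbers as a rough intuition rather than a measurement. The payload size and the per-operation sizes are made up.

```python
import math

def crossings(payload_bytes, bytes_per_operation):
    """Crossings needed if each I/O operation costs one guest/hypervisor crossing."""
    return math.ceil(payload_bytes / bytes_per_operation)

payload = 1_000_000                        # a 1 MB message (made-up size)
per_frame = crossings(payload, 1500)       # one crossing per 1500-byte Ethernet payload
per_send = crossings(payload, 64 * 1024)   # one crossing per 64 KiB socket send
per_query = 1                              # one crossing for a single middleware-level request

print(per_frame, per_send, per_query)
```

Under these assumptions the same message costs 667 crossings at the frame level, 16 at the socket level, and one at the middleware level. The higher the abstraction, the more work each crossing carries.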

SplitX is a whole other kettle of fish. It has been clear to us for some time that the inherent overhead of x86 machine virtualization is tied to the trap-and-emulate model, as can be seen perhaps most clearly in the Turtles paper. With the trap-and-emulate model, both direct and indirect overheads are inherent in the model, because we time-multiplex two different contexts (the guest and the hypervisor) onto the same CPU core, incurring both the switch overhead and the indirect cost of dirtying the caches. But what if we could run guests on their own cores, and hypervisors on their own cores, and never the twain shall meet? SplitX presents our initial exploration of this—very promising, if I may say so myself—idea.

The papers will be available online later, but shoot me an email to get the current draft.

March 15, 2011

vIOMMU paper to appear in 2011 USENIX Annual Technical Conference

Filed under: Uncategorized — mulix @ 11:33 PM

Well, it’s official: our vIOMMU paper, which I wrote about previously, has been accepted to the 2011 USENIX Annual Technical Conference. I love it when that happens :-)

March 2, 2011

vIOMMU: Efficient IOMMU Emulation

Filed under: Uncategorized — mulix @ 4:01 PM

My colleague Nadav Amit will be presenting his M.Sc. research, which I had the pleasure of helping with, this upcoming Sunday. The summer before last Nadav did a summer internship with my group at the Haifa Research Lab. Nadav’s internship was dedicated to analyzing the IOTLB behavior of contemporary IOMMUs, and resulted in this WIOSCA paper. In order to analyze IOTLB behavior, we had to first collect traces of how modern operating systems set up their DMA buffers, and to do that, Nadav developed IOMMU emulation in KVM.

For his M.Sc., Nadav researched how to emulate IOMMUs efficiently, leading to two primary contributions. First, waiting just a few milliseconds before tearing down an IOMMU mapping can boost performance substantially due to high temporal reuse. Second, it is possible to emulate a hardware device without trapping to the hypervisor on every device interaction, by using a separate core (a sidecore) to run the device emulation code. The full abstract is below, and everyone is invited to the talk.

Direct device assignment, where a guest virtual machine directly interacts with an I/O device without host intervention, is appealing, because it allows an unmodified (non-hypervisor-aware) guest to achieve near-native performance. But device assignment for unmodified guests suffers from two serious deficiencies: (1) it requires pinning of all the guest’s pages, thereby disallowing memory overcommitment, and (2) it exposes the guest’s memory to buggy device drivers.

We solve these problems by designing, implementing, and exposing an emulated IOMMU (vIOMMU) to the unmodified guest. We employ two novel optimizations to make vIOMMU perform well: (1) waiting a few milliseconds before tearing down an IOMMU mapping in the hope it will be immediately reused (“optimistic teardown”), and (2) running the vIOMMU on a sidecore, and thereby enabling for the first time the use of a sidecore by unmodified guests. Both optimizations are highly effective in isolation. The former allows bare-metal to achieve 100% of a 10Gbps line rate. The combination of the two allows an unmodified guest to do the same.
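
The “optimistic teardown” idea can be illustrated with a small Python model. This is a hypothetical sketch, not the vIOMMU code: the class and method names are invented, time is passed in explicitly instead of read from a clock, and a “mapping” is just an entry in a Python container. It only counts how many expensive map and teardown operations are avoided when a buffer is reused within the grace period.

```python
class OptimisticTeardownCache:
    """Toy model of deferring IOMMU unmaps in the hope of quick reuse."""

    def __init__(self, grace_seconds=0.005):
        self.grace = grace_seconds
        self.active = set()    # buffers currently mapped and in use
        self.pending = {}      # buffer -> time at which unmap was requested
        self.maps_created = 0  # expensive: fresh mappings created
        self.teardowns = 0     # expensive: mappings really torn down

    def map(self, buffer, now):
        self.reap(now)
        if buffer in self.pending:
            del self.pending[buffer]   # reuse: cancel the deferred teardown
        elif buffer not in self.active:
            self.maps_created += 1     # no mapping left: create a fresh one
        self.active.add(buffer)

    def unmap(self, buffer, now):
        if buffer in self.active:
            self.active.remove(buffer)
            self.pending[buffer] = now  # defer the real teardown

    def reap(self, now):
        """Really tear down mappings whose grace period has expired."""
        expired = [b for b, t in self.pending.items() if now - t >= self.grace]
        for b in expired:
            del self.pending[b]
            self.teardowns += 1

def run(grace_seconds, rounds=1000):
    """Map and unmap the same (hypothetical) buffer once per millisecond."""
    cache = OptimisticTeardownCache(grace_seconds)
    for i in range(rounds):
        t = i * 0.001
        cache.map("rx-buffer-0", now=t)
        cache.unmap("rx-buffer-0", now=t + 0.0005)
    return cache.maps_created, cache.teardowns

deferred = run(grace_seconds=0.005)  # wait 5 ms before tearing down
eager = run(grace_seconds=0.0)       # tear down as soon as possible
print(deferred, eager)
```

In this toy workload the deferred variant creates a single mapping and never tears it down during the run, while the eager variant creates 1000 mappings and performs 999 teardowns. The price, visible in the model, is that a mapping stays valid for up to the grace period after the guest asked for it to be removed.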
