<!-- MHonArc v2.4.4 --> <!--X-Subject: [MUD-Dev] I/O Event Handling Under Linux by Richard Gooch --> <!--X-From-R13: X Q Znjerapr <pynjNhaqre.rate.ftv.pbz> --> <!--X-Date: Thu, 25 Jun 1998 23:05:43 -0700 --> <!--X-Message-Id: 199806251857.LAA02810#under,engr.sgi.com --> <!--X-Content-Type: text/plain --> <!--X-Head-End--> <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 3.2//EN"> <html> <head> <title>MUD-Dev message, [MUD-Dev] I/O Event Handling Under Linux by Richard Gooch</title> <!-- meta name="robots" content="noindex,nofollow" --> <link rev="made" href="mailto:claw#under,engr.sgi.com"> </head> <body background="/backgrounds/paperback.gif" bgcolor="#ffffff" text="#000000" link="#0000FF" alink="#FF0000" vlink="#006000"> <font size="+4" color="#804040"> <strong><em>MUD-Dev<br>mailing list archive</em></strong> </font> <br> [ <a href="../">Other Periods</a> | <a href="../../">Other mailing lists</a> | <a href="/search.php3">Search</a> ] <br clear=all><hr> <!--X-Body-Begin--> <!--X-User-Header--> <!--X-User-Header-End--> <!--X-TopPNI--> Date: [ <a href="msg01207.html">Previous</a> | <a href="msg01209.html">Next</a> ] Thread: [ <a href="msg01209.html">Previous</a> | <a href="msg01216.html">Next</a> ] Index: [ <A HREF="author.html#01208">Author</A> | <A HREF="#01208">Date</A> | <A HREF="thread.html#01208">Thread</A> ] <!--X-TopPNI-End--> <!--X-MsgBody--> <!--X-Subject-Header-Begin--> <H1>[MUD-Dev] I/O Event Handling Under Linux by Richard Gooch</H1> <HR> <!--X-Subject-Header-End--> <!--X-Head-of-Message--> <UL> <LI><em>To</em>: <A HREF="mailto:mud-dev#kanga,nu">mud-dev#kanga,nu</A></LI> <LI><em>Subject</em>: [MUD-Dev] I/O Event Handling Under Linux by Richard Gooch</LI> <LI><em>From</em>: J C Lawrence <<A HREF="mailto:claw#under,engr.sgi.com">claw#under,engr.sgi.com</A>></LI> <LI><em>Date</em>: Thu, 25 Jun 1998 11:57:47 -0700</LI> <LI><em>Reply-To</em>: <A HREF="mailto:mud-dev#kanga,nu">mud-dev#kanga,nu</A></LI> </UL> <!--X-Head-of-Message-End--> <!--X-Head-Body-Sep-Begin--> <HR> <!--X-Head-Body-Sep-End--> <!--X-Body-of-Message--> <PRE> The following explains why I've been seeing such poor performance when I moved my multiplexed IO structure (IO processing by a thread pool which synchonously handles all connections) to Linux, as well as detailing several other interesting IO handling concerns. You may not be worried about this sort of thing now. You will be when you're attempting to cram several thousand players and bots onto the same box. URL:<A HREF="http://www.atnf.csiro.au/~rgooch/linux/docs/io-events.html">http://www.atnf.csiro.au/~rgooch/linux/docs/io-events.html</A> --<cut>-- I/O Event Handling Under Linux Richard Gooch 24-JUN-1998 I/O Event handling is about how your Operating System allows you to manage a large number of open files (file descriptors in UNIX/POSIX, or FDs) in your application. You want the OS to notify you when FDs become active (have data ready to be read or are ready for writing). Ideally you want a mechanism that is scalable. This means a large number of inactive FDs cost very little in memory and CPU time to manage. Traditional Unix systems provide the select(2) and/or poll(2) system calls. With both of these you pass an array of FDs to the kernel, with an optional timeout. When there is activity, or when the call times out, the system call will return. The application must then scan the array to see which FDs are active. This scheme works well with small numbers of FDs, and is simple to use. Unfortunately, for thousands of FDs, this does not work so well. The kernel has to scan your array of FDs and check which ones are active. This takes approximately 3 microseconds (3 us) per FD on a Pentium 100 running Linux 2.1.x. Now you might think that 3 us is quite fast, but consider if you have an array of 1000 FDs. This is now 3 milliseconds (3 ms), which is 30% of your timeslice (each timeslice is 10 ms). If it happens that there is initially no activity and you specified a timeout, the kernel will have to perform a second scan after some activity occurs or the syscall times out. Ouch! If you have an even bigger application (like a large http server), you can easily have 10000 FDs. Scanning times will then take 30 ms, which is three timeslices! This is just way too much. You might say that 3 ms for 1000 FDs is not such a big deal: a user will hardly notice that. The problem is that the entire array of FDs is scanned each time you want to go back to your polling loop. The way these applications work is that after checking for activity on FDs, the application processes the activity (for example, reading data from active FDs). When all the activity has been processed, the application goes back to polling the OS for more FD activity. In many cases, only a small number of FDs are active at any one time (say during each timeslice), so it may only take a few milliseconds to process all the activity. High performance http servers can process hundreds or thousands of transactions per second. A server that takes 2 ms to process each active FD can process 500 transactions per second. If you add 3 ms for FD scanning in the kernel, you now have 5 ms per transaction. That only gives 200 transactions per second, a massive drop in performance. There is another problem, and that is that the application needs to scan the "returned" FD array that the kernel has updated to see which FDs are active. This is yet another scan of a large array. This isn't as costly as the kernel scan, for reasons I'll get to later, but it is still a finite cost. Some other operating systems provide a mechanism called I/O completion ports and some have event queues. These are a mechanism to tell the OS that you want to be notified when there is activity on a FD. Usually, when a FD becomes active (i.e. becomes ready for reading or writing) , the OS will send a message (a "readiness event") to a "port" (perhaps another FD such as a pipe). This message contains the FD that has become active. The one port can have many readiness events from many FDs sent to it. The key difference here is that you do not need to pass the OS a massive FD array each time you want to listen for events: you only need to tell the OS once for each FD that you want to receive readiness events for that FD. The kernel no longer needs to scan a massive FD array each time through your polling loop, and nor does the application. This is an appealing approach, and scales very well. Unfortunately, I/O readiness queues are not POSIX. They're not even Unix. We could add them to Linux, but it would mean that applications that relied on this mechanism would be unportable, Linux-only. Also, it would involve significant additions to the kernel, which sets off my bloat alarms. I would hope that it is possible to develop an effective alternative that uses standard POSIX/UNIX functionality. There are improvements we can make for the massive FD scanning problem. Firstly we can optimise the way the scanning is done inside the kernel. Right now (2.1.106) the kernel has to call the poll() method for each file structure. This is expensive. Back in the 2.1.5x kernels, I coded a better implementation for the kernel which sped things up almost 3 times. While this requires modifications to drivers to take advantage of this, it has the advantage of not changing the semantics we expect from UNIX. Note one other interesting feature of this optimisation: it centralises event notification, which in turn would make implementing I/O readiness queues simpler. I'm not sure how closure of FDs before readiness events are read should be handled. This could complicate their implementation. Another solution (which would also benefit from the kernel optimisation discussed above) is for the application to divide the FD array into a number of smaller FD arrays, say 10. You then create 10 threads, each of which has a polling loop using its smaller FD array. So each FD array is now 100 entries long. While this doesn't change the total number of FDs that must be scanned, it does change when they have to be scanned. Since most FDs are inactive, not all the threads will be woken up. Too see how this works, consider the example where, at any time (say during a single timeslice of 10 ms), only 5 FDs are active. Assuming these FDs are randomly, uniformly distributed, at most 5 threads will need to be woken up. These threads then process the activity and go back to the start of their polling loops. Where we win is that only 5 threads had to go back and call select(2) or poll(2). Since they each have 100 entry FD arrays, the kernel only has to scan 500 FDs. This has halved the amount of scanning required. The scanning load has gone from 30% to 15% by this simple change. If you were to instead use 100 threads, you would still only have at most 5 threads woken up for activity, and hence the total number of FDs scanned this timeslice would be 50. This takes down the scanning load to 0.15%, which is negligible. There is one thing to watch out for here: if you use select(2) in your polling loop, be aware that the size of your FD array is equal to the value of your largest FD. This is because select(2) uses a bitmask for its FD array. This means one of your threads will want to poll FDs 991 to 1000. Unfortunately, your FD array is still 1000 long. What's worse, the kernel still has to do a minimal scan for all those 1000 FDs. The solution to this is to use poll(2) instead, where you only have to pass as many FDs as you want to poll, and the kernel scans only those. This solution sounds ideal: just create lots and lots of threads. At the extreme, you create one thread per FD. There is a problem here, however, as each thread consumes system resources. So you need to compromise between the number of threads and the FD scanning load. Also, the more threads you have the more cache misses you induce, so this is something to avoid as well. Fortunately in this case most threads will be running nearly the same code at the same time, so cache pollution should not be a significant problem. A more advanced solution is to have dynamic migration of FDs depending on whether they are mostly active or inactive. In the simplest case, you only have two threads. One which polls mostly active FDs and the other polls mostly inactive FDs. The thread for active FDs will be woken up very frequently, but on the other hand will have only a small number of FDs to scan. The other thread will have to scan a large number of FDs, but it will only be woken up occasionally. For each FD an activity counter is kept. When a FD on the mostly inactive list is deemed to be fairly active, it is migrated to the mostly active list. A reverse operation occurs for fairly inactive FDs on the mostly active list. I favour this solution, since it can be implemented solely in userspace and is portable to other POSIX systems. I have an existing software library which has (amongst other things) support for managing events on FDs. I plan on extending this library to use the above technique. The library is distributed under the LGPL. Watch this space for results. My approach is to squeeze as much performance as we can out of the existing POSIX/UNIX interface by optimising the kernel and doing clever things in userspace. We can then evaluate how much of a bottleneck polling is under Linux. If polling overheads (for very large numbers of FDs) are kept to within a few percent, I firmly believe there is no need to I/O readiness queues. If polling overheads remain above 10%, then we should consider I/O readiness queues. OK, now I'll get back to another point I raised earlier, and that is the time taken for the application to scan a list of FDs for activity. We would like to reduce this time as well, and we can without much effort at all. Instead of using poll(2) we can implement poll2(2), a new syscall which is very much like poll(2) but returns an array of active FDs. So now the application only needs to search a very small array. This new syscall is based on the principle of not duplicating work. The kernel has just gone and scanned all these FDs for us, why not keep a record of that work, rather than doing it again in the application? Note that the current Linux polling implementation is so slow that implementing poll2(2) will make little difference. However, once the optimisations I've proposed are added, it will make a difference (yes, I've done the benchmarks). Oh, and before you take my comments on the Linux polling implementation too far, note that I've benchmarked it against several other OSes, and Linux came out far better in any case. My point is that we can still do a lot better. You might ask, why implement poll2(2) if we have this clever userspace solution that avoids many of the problems? Well, the point is that it would speed up the application processing the inactive list of FDs, and that is always a good thing. However, it may turn out that the gain from poll2(2) is marginal. You could also argue that since poll2(2) is non-POSIX, non-UNIX, then why bother with it at all, and why not simply implement I/O readiness queues? Well, I could say that poll2(2) is only a small departure from UNIX whereas I/O readiness queues are more radical. However, this isn't a very strong argument. I'm waiting to see the results of the userspace solution before considering poll2(2) further. A note on the "thundering herd" problem: it's not really of much interest in this discussion. The "problem" arises because people attempt to increase concurrency of accepting new connections by having multiple threads all blocking waiting on select(2). Because the kernel wakes up all threads, rather than one thread per new connection, we have a "thundering herd" of freshly woken up threads. These problems are best solved by treating the accepting of incoming connections as just another case of I/O management, rather than special-casing them. Readers may find a recent paper presented at USENIX98 of interest. The author also has a list of other research on this topic. The paper essentially describes a mechanism for improving the speed of select(2) by adding internal state information to the kernel about FD activity. I invite comments or additions to this document. My purpose is to explain the various issues involved and serve as a primer for more debate. I hope to avoid recurring debates on the linux-kernel list which go over the same ground and either never contribute anything new to the arguments, or take many messages before something new is added. If you have a strong technical argument that I've missed, I'll be happy to add it to this document. Original: 22-JUN-1998 --<cut>-- -- J C Lawrence Internet: claw#null,net (Contractor) Internet: coder#ibm,net ---------(*) Internet: claw#under,engr.sgi.com ...Honourary Member of Clan McFud -- Teamer's Avenging Monolith... </PRE> <!--X-Body-of-Message-End--> <!--X-MsgBody-End--> <!--X-Follow-Ups--> <HR> <ul compact><li><strong>Follow-Ups</strong>: <ul> <li><strong><A NAME="01216" HREF="msg01216.html">[MUD-Dev] Re: I/O Event Handling Under Linux by Richard Gooch</A></strong> <ul compact><li><em>From:</em> s001gmu#nova,wright.edu</li></ul> </UL></LI></UL> <!--X-Follow-Ups-End--> <!--X-References--> <!--X-References-End--> <!--X-BotPNI--> <UL> <LI>Prev by Date: <STRONG><A HREF="msg01207.html">[MUD-Dev] Re: Suggested theme</A></STRONG> </LI> <LI>Next by Date: <STRONG><A HREF="msg01209.html">[MUD-Dev] Thread implementations...</A></STRONG> </LI> <LI>Prev by thread: <STRONG><A HREF="msg01209.html">[MUD-Dev] Thread implementations...</A></STRONG> </LI> <LI>Next by thread: <STRONG><A HREF="msg01216.html">[MUD-Dev] Re: I/O Event Handling Under Linux by Richard Gooch</A></STRONG> </LI> <LI>Index(es): <UL> <LI><A HREF="index.html#01208"><STRONG>Date</STRONG></A></LI> <LI><A HREF="thread.html#01208"><STRONG>Thread</STRONG></A></LI> </UL> </LI> </UL> <!--X-BotPNI-End--> <!--X-User-Footer--> <!--X-User-Footer-End--> <ul><li>Thread context: <BLOCKQUOTE><UL> <LI><strong><A NAME="01226" HREF="msg01226.html">[MUD-Dev] Quotes from Marcus Ranum</A></strong>, J C Lawrence <a href="mailto:claw#under,engr.sgi.com">claw#under,engr.sgi.com</a>, Fri 26 Jun 1998, 22:05 GMT <UL> <LI><strong><A NAME="01231" HREF="msg01231.html">[MUD-Dev] Re: Quotes from Marcus Ranum</A></strong>, Nathan F Yospe <a href="mailto:yospe#hawaii,edu">yospe#hawaii,edu</a>, Sat 27 Jun 1998, 02:57 GMT </LI> </UL> </LI> <LI><strong><A NAME="01215" HREF="msg01215.html">[MUD-Dev] Re: Suggested theme</A></strong>, Chris Gray <a href="mailto:cg#ami-cg,GraySage.Edmonton.AB.CA">cg#ami-cg,GraySage.Edmonton.AB.CA</a>, Fri 26 Jun 1998, 14:45 GMT <LI><strong><A NAME="01209" HREF="msg01209.html">[MUD-Dev] Thread implementations...</A></strong>, J C Lawrence <a href="mailto:claw#under,engr.sgi.com">claw#under,engr.sgi.com</a>, Fri 26 Jun 1998, 06:15 GMT <LI><strong><A NAME="01208" HREF="msg01208.html">[MUD-Dev] I/O Event Handling Under Linux by Richard Gooch</A></strong>, J C Lawrence <a href="mailto:claw#under,engr.sgi.com">claw#under,engr.sgi.com</a>, Fri 26 Jun 1998, 06:05 GMT <UL> <LI><strong><A NAME="01216" HREF="msg01216.html">[MUD-Dev] Re: I/O Event Handling Under Linux by Richard Gooch</A></strong>, s001gmu <a href="mailto:s001gmu#nova,wright.edu">s001gmu#nova,wright.edu</a>, Fri 26 Jun 1998, 16:45 GMT </LI> </UL> </LI> <LI><strong><A NAME="01206" HREF="msg01206.html">[MUD-Dev] (fwd) Dantom's Universal Network Game (DUNG)</A></strong>, Nathan Fenenga Yospe <a href="mailto:yospe#hawaii,edu">yospe#hawaii,edu</a>, Fri 26 Jun 1998, 05:48 GMT <LI><strong><A NAME="01201" HREF="msg01201.html">[MUD-Dev] ScryMUD: Source release, version 1.4</A></strong>, Ben Greear <a href="mailto:greear#cyberhighway,net">greear#cyberhighway,net</a>, Thu 25 Jun 1998, 06:10 GMT <LI><strong><A NAME="01196" HREF="msg01196.html">[MUD-Dev] Re: Suggested theme, was Re: WIRED: Kilers have more fun</A></strong>, Chris Gray <a href="mailto:cg#ami-cg,GraySage.Edmonton.AB.CA">cg#ami-cg,GraySage.Edmonton.AB.CA</a>, Thu 25 Jun 1998, 03:03 GMT </LI> </UL></BLOCKQUOTE> </ul> <hr> <center> [ <a href="../">Other Periods</a> | <a href="../../">Other mailing lists</a> | <a href="/search.php3">Search</a> ] </center> <hr> </body> </html>