Software TLB Management in OSF/1 and Mach 3.0

Richard Uhlig, David Nagle, Trevor Mudge, Stuart Sechrest
Dept. of EECS, University of Michigan, Ann Arbor
December 4, 1992

1.0 Abstract

Many new computer architectures provide reduced hardware support for virtual memory in the form of basic translation lookaside buffer (TLB) hardware, software traps for TLB miss handling, and a small set of instructions for probing and changing TLB state. This means that operating system writers are responsible for choosing page table structure and the policies governing the placement and replacement of page table entries in the TLB. If done carelessly, software TLB management can impose considerable penalties, which are exacerbated by the structure of newer generation operating systems such as OSF/1 and Mach 3.0.

This work explores the current TLB management policies of OSF/1 and Mach 3.0-based systems and explains some of the reasons for their lower performance when compared against more traditional, monolithic-kernel designs such as Ultrix. We present a collection of improved TLB management techniques and measure their effectiveness with OSF/1 and Mach 3.0. Although our experiments are performed on a MIPS R2000-based machine, our techniques have been designed to exploit features of the newer MIPS R4000 TLB.

2.0 Introduction

Microkernel operating systems such as Mach 3.0 offer many advantages [Accetta et al. 1986], but have always suffered performance problems [Guillemont et al. 1991; Anderson et al. 1991]. In their 1991 ASPLOS paper, Anderson et al. suggested that a major reason for these performance problems is a divergence of operating system and computer architecture trends. They arrived at this conclusion by measuring the frequency and estimating the cost of various primitive operating system functions such as context switching, system call invocation and translation lookaside buffer (TLB) miss handling. Table 1 reproduces some of Anderson's results, which compare the performance of the monolithic-kernel Mach 2.5 against the microkernel-based Mach 3.0. With the exception of instruction emulation, kernel TLB (KTLB) miss handling is by far the most frequently invoked operating system primitive considered in Anderson's study.[1] Note that although address space and thread context switches increase most dramatically in the Mach 3.0 system, their frequency is small when compared against that of KTLB misses.

[1] KTLB misses typically represent only about 5% to 10% of all TLB miss types. When all TLB miss types are taken into account, they outnumber even the emulated instructions.

The R3000-based DECstation 5000 used in Anderson's study is representative of a recent class of machines that require TLB misses to be serviced by OS software [Kane et al. 1992]. Other processor architectures with software-managed TLBs include the HP-PA and the DEC Alpha [Hewlett-Packard 1990; Digital 1992]. Such machines require operating system writers to choose both page table structure and the policies governing placement and replacement of page table entries (PTEs) in the TLB. Although the costs of software TLB management can be substantial, there has been little investigation into this new OS responsibility.

This work was supported by the Defense Advanced Research Projects Agency under DARPA/ARO Contract Number DAAL03-90-C-0028, by National Science Foundation Grant No. CDA-9121888, and by a National Science Foundation Graduate Fellowship.

Copyright December 1993

Table 1: Application Reliance on Operating System Primitives

Operating   Time       Address Space   Thread Context   System    Emulated       KTLB        Other
System      (seconds)  Switches        Switches         Calls     Instructions   Misses      Exceptions
Mach 2.5    307.2      10,740          18,225           90,605    2,650,879      439,067     173,937
Mach 3.0    416.5      215,811         248,274          278,771   5,233,938      3,857,602   390,385
Ratio       1.4:1      20.9:1          13.6:1           3.1:1     2.0:1          8.8:1       2.2:1

These data are derived from Table 7 of [Anderson et al. 1991]. The counts are the sum of OS primitive invocations caused by running a workload consisting of spellcheck, latex, andrew (local and remote), linking a vmunix kernel, and parthenon (1 and 10 threads). The time is the total required to run all of these programs once on an R3000-based DECstation 5000. The ratios show the increase in time (or counts) for the Mach 3.0 system when compared against Mach 2.5. Their Mach 3.0 system (which is similar to our Mach 3.0 + AFS system described in Table 2) supports UNIX and AFS functionality in two separate user-level servers.

In this work we examine the software TLB management problem in depth, considering all types of TLB misses. To perform this work, we extend the range of systems studied by Anderson et al. to include Ultrix, OSF/1 and two versions of Mach 3.0 (one with an AFS client cache manager and one without). We also construct workloads that emulate the activity of a system with many user-level servers and relate poor TLB performance to high degrees of system decomposition. To mitigate this cost, we propose and measure the effectiveness of various improved TLB management policies. Our experiments are performed on an R2000-based DECstation 3100, but our results extend to the newer MIPS R4000 processor [Kane et al. 1992].

This paper is organized as follows. Section 3.0 describes the tools, machines and operating systems used in our analysis. Section 4.0 is a case study of the current TLB management techniques used by Ultrix, OSF/1 and Mach 3.0 on a DECstation 3100. In Section 5.0 we suggest improvements to TLB management in the baseline systems and then measure their effectiveness. Section 6.0 summarizes the work.

3.0 Analysis Tools and Method

The performance of low-level operating system primitives is heavily dependent upon hardware parameters such as instruction and data cache size, TLB size and main memory bandwidth. To take these effects into account, we use a combination of hardware- and software-based analysis tools to perform our analysis of TLB management costs.

3.1 System Monitoring with Monster

The Monster monitoring system (see Figure 1) was developed to enable analyses of the interaction between operating systems and computer architectures. Monster consists of a DECstation 3100 whose motherboard has been modified so that a logic analyzer can be attached to the CPU pins. In this study, we use Monster to obtain accurate measurements of TLB miss handling costs. Monster's other capabilities are described more completely in a University of Michigan technical report [Nagle et al. 1992].

The logic analyzer component of Monster contains a programmable hardware state machine and a 128K-entry trace buffer. The state machine includes pattern recognition hardware that can sense the processor's address, data and control signals on every clock cycle. The state machine can be programmed to trigger on certain patterns appearing at the CPU bus and then dump these signals and a timestamp (with 1 ns resolution) into the trace buffer.

Timing software code paths also requires an instrumented OS kernel. For example, to measure the time spent in the software TLB miss handlers, the entry and exit points of the handlers are annotated with distinctive nop marker instructions. The logic analyzer's state machine is programmed to detect the marker instructions, which are stored to the logic analyzer's buffer along with a timestamp while the system runs some workload of interest. When the workload completes, the trace buffer is post-processed to obtain a histogram of the time spent in the different invocations of the TLB miss handlers.

This technique for timing code paths yields far more accurate timings for single invocations of a code path than can be obtained by using a system clock with much coarser resolution or, as is often done, by executing a code fragment in a loop N times and then dividing the total time spent by N [Clapp et al. 1986].
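To make the instrumentation concrete, the sketch below shows one way entry and exit markers might be placed around a miss handler. The macro names and the particular no-op encodings are our own illustration (the exact instructions used in the instrumented kernels are not listed here); the essential point is that each marker does no useful work but presents a distinctive bit pattern that the logic analyzer's pattern recognizer can trigger on.

    /* Hypothetical marker macros: "sll $0, $0, n" executes as a no-op on the
     * R2000, but each value of n produces a distinct, recognizable opcode word. */
    #define MARK_HANDLER_ENTRY()  __asm__ __volatile__("sll $0, $0, 1")
    #define MARK_HANDLER_EXIT()   __asm__ __volatile__("sll $0, $0, 2")

    void generic_tlb_miss_handler(void)    /* illustrative only */
    {
        MARK_HANDLER_ENTRY();   /* logic analyzer records a timestamp here     */
        /* ... walk the page tables and insert the missing PTE ...             */
        MARK_HANDLER_EXIT();    /* second timestamp; the difference is the     */
                                /* elapsed time for this handler invocation    */
    }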

Figure 1: The Monster Monitoring System

Monster is a hardware monitoring system consisting of a Tektronix DAS 9200 Logic Analyzer and a DECstation 3100 running three operating systems: Ultrix, OSF/1 and Mach 3.0. The DECstation motherboard has been modified to provide access to the CPU pins, which lie between the processor and the cache. This allows the logic analyzer to monitor virtually all system activity. To enable the logic analyzer to trigger on certain operating system events, such as the servicing of a TLB miss, each operating system has been instrumented with special marker NOP instructions that indicate the entry and exit points of various routines.

Table 2: Operating Systems Used in This Study

Operating System   Description
Ultrix             Version 3.1 of Ultrix, as shipped by Digital Equipment Corporation.
OSF/1              Version 1.0 from the Open Software Foundation (derived from Mach 2.5).
Mach 3.0           Carnegie Mellon University's microkernel version mk77 and UNIX server version uk38.
Mach 3.0 + AFS     Same as Mach 3.0, but with an AFS cache manager running as a separate task outside of the UNIX server.

Table 3: Benchmarks Used in This Study

Benchmark     Description
gcc           The GNU cc compiler benchmark taken from the SPEC benchmark suite [SPEC 1991].
io            Creates an 8 Megabyte file.
Ousterhout    John Ousterhout's benchmark suite from [Ousterhout 1989].
spell         The UNIX spell utility applied to a 13K-word latex document.

All experiments were performed on an R2000-based DECstation 3100 (16.7 MHz) running three different base operating systems (see Table 2): Ultrix, OSF/1 and Mach 3.0. Each of these systems includes a standard Berkeley fast file system [McKusick et al. 1984]. An additional Mach 3.0-based system was created by adding the Andrew File System (AFS) cache manager [Satyanarayanan 1990], running as a separate server task.[2] To obtain measurements, all of the operating systems were instrumented with counters and markers as outlined in the previous section.

[2] The current version of the UNIX server from CMU includes AFS functionality. For our experiments, we have (partially) moved this AFS code into a separate task.

Figure 2: Page Table Structure in OSF/1 and Mach 3.0

The Mach page tables form a 3-level structure with the first two levels residing in virtual (mapped) space. The top of the page table structure holds the user pages, which are mapped by level 1 user PTEs; each level 1 user PTE maps one 4K page of user text or data. The level 1 user PTEs are stored in the L1 page tables, with each task having its own set of L1 page tables. Mapping the L1 page tables are the level 2 PTEs, each of which maps one 1,024-entry user page table page. They are stored in the L2 page tables, which hold both level 2 PTEs and level 1 kernel PTEs (each level 1 kernel PTE maps one 4K page of kernel text or data). In turn, the L2 pages are mapped by the level 3 PTEs stored in the L3 page table; each level 3 PTE maps one page of either level 2 PTEs or level 1 kernel PTEs. At boot time, the L3 page table is fixed in unmapped physical memory. This serves as an anchor to the page table hierarchy because references to the L3 page table do not go through the TLB. The MIPS R2000 architecture has a fixed 4 Kilobyte page size, and each PTE requires 4 bytes of storage. Therefore, a single L1 page table page can hold 1,024 level 1 user PTEs, mapping approximately 4 Megabytes of virtual address space. Likewise, the L2 page tables can directly map 4 Megabytes of kernel data or indirectly map 4 Gigabytes of L1 user data.

Throughout the paper we use four benchmarks (see Table 3). The same binaries were used on all operating systems. To improve accuracy, each measurement cited in this paper is the average of three trials.

4.0 Case Study

In this section we describe the Ultrix, OSF/1 and Mach 3.0 page table structures and the hardware support for address translation provided by the R2000/R3000 and the newer R4000 microprocessors. We analyze the current TLB management policies for the different operating systems on an R2000-based DECstation 3100 and then present our measurements of TLB miss handling costs in these systems.

4.1 Page Tables and Translation Hardware

OSF/1 and Mach 3.0 on the DECstation both implement a linear[3] page table structure (see Figure 2). These page tables are used to store the mapping of pages from a large virtual address space to a more limited physical address space. Each task has its own L1 page table, which is maintained by machine-independent pmap code [Rashid et al. 1988]. Because the user page tables can amount to several megabytes, they are themselves paged. This is supported through L2 (or kernel) page tables, which also map other kernel data. Because kernel data is relatively large and sparse, the L2 page tables are also mapped. This gives rise to a 3-level page table hierarchy and a variety of different page table entry (PTE) types.

[3] Rather than an inverted page table [Wilkes et al. 1992].
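The address arithmetic implied by Figure 2 can be summarized in a short sketch. The constants follow from the R2000's 4 Kbyte pages and 4-byte PTEs; the function names are our own and are not identifiers from the OSF/1 or Mach sources.

    #define PAGE_SHIFT     12                        /* 4 Kbyte pages           */
    #define PTE_SIZE       4                         /* each PTE is 4 bytes     */
    #define PTES_PER_PAGE  (1 << (PAGE_SHIFT - 2))   /* 1,024 PTEs per page     */

    /* Index of the level 1 PTE for virtual address va within a task's linear
     * L1 page table; one L1 page-table page therefore maps
     * 1,024 x 4 Kbytes = 4 Mbytes of user address space. */
    unsigned long l1_index(unsigned long va) { return va >> PAGE_SHIFT; }

    /* Index of the level 2 PTE that maps the L1 page-table page covering va.
     * Because the L1 tables are themselves mapped, fetching a level 1 PTE can
     * itself miss in the TLB and require this level 2 PTE; a miss on the
     * level 2 PTE in turn requires a level 3 PTE from the unmapped L3 table. */
    unsigned long l2_index(unsigned long va) { return va >> (PAGE_SHIFT + 10); }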

The R2000 processor contains a 64-slot, fully-associative TLB, which is used to cache recently-used PTEs. When the R2000 translates a virtual address to a physical address, the relevant PTE must be held by the TLB. If it is absent, the hardware invokes a trap to a software TLB miss handling routine that finds the missing PTE and inserts it into the TLB using various TLB control instructions.

The R2000 supports two different types of TLB miss vectors. A special vector is used to trap on missing translations for level 1 user pages. This special vector is justified by the fact that level 1 user PTE misses are assumed to be the most frequent of all PTE miss types [DeMoney et al. 1986]. All other TLB miss types (such as those caused by references to kernel pages, invalid pages or write-protected pages) are vectored to a generic exception vector.[4] This broad collection of misses is the same as the KTLB misses described in Anderson's study [Anderson et al. 1991].

[4] In addition to these TLB-related traps, the generic vector is shared among all other interrupts, exceptions and traps, such as clock and device interrupts, arithmetic exceptions and system calls.

For the purposes of this study, we refine the definition of TLB miss types (see Table 4) to correspond to the page table structure implemented by OSF/1 and Mach 3.0. In addition to level 1 user TLB misses, we define four subcategories of KTLB misses (level 1 kernel, level 2, level 3 and other). Table 4 also shows our measurements of the time required to handle the different types of TLB misses. The wide differential in costs is primarily due to the two different miss vectors and the way that the OS uses them. Level 1 user PTEs can be retrieved within 20 cycles because they can be serviced by a highly-tuned handler inserted at the special vector. However, all other miss types require from about 300 to over 400 cycles because they are serviced by a generic handler that is inserted at the generic exception vector.

Table 4: Costs for Different TLB Miss Types

TLB Miss Type    Ultrix   OSF/1   Mach 3.0
Level 1 User     20       20      20
Level 1 Kernel   294      294     294
Level 2          407      407     407
Level 3          --       286     286
Other            405      405     405

This table shows the number of machine cycles (at 60 ns/cycle) required to service different types of TLB misses. To determine these costs, Monster was used to collect a 128K-entry histogram of timings for each type of miss. The histogram was then pruned by 25% to remove outliers caused by intervening interrupts, and the median value for each type of miss was found. We separate TLB miss types into the five categories described below. Note that Ultrix doesn't have level 3 PTE misses because it implements a 2-level page table.

Level 1 User:    TLB miss on a level 1 user PTE.
Level 1 Kernel:  TLB miss on a level 1 kernel PTE.
Level 2:         TLB miss on a level 2 PTE. This can only occur after a miss on a level 1 user PTE.
Level 3:         TLB miss on a level 3 PTE. Can occur after either a level 2 PTE miss or a level 1 kernel PTE miss.
Other:           All other exceptional conditions involving the TLB, including accesses to invalid pages (page faults) and page protection violations.

Figure 3: The MIPS TLBs

The MIPS R2000 and R3000 microprocessors both have an on-chip, 64-entry, fully-associative TLB that supports fixed lower and upper partitions of 8 and 56 slots, respectively. R2000/R3000 PTEs map 4 Kbyte pages. The R4000 TLB is also on-chip and fully associative, but it has 48 slots, each of which can hold a pair of PTEs for consecutive pages. The partition between upper and lower slots is not fixed and can be set under software control. R4000 PTEs map pages that can vary from 4 Kbytes to 16 Mbytes in size.

The R2000 TLB hardware supports partitioning of the TLB into two sets of slots. The lower partition is intended for infrequently referenced PTEs with high retrieval costs, while the upper partition is intended to hold more frequently used PTEs that can be re-fetched more quickly. The TLB hardware also supports random replacement of PTEs in the upper partition through a hardware index register that returns random numbers in the range 8 to 63. This effectively fixes the TLB partition at 8, so that the lower partition consists of slots 0 through 7, while the upper partition consists of slots 8 through 63.

The TLB of the MIPS R4000 differs from the R2000/R3000 TLB design in four significant ways (see Figure 3) [Kane et al. 1992]. First, the slot partition point is adjustable because the range of numbers generated by the random index register can be changed under software control. Second, level 1 kernel PTE misses are eligible to be serviced by the same special miss handler that fixes level 1 user PTE misses. This reduces their retrieval time by an order of magnitude. Third, page size in the R4000 is variable from 4 Kbytes to 16 Mbytes. Finally, the R4000 TLB has only 48 slots, but each slot can hold a pair of PTEs if they are contiguous in the virtual address space. In this study, we are primarily interested in the first two differences.

4.2 The TLB Management Problem

The challenge of TLB management is to minimize overall TLB miss handling time through the allocation of a fixed number of TLB slots to the different types of PTEs. As we shall see, optimal TLB management is dependent upon the structure of the operating system and its use of virtual memory. The management problem is also complicated by the varying costs to retrieve the different PTE types and by the different patterns with which they are accessed. For example, level 1 user PTEs are used on every memory reference by a user task, but level 2 PTEs are only needed when servicing level 1 PTE misses. Thus, level 2 PTEs are used less frequently but are much more costly to retrieve if they are missing from the TLB. Level 3 PTEs are only used when servicing level 2 or level 1 kernel PTE misses. So, level 3 PTE use is infrequent unless the level 2 and level 1 kernel PTE miss rates are high.

The varying retrieval costs and interrelationships between the different PTE types are the basis for TLB management policies such as determining into which TLB partition a given PTE type should be placed, and what the replacement policy for evicting PTEs from the TLB should be. In a more flexible TLB design, such as that of the R4000, an additional management concern is the setting of the partition point.

Table 5: TLB Miss Counts in the Base Systems

PTE Miss Type    Ultrix      OSF/1       Mach 3.0    Mach 3.0 + AFS
Level 1 User     1,776,190   2,111,741   2,771,397   7,417,212
Level 1 Kernel   27,112      541,323     42,615      172,649
Level 2          420         1,449       23,448      206,257
Level 3          --          56,365      45,493      135,764
Other            21,291      47,841      54,140      77,733
Total            1,825,013   2,758,719   2,937,093   7,855,140

These are the total occurrences of TLB misses for each of the different PTE types. The counts are the totals obtained by running each of the workloads from Table 3.

Table 6: TLB Miss Costs in the Base Systems

PTE Miss Type    Ultrix   OSF/1   Mach 3.0   Mach 3.0 + AFS
Level 1 User     2.13     2.53    3.33       8.90
Level 1 Kernel   0.48     9.55    0.73       2.96
Level 2          0.01     0.04    0.57       5.04
Level 3          --       0.97    0.80       2.39
Other            0.52     1.16    1.32       1.89
Total            3.14     14.25   6.75       21.18

This table shows the total cost (in seconds) to service each of the different TLB miss types. The costs were obtained by combining the counts from Table 5 with the code path timings of Table 4.
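The entries of Table 6 follow directly from multiplying the counts of Table 5 by the per-miss cycle costs of Table 4 at 60 ns per cycle. For example, the 541,323 level 1 kernel misses taken by OSF/1 at 294 cycles each account for 541,323 x 294 x 60 ns, or about 9.55 seconds, while the 1,776,190 level 1 user misses taken by Ultrix at only 20 cycles each cost 1,776,190 x 20 x 60 ns, or about 2.13 seconds.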

4.3 TLB Management in the Baseline Systems

Each of the systems that we studied uses similar TLB management policies. All three systems use the 8-slot lower TLB partition to hold level 2 PTEs and the 56-slot upper partition to hold all other PTE types. All systems use a first-in-first-out (FIFO) replacement policy in the lower partition and random replacement in the upper partition.

Although management policies between the systems are mostly identical, there are some differences. First, because Ultrix implements a 2-level page table structure, it has no level 3 PTEs. Second, the version of OSF/1 shipped by OSF actually doesn't use the lower partition at all; OSF/1 mixes all PTE types in the same 56 slots of the upper partition, throwing away 12.5% of the available TLB resource. We considered this a code bug and modified the OSF/1 miss handlers to utilize the lower partition according to the same policies used by Ultrix and Mach 3.0.

We measured TLB performance under these management policies for each of the four systems. Table 5 shows the TLB miss counts, while Table 6 shows the cost (in seconds) to service the different types of misses. Although all of these operating systems use similar TLB management policies and ran the same workload with identical binaries, their TLB performance varies widely.

The policies described above work well for Ultrix, which spends only 3.14 seconds (out of 281 seconds total) handling TLB misses. By separating level 2 PTEs from the other PTE types, the miss counts for this expensive PTE type are kept very low. Because of its monolithic-kernel structure, Ultrix is also able to store all of its program text and most of its data structures in unmapped space. This is evident in the low rate of level 1 kernel PTE misses. Note that Ultrix, with its 2-level page table structure, also completely avoids the miss handling costs associated with the level 3 PTEs of the OSF/1 and Mach 3.0-based systems.

OSF/1 spends a much greater amount of time (14.25 seconds) handling TLB misses. This is primarily due to a high component of level 1 kernel misses. Although OSF/1 is a monolithic-style kernel like Ultrix, it maps a greater portion of its data structures because of its dynamic kernel memory allocator, which relies on virtual memory support.

The Mach 3.0-based systems also spend a great deal of time handling TLB misses, either about 6 or 20 seconds depending on whether or not there is an additional user-level AFS server in the system. The large jump in level 1 user misses is due to the migration of OS services from the Mach 3.0 microkernel into the user-level UNIX server, where they must be mapped. The Mach 3.0+AFS system also shows a large jump in level 2 misses due to the addition of the AFS server. We will explore the relationship between level 2 misses and the number of user-level OS servers more completely in subsequent sections.

The data in Table 5 and Table 6 demonstrate a breakdown of the old, Ultrix-style TLB management policies when carried over to the OSF/1 and Mach 3.0-based systems. In the next section, we explore a collection of TLB management techniques that are more effective for the OSF/1 and Mach 3.0-based systems.

5.0 Better TLB Management

When we set out to formulate new TLB management techniques, we asked ourselves several questions about the existing policies and how they might be changed. These questions are listed below, and the answers we arrived at can be found in the following subsections.

* What should the PTE placement policy be? That is, into which TLB partition (upper or lower) should the various PTE types be placed?
* What policies should guide adjustment of the partition point? What factors change the position of the optimal partition point?
* Which replacement policies are most effective for the two different partitions? That is, when a new PTE is inserted into the TLB, should the PTE that is evicted be selected by a FIFO or a random policy?
* What are the benefits of reducing miss costs for PTE types other than the most frequently used level 1 user PTEs?

5.1 PTE Placement and Partitioning

All of the baseline systems place level 2 PTEs into the lower partition and all other PTEs in the upper partition. Level 1 user PTEs, which have a low retrieval cost, are separated from level 2 PTEs, which have a much higher retrieval cost. However, it is unclear why the baseline systems mix the costly-to-retrieve level 1 kernel and level 3 PTEs with level 1 user PTEs. Furthermore, if mixing costly-to-retrieve mappings with level 1 user PTEs is acceptable, it is unclear whether PTE partitioning is required at all.

To determine the effects of other PTE placement choices, we modified the miss handlers of the baseline systems to implement other meaningful policies. The results are shown in Table 7. Policy A is identical to that implemented by the baseline systems.

The importance of some sort of partitioning is shown by Policy D, where all PTEs are mixed together and which demonstrates very poor performance. At first glance, the baseline policy A appears to be the most desirable. However, note that with policies B and C, the lower partition was not permitted to grow beyond 8 slots to accommodate the additional PTE types allocated to this partition.

Table 7: Alternate PTE Placement Policies

Policy   PTE Placement                                                            Cost
A        Level 2 PTEs in lower partition; all other PTEs in upper partition.      6.46
B        Level 2 and 3 PTEs in lower partition; level 1 user and kernel PTEs      9.62
         in upper partition.
C        All PTEs in lower partition, except for level 1 user PTEs, which are     8.06
         placed in upper partition.
D        No partitioning at all.                                                  55.15

This table shows the effect of alternate PTE placement policies on TLB management cost. The cost is the total time spent (in seconds) to handle TLB misses for the Mach 3.0 system running all of the benchmarks listed in Table 3. The partition point is fixed at 8 slots for the lower partition and 56 slots for the upper partition.

To see if the performance of policies B and C would improve with a larger lower partition, we modified the baseline miss handlers to vary the partition point from its fixed location at 8. Although the R2000 TLB only efficiently supports a partition point of 8, it is possible to overcome this limitation by generating the random replacement index in software; a sketch of this software index generation is given below. Although the resulting handlers are slowed to the extent that this technique doesn't make sense for usual system operation, it does enable us to emulate partition point movement for an architecture, like the R4000, that better supports this operation in hardware. Figure 4 shows the effect of changing the partition point for a Mach 3.0 system that implements policy B. The graph shows the varying cost of PTE misses from the upper and lower TLB partitions as the partition point is changed. The sum of these two competing costs is plotted as a third curve that shows an optimal region in which the partition point would best be set.

Figure 4: Changing the TLB Partition Point

This experiment was performed on the Mach 3.0 system implementing PTE placement policy B (see Table 7). Each data point represents a complete run of the benchmark suite with the partition statically fixed at a given point for the duration of the run. The x-axis (partition point) shows the number of slots allocated to the lower partition. The upper partition always consists of (64 - partition point) slots.

Figure 5 shows the results of this same experiment performed for PTE placement policies A, B and C. Only the total curves are shown. The constant line that represents policy D (no partition) is off the scale of this plot and is therefore not shown. Note that each of the policies has a different optimal point, but at these optimal points the performance of the three policies is roughly the same.

Figure 5: TLB Partitioning for Different PTE Placement Policies

This is the same experiment as that of Figure 4, except with different PTE placement policies (see Table 7). Only the total TLB management costs are shown. The total cost for policy D (no partition) is off the scale of the plot at 55.15 seconds.
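A minimal sketch of the software index generation mentioned above follows. On the R2000 the hardware random register always returns values in the fixed range 8 to 63, so to emulate a movable partition the modified handlers must choose an upper-partition slot themselves; the pseudo-random generator and the variable names here are our own assumptions, not the code actually used in the modified handlers.

    #define TLB_SLOTS 64

    static unsigned partition_point = 8;   /* first slot of the upper partition */
    static unsigned seed = 1;

    /* Software replacement for the hardware "random" register: return a slot
     * index in the range [partition_point, TLB_SLOTS - 1], so the effective
     * partition point can be moved (as the R4000 allows in hardware). */
    static unsigned random_upper_slot(void)
    {
        seed = seed * 1103515245u + 12345u;                 /* simple LCG step */
        return partition_point + (seed >> 16) % (TLB_SLOTS - partition_point);
    }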

From this we can conclude that the most important PTE placement policy is the separation of level 1 user and level 2 PTEs (to avoid the very poor performance of policy D). However, the differences between the other PTE placement policies A, B and C are negligible, provided that the partition point is tuned to the optimal region.

Our studies show that the optimal partitioning point is also dependent upon OS structure. To illustrate this point, we ran our benchmark suite under the OSF/1, Mach 3.0 and Mach 3.0+AFS systems and measured the time to service level 2 PTE misses. The results are shown in Figure 6. These curves illustrate that as operating system services are migrated from kernel space (in OSF/1) to user space (in the Mach 3.0-based systems), the reliance upon level 2 PTEs increases. This is because each user-level task requires a minimum of 3 level 2 PTEs to cover its text, data and stack segments. The OSF/1 system requires only 5 TLB slots[5] for level 2 PTEs in order to avoid TLB thrashing. However, the Mach 3.0+AFS system requires 6 additional TLB slots to hold level 2 PTEs for its user-level UNIX and AFS servers, bringing the total to 11.

[5] 2 PTEs are used to map kernel data structures and 3 additional PTEs are required for the UNIX process running in an OSF/1 task.

Figure 6: Optimal TLB Partitioning for Different Operating Systems

This graph shows the cost of servicing level 2 PTE misses for the three systems using PTE placement policy A. The partition point is varied as described in Figure 4.

A key characteristic of microkernel system structuring is that logically independent services reside in separate user-level tasks that communicate through message passing. The degree of service decomposition can be limited for performance reasons. However, careful software management of the TLB can extend the degree to which separate services can coexist in a system before performance degrades to an unacceptable level. To illustrate this more clearly, we constructed a workload that emulates the interaction between servers in a multi-server microkernel OS. In this workload, a collection of user-level tasks mimics the behavior of communicating OS servers by passing a token among themselves. The number of servers and the number of pages that each server touches before passing the token along can be varied. Figure 7 shows the results of running this multi-server emulator on the Mach 3.0 kernel. With each additional server, the optimal partitioning point moves farther to the right. A system that leaves the partition point fixed at 8 will quickly encounter a performance bottleneck due to the addition of servers. However, if the TLB partition point is adjusted to account for the number of interacting servers in the system, a much greater number of servers can be accommodated. Nevertheless, note that as more servers are added, the optimal point still tends to shift upwards, limiting the number of tightly-coupled servers that can coexist in the system. This bottleneck is best dealt with through additional hardware support in the form of larger TLBs or miss vectors dedicated to level 2 PTE misses.

Figure 7: TLB Partitioning under Multi-server Operating Systems

This graph shows total TLB management costs for Mach 3.0 running a workload that emulates a multi-server system by passing a token among different numbers of user-level tasks.
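The structure of this emulator can be pictured with the following sketch. The actual workload is built from Mach tasks communicating by message passing; here the token transport (a pipe-style descriptor pair) and all names are our own simplification, chosen only to show the essential loop: each emulated server touches a configurable number of its own pages, forcing its level 1 and level 2 PTEs back into the TLB, and then forwards the token.

    #include <unistd.h>

    #define PAGE_SIZE 4096

    /* One emulated server: on every token arrival, touch 'pages_to_touch'
     * pages of a private region, then pass the token to the next server. */
    void server_loop(int token_in, int token_out, char *region, int pages_to_touch)
    {
        char token;
        while (read(token_in, &token, 1) == 1) {
            for (int i = 0; i < pages_to_touch; i++)
                region[i * PAGE_SIZE] += 1;      /* touch one byte per page */
            write(token_out, &token, 1);         /* forward the token       */
        }
    }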

In addition to OS structure, the optimal TLB partition point is also dependent upon the workload. This is shown in Table 8, where we determined the optimal partition for each of the four different programs in our benchmark suite. Note that the optimal point is often very far away from the point fixed by the R2000 hardware (at 8).

Table 8: Optimal TLB Partitioning for Different Workloads

Workload     Optimal Partition Point   TLB Management Cost
Ousterhout   18                        1.27
spell        25                        0.11
gcc          14                        4.39
io           32                        0.43

These data were obtained by running the same experiment as described in Figure 4 (Mach 3.0 + policy B). Only the management cost (in seconds) for the optimal partition point of each benchmark is shown.

5.2 Dynamic Partitioning

We have shown that the best place to set the TLB partition point varies depending upon the PTE placement policy, operating system structure, and workload. Given knowledge of these factors ahead of time, it is possible to determine the optimal partition point and then statically fix it for the duration of some processing run. However, although an operating system can control PTE placement policy and have knowledge of its own structure, it can do little to predict the nature of future workloads that it must service. Although system administrators might have knowledge of the sort of workloads that are typically run at a given installation, parameters that must be tuned manually are often left untouched or are set incorrectly.

To address these problems, we have designed and implemented a simple, adaptive algorithm that dynamically self-tunes the TLB partition to the optimal point. The algorithm is invoked after some fixed number of TLB misses, at which time it decides to move the partition point either up, down, or not at all. It is based on a hill-climbing approach where the objective function is computed from the two most recent settings of the partition point. At each invocation, the algorithm tests to see if the most recent partition change resulted in a significant increase or decrease in TLB miss handling costs when compared against the previous setting. If so, the partition point is adjusted as appropriate. This algorithm tends to hone in on the optimal partition point and tracks this point as it changes with time.
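A sketch of the hill-climbing step is given below. The counter names, the cost bookkeeping and the one-slot step size are our own reading of the description above; the algorithm as described can also leave the partition unchanged when the cost difference is insignificant, a threshold test we omit here for brevity.

    /* TLB miss handling cost (in cycles) accumulated by the miss handlers
     * since the last invocation of the tuning algorithm. */
    static unsigned long interval_cost;

    static unsigned long prev_cost = ~0UL;   /* cost seen for previous setting */
    static int           direction = 1;      /* +1 grows the lower partition   */
    static int           partition_point = 8;

    void tune_partition(void)                /* invoked every K TLB misses     */
    {
        if (interval_cost > prev_cost)
            direction = -direction;          /* last move hurt: reverse course */
        /* otherwise the last move helped (or was neutral): keep going         */

        prev_cost = interval_cost;
        interval_cost = 0;

        partition_point += direction;        /* move the partition by one slot */
        if (partition_point < 1)  { partition_point = 1;  direction = 1;  }
        if (partition_point > 62) { partition_point = 62; direction = -1; }
    }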

We tested this algorithm on each of the benchmarks in our suite and compared the resultant miss handling costs against runs that fix the partition at 8 and runs that fix it at the static optimal point for the given benchmark. The results are shown in Table 9. Note that dynamic partitioning yields results that are at times slightly better than the static optimal. To explain this effect, we performed another experiment that records the movement of the TLB partition point during the run of the gcc benchmark. The results (see Figure 8) show that the optimal partition point changes with time as a benchmark moves among its working sets. Because the dynamic partitioning algorithm tracks the optimal point with time, it has the potential to give slightly better results than the static optimal, which remains fixed at some "good average point" for the duration of a run.

Table 9: Different TLB Partitioning Schemes

Workload     Fixed Partitioning   Static Partitioning   Dynamic Partitioning
Ousterhout   3.92                 1.27                  1.11
spell        0.28                 0.11                  0.11
gcc          4.94                 4.39                  4.28
io           1.30                 0.43                  0.43

This table compares the total TLB management costs (in seconds) when fixing the partition at 8, when setting it to the static optimal point, and when using dynamic partitioning. The PTE placement policy is B.

Figure 8: Dynamic TLB Partitioning during gcc

This graph shows the movement of the TLB partition point with time while running gcc on a system that implements the dynamic, adaptive partitioning algorithm.

The invocation period for the dynamic partitioning algorithm can be set so that its overhead is minimal. It should be noted, however, that there is an additional cost for maintaining the TLB miss counts that are required to compute the objective function. Although this cost is negligible for the already costly level 2 and level 3 PTE misses, it is more substantial for the highly-tuned level 1 PTE miss handler.[6] Hardware support in the form of a register that counts level 1 misses could help to reduce this cost.

[6] Maintaining a memory-resident counter in the level 1 miss handler requires a load-increment-store sequence. On the R2000, this can require anywhere from 4 to 10 cycles, depending on whether the memory reference hits the data cache. This is a 20% to 50% increase over the 20-cycle average currently required by the level 1 miss handler.

5.3 Replacement Policies

The baseline systems use a FIFO replacement policy for the lower partition and a random replacement policy for the upper partition when selecting a PTE to evict from the TLB after a miss. To explore the effects of using other replacement policies in these two partitions, we modified the miss handlers to try other combinations of FIFO or random replacement in the upper and lower partitions. The results of this experiment are shown in Figure 9. For these workloads, differences between the replacement policies are negligible over the full range of TLB partition points, indicating that the choice of replacement policy is not very important.

Figure 9: Replacement Policies

This graph shows the performance of different replacement policies (random or FIFO) for the Mach 3.0 system implementing PTE placement policies A or B.

5.4 Reducing Miss Costs

The design of the R2000 TLB was influenced by previous research results that showed level 1 user PTE misses to be by far the most frequent type of TLB miss, occurring greater than 95% of the time [Clark et al. 1985]. This was the justification for providing special vectoring support only for this miss type in the R2000 TLB [DeMoney et al. 1986]. Although our measurements for the older-style Ultrix OS support this assumption (recall Table 5), it begins to break down in the newer OSF/1 and Mach 3.0 systems.

We wanted to quantify the benefit of reducing miss handling costs for other PTE types to determine if the additional hardware or software needed to accomplish this is justified. Our computations are shown in Table 10 (for the OSF/1 system) and Table 11 (for the Mach 3.0 + AFS system). The OSF/1 system benefits substantially from a reduced level 1 kernel miss cost due to its heavy use of mapped kernel data structures. The Mach 3.0 + AFS system benefits mostly from a reduced level 2 miss cost due to its increased degree of service decomposition. Although both systems benefit from a reduced level 3 miss cost, the improvements are slight and probably do not justify the effort required to reduce the handling time for this miss type.

Table 10: Reducing PTE Miss Costs for OSF/1

OSF/1                             Level 1 User   Level 1 Kernel   Level 2   Level 3
Counts                            2,111,741      541,323          1,449     56,365
% of Total Misses                 76.55%         19.62%           0.05%     2.04%
Previous Cost Per Miss (cycles)   20             294              407       286
New Cost Per Miss (cycles)        20             20               50        100
Previous Total Cost (seconds)     2.53           9.55             0.04      0.97
New Total Cost (seconds)          2.53           0.65             0.00      0.34
Total Time Saved (seconds)        0.00           8.90             0.04      0.63

These computations are based on the counts of Table 5 for OSF/1. We estimated the minimal achievable miss costs for each of the PTE types and recomputed the resultant total handling times. The category of "other" misses is too complex (and too infrequently occurring) to justify any attempt at reducing its handling cost, and is therefore not shown here.
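As a check on Table 10, the level 1 kernel row can be reproduced directly: 541,323 misses at the current 294 cycles cost 541,323 x 294 x 60 ns, or about 9.55 seconds, whereas at the assumed 20-cycle cost they would take 541,323 x 20 x 60 ns, or about 0.65 seconds, a saving of roughly 8.90 seconds.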

Table 11: Reduced Miss Costs in Mach 3.0 + AFS

Mach 3.0 + AFS                    Level 1 User   Level 1 Kernel   Level 2   Level 3
Counts                            7,417,212      172,649          206,257   135,764
% of Total Misses                 92.60%         2.16%            2.58%     1.70%
Previous Cost Per Miss (cycles)   20             294              407       286
New Cost Per Miss (cycles)        20             20               50        100
Previous Total Cost (seconds)     8.90           2.96             5.04      2.39
New Total Cost (seconds)          8.90           0.21             0.62      0.81
Total Time Saved (seconds)        0.00           2.75             4.42      1.58

These computations are the same as described in the caption for Table 10, but they apply to the Mach 3.0 + AFS system.

Miss costs can be reduced either through better hardware support or through more careful coding of the software miss handlers. The R4000 lowers the cost of level 1 kernel misses by allowing them to be serviced by the special TLB miss vector. We therefore assumed that the R2000 could handle level 1 kernel misses in the same way the R4000 does and estimated the cost to be the same as the level 1 user miss cost (about 20 cycles). For level 2 and level 3 misses, we believe that a software approach could be used to reduce these costs to about 50 and 100 cycles, respectively. This would be implemented by testing for level 2 and level 3 PTE misses at the beginning of the generic exception vector, before invocation of the code that saves register state and allocates a kernel stack. Although this state-saving code is required to handle more complex interrupts, exceptions and kernel traps, it is not needed for TLB miss handlers, which simply index into the appropriate page table for a PTE and insert it into the TLB. We are currently implementing this in the OSF/1 and Mach 3.0-based systems to test the approach.
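The early test described above might be structured as in the sketch below. The real R2000 handlers are written in assembly and read the faulting address from a coprocessor 0 register; we write the dispatch in C with assumed helper names (none of these are actual OSF/1 or Mach symbols) purely to show where the check would sit relative to the expensive state-saving path.

    extern unsigned long bad_vaddr(void);          /* faulting virtual address  */
    extern int  is_level2_miss(unsigned long va);  /* va falls in a mapped L1   */
                                                   /* page-table region?        */
    extern int  is_level3_miss(unsigned long va);  /* va falls in an L2 region? */
    extern void refill_level2(unsigned long va);   /* read PTE, write TLB entry */
    extern void refill_level3(unsigned long va);
    extern void full_exception_path(void);         /* saves registers, allocates
                                                      a kernel stack, dispatches */

    void generic_exception_entry(void)
    {
        unsigned long va = bad_vaddr();

        /* Handle the simple page-table refills before paying for the general
         * state-saving sequence; these paths only read a PTE and update the TLB. */
        if (is_level3_miss(va)) { refill_level3(va); return; }
        if (is_level2_miss(va)) { refill_level2(va); return; }

        full_exception_path();  /* everything else takes the slow path */
    }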

6.0 Summary

TLBs that require software management are becoming more commonplace, implying that operating system writers are charged with the new duty of writing TLB miss handlers. This task is complicated by the large number of management policy choices and the special requirements of newer generation operating systems like OSF/1 and Mach 3.0.

Our studies show that the most important management policies concern the partitioning of TLB slots to separate level 1 user PTEs from level 2 PTEs. Other policies governing PTE placement or replacement are mostly irrelevant, provided that the partition point is tuned to the optimal region for the choices selected. The optimal partition point varies not only with management policy, but also with operating system structure, workload and time. We have shown that partition point tuning can be performed automatically through a simple, adaptive algorithm that extends the number of user-level servers that a microkernel OS can efficiently support.

Reducing the cost of handling individual PTE misses also proves to be an effective TLB management technique. Although older operating systems justify special treatment of only the level 1 user PTE misses, the newer OSF/1 and Mach 3.0 systems also show high levels of level 1 kernel and level 2 PTE misses. Special hardware support or more careful coding of the miss handlers for these PTE types is worthwhile.

Although our study was performed on an R2000-based machine, our results and techniques extend to the MIPS R4000 and to other multi-server operating systems such as Windows NT [Custer 1993].

7.0 Acknowledgments

We thank Mary Thompson for her patience and assistance as we assembled our OSF/1 and Mach 3.0 systems. We also extend thanks to Richard Draves and Brian Bershad for helping us to interpret our results.

8.0 References

[Accetta et al. 1986] M. Accetta, R. Baron, et al. Mach: A new kernel foundation for UNIX development. In Summer 1986 USENIX Conference, USENIX, 1986.

[Anderson et al. 1991] T.E. Anderson, H.M. Levy, B.N. Bershad and E.D. Lazowska. The interaction of architecture and operating system design. In Fourth International Conference on Architectural Support for Programming Languages and Operating Systems, Santa Clara, California, ACM, 108-119, 1991.

[Clapp et al. 1986] R.M. Clapp, L.J. Duchesneau, R.A. Volz, T.N. Mudge, and T. Schultze. "Toward Real-time Performance Benchmarks for Ada." Communications of the ACM 29 (8): 760-778, 1986.

[Clark et al. 1985] D.W. Clark and J.S. Emer. "Performance of the VAX-11/780 translation buffer: Simulation and measurement." ACM Transactions on Computer Systems 3 (1): 31-62, 1985.

[Custer 1993] H. Custer. Inside Windows NT. Redmond, Washington, Microsoft Press, 1993.

[DeMoney et al. 1986] M. DeMoney, J. Moore and J. Mashey. Operating system support on a RISC. In COMPCON, 138-143, 1986.

[Digital 1992] Digital. Alpha Architecture Handbook. USA, Digital Equipment Corporation, 1992.

[Guillemont et al. 1991] M. Guillemont, J. Lipkis, D. Orr, and M. Rozier. A second-generation micro-kernel based UNIX: Lessons in performance and compatibility. In USENIX Winter 1991 Proceedings, Dallas, Texas, USENIX, 1991.

[Hewlett-Packard 1990] Hewlett-Packard. PA-RISC 1.1 Architecture and Instruction Set Reference Manual. Hewlett-Packard, Inc., 1990.

[Kane et al. 1992] G. Kane and J. Heinrich. MIPS RISC Architecture. Prentice-Hall, Inc., 1992.

[McKusick et al. 1984] M.K. McKusick, W.N. Joy, S.J. Leffler and R.S. Fabry. "A fast file system for UNIX." ACM Transactions on Computer Systems 2 (3): 181-197, 1984.

[Nagle et al. 1992] D. Nagle, R. Uhlig and T. Mudge. Monster: A Tool for Analyzing the Interaction Between Operating Systems and Computer Architectures. The University of Michigan, 1992.

[Ousterhout 1989] J. Ousterhout. "Why aren't operating systems getting faster as fast as hardware?" WRL Technical Note (TN-11), 1989.

[Rashid et al. 1988] R. Rashid, A. Tevanian, et al. "Machine-Independent Virtual Memory Management for Paged Uniprocessor and Multiprocessor Architectures." IEEE Transactions on Computers 37 (8): 896-908, 1988.

[Satyanarayanan 1990] M. Satyanarayanan. "Scalable, secure, and highly available distributed file access." IEEE Computer 23 (5): 9-21, 1990.

[SPEC 1991] SPEC. The SPEC Benchmark Suite. 1991.

[Wilkes et al. 1992] J. Wilkes and B. Sears. A comparison of protection lookaside buffers and the PA-RISC protection architecture. HP Laboratories, 1992.