Linux: a Portable Operating System

Linus Torvalds

Master of Science Thesis
University of Helsinki, Department of Computer Science
Helsinki, January 31, 1997

Faculty: Faculty of Science
Department: Department of Computer Science
Author: Linus Torvalds
Title: Linux: a Portable Operating System
Subject: Computer Science
Level: Master of Science thesis
Month and year: January 1997
Number of pages: 52 p.

Abstract

We explore the hardware portability issues in Linux that were uncovered when porting the operating
system to multiple CPU and bus architectures. We also discuss software interface portability issues,
especially with regard to binary compatibility with other operating systems that can share the same
hardware platform. The approach taken in Linux is described, with a few example architectures
covered in some more detail.

Computing Reviews Classification:
D.4 (Operating Systems), C.1 (Processor Architectures), D.2.7 (Software Engineering — Portability),
D.4.7 (Operating Systems — Organization and Design)

Keywords: Linux, Operating System, Portability, Alpha, Sparc
Where deposited: Department of Computer Science Library, serial number C–1997–12
Acknowledgements  
While the Linux project has been closely associated with me personally, partly due  
to the name, I would like to make it very clear that the Linux operating system is  
a huge project done co-operatively by lots of people all over the world.  
Even if you discount all the user-level programs that are an integral part of any  
running Linux system, just the kernel contains code from hundreds of people from  
all around the world.  
Thanks to all of you.  
Contents

1 Introduction                                                     1

2 Linux — the Design and Implementation                            7
  2.1 The design — compatibility                                    8
  2.2 Implementation                                                9
  2.3 Kernel organization                                          10
  2.4 The virtual machine implementation                           11

3 Software Interface Portability Issues                           14
  3.1 The implementation of the software interface                 15
  3.2 Personalities                                                17

4 Hardware Portability Issues                                     19
  4.1 Data representation portability issues                       19
      4.1.1 Different data sizes                                    20
      4.1.2 The impact of alignment restrictions and byte order     22
  4.2 Kernel memory management                                     24
      4.2.1 Memory management through virtual page tables           26
      4.2.2 Page table mapping coherence                            29
  4.3 Ensuring cache coherency and atomic operations               30
      4.3.1 Instruction cache coherency                             30
      4.3.2 Data cache coherency                                    33
      4.3.3 The case against virtual data caches                    34
  4.4 Multiprocessor issues                                        38
      4.4.1 Symmetric multi-processing                              39
      4.4.2 Massively parallel systems                              42
  4.5 Device drivers: accessing the I/O bus                        43
      4.5.1 Proprietary buses                                       44
      4.5.2 PCI — the emerging standard                             45

5 Example Architectures                                           46
  5.1 The Alpha                                                    46
  5.2 The Sparc line of computers                                  48

6 Future Work                                                     51

References                                                        53
1 Introduction  
This thesis is an introduction to the portability issues of the Linux kernel. It is not only meant to be my Master's thesis; I hope it will actually be useful for people who want to port Linux to other architectures, or for people who just want to know the design issues and goals of Linux on different platforms (as opposed to the goals of Linux in general).
In an attempt to make this practical, I try to explain the general issues, usually followed by a comment on one or more specific ports with some pointers on how something is implemented in practice1.
Note that while the term “Linux” is often used to denote the whole system that  
is built up from the basic kernel and all the system and user applications that are  
usually running on top of the kernel, in the context of this thesis only the kernel itself  
is considered. The portability of system and user applications is here considered a  
totally different issue, even though there are obviously some common concerns.
Also, you should not expect this to be a line-by-line (or even very low-level)  
explanation of the kernel — rather the opposite. Instead of trying to explain what  
all the specific functions do, this paper tries to explain the basic ideas, and the  
specific examples should be seen as just clarifying how Linux works on some specific  
hardware in some specific case.  
The thesis generally follows a common format: each section first explains the  
generic problems and potential solutions, to then be followed by an “Implementa-  
tion” section that gives Linux-specific implementation details. The reader should  
be able to follow the text even with little or no knowledge of Linux itself, although  
a general knowledge of the issues is obviously expected.  
History of Linux portability  
People who have followed Linux from the very beginning may find the title of this  
paper, “Linux: a Portable Operating System”, a rather ironic statement. Being  
portable was not what Linux was about initially; the early versions of Linux were  
extremely unportable.  
1For the same reason this paper will also be available in an electronic version which I will try  
to keep up-to-date as people send me comments or as changes to the system occur. Linux is by  
no means a static system, and some of the issues, especially when it comes to multiple CPU’s, are  
so young that major changes can still happen.  
The original Linux was not only extremely PC-centric, it wallowed in features  
available on PC’s and was totally unconcerned with most portability issues other  
than at a user level. The original Intel 80386 architecture that Linux was written for is perhaps the prime example of current CISC design, and has high-level support for features other current CPU's would not even dream about implementing (see for example [CG87]).
Linux did not even try to avoid using the x86 features available to those early  
versions — quite the opposite. Linux started out as a project to find out exactly  
what you could do with the CPU, and as such it used just about every feature of the CPU you could find, ranging from using segments for inter-process protection to hardware-assisted process switching.
However, the initial unportable approach was a case of an unportable implemen-  
tation rather than an inherently unportable design. The goal of being compatible  
with other UNIX’s resulted in a system that had portable interfaces despite the im-  
plementation details. That portable design essentially made Linux itself reasonably  
portable.  
The first Linux that was based on an architecture different from the Intel 80386
was the port of Linux to the Motorola 680x0 family that actually got started rather  
early in the development of Linux. However, the original Linux/68k project was not  
really concerned with portability, but rather with making Linux run on 68k-based  
Amiga and Atari computers.  
Please note here the fundamental difference between the concepts of “portability” and “ability to run on different architectures”. A portable program is designed and implemented in such a way that it will run on different systems, while even a
non-portable program can be forced to run on other systems. The original Linux/68k  
project did not try to make Linux portable, it only tried to re-write the Linux kernel  
so that the new version would run on a new architecture instead of the original one.  
While the Linux/68k project was in itself a huge step forward, the real portability  
work began when the author was offered an Alpha system by Digital in the hope of
making Linux work on the new Alpha architecture. Very early it became clear that  
in order to be able to maintain both the stable Intel-based platform and support a  
new and in some respects radically different platform, the kernel really needed some major re-engineering to make it fundamentally more portable. The issues and the end result are what is described in this paper.
The current situation is that the Alpha port has been totally successful, and is  
in wide use and fully integrated into the main kernel sources. At the same time, the  
portability issues have not only been considered from the viewpoint of the Alpha  
port, but on a more general scale. As a result, other porting efforts are much easier,
and the original Linux/68k project has updated its base to the new portable layout  
and an integration of that work is in progress [Law97].  
In addition to the Alpha and 68k ports, the port to the Sparc architecture has  
also been very successful and is also in active day-to-day use and is mostly integrated  
into the base sources. In addition to these four platforms there are also ports to  
various other architectures, including the MIPS [Bäc96], PowerPC [Dou96] and the
ARM (and StrongARM) [Kin96] lines of CPU’s.  
At the same time the Linux kernel has also been ported to virtual environments  
like the Mach [App96] and L4 microkernels [Hoh96]. In those environments the  
Linux kernel is running on top of another kernel that provides another view of the  
physical machine, to potentially get the best of both worlds.  
Portability concerns  
Unlike some other software projects, an operating system does not have any inherent  
reason for existing: the role of an operating system is only to be the interface  
between user programs that do the real work and the hardware that work is to be done on.
An OS without viable user programs is useless, and an OS that does not work with  
the hardware available is likewise pointless. The worth of an OS is measured in  
how well it supports user processes, and how well it can implement the services it  
provides on the hardware.  
As such, the kernel itself is defined by both the hardware it runs on and the software it supports (Figure 1). When writing the OS and when porting it to a new architecture, both parts of the equation should be taken into account. The person doing the port seldom gets to choose his or her own idealized hardware, and similarly the porting effort may be constrained by user level software requirements.

[Figure 1: The Kernel: the interface between user and hardware. The kernel sits between the user and system software interface above it and the hardware interface (CPU and I/O) below it.]

On the hardware side, the concerns can be roughly divided into two categories: the architecture of the CPU subsystem (including memory management, caches and optionally multiprocessing details) and the architecture of the I/O interfaces. On the other hand, the software side problems are generally
with regard to compatibility: to what extent (if any) do we want to be compatible  
with other, possibly similar, operating systems.  
However, more important than the hardware and software porting details them-  
selves is having a general design that makes it easy to be portable. It should allow  
for easy addition of new interfaces (both on the software and the hardware side)  
without impacting existing code. In the case of Linux, this is especially important  
due to the way development is done: a distributed development community with no  
direct face-to-face contact between developers and little central authority to resolve  
conflicts.  
While the different porting efforts should be able to work without impacting each other, general maintenance concerns require extensive sharing of code. If the different architectures end up doing everything on their own with very little common code, the whole goal of portability has been lost: the different platforms might as well be considered separate projects.
For maintenance reasons the architecture-specific code needs to be small and sim-  
ple, and all the complex functionality should be handled by architecture-independent  
code that is shared across all platforms. This, in turn, requires a common concept  
shared across all architectures. That concept is called the virtual machine.  
The virtual machine concept  
The most common approach to creating a portable system that has to be able to  
adapt to dierent hardware and software requirements is to use the concept of a  
virtual machine [GC94, p. 70], [Tan92, p. 22].  
Instead of having separate architectures that all would have to be handled sep-  
arately, the idea of the virtual machine is to create an idealized abstraction of the  
machine, and through that abstraction let the common code operate on the real  
hardware. The common code need not itself be aware of the actual hardware de-  
tails, and as such can operate on any physical hardware as in Figure 2.  
The idea of a virtualized machine is not new in itself: just the fact that the  
operating system is generally written in some high-level language is in itself a vir-  
tualization of the underlying hardware with the help of the compiler that does the  
translation from the high-level language to the hardware-specific machine code.  
With the virtual machine abstraction, we only extend the virtualization already  
offered by the high-level language to something that can handle the details needed to
implement the kernel on top of the actual physical hardware. While most high-level  
languages are good for virtualizing normal user level activity, we need to extend  
that to cover all the details of the hardware.  
The concept is in wide use because it is so useful: on Windows NT the virtual machine is called HAL (Hardware Abstraction Layer), while SVR4 UNIX has a more limited HAT (Hardware Address Translation) layer to handle just the machine-dependent virtual memory address translations [GC94, p. 92].

[Figure 2: The Virtual Machine. The idealized kernel sits between the user and system software interface and the hardware interface (CPU and I/O).]

With this kind of idealized kernel that does not interface directly to the hardware but to the virtual machine, the act of porting the operating system to another platform or another user level binary interface is no longer a matter of rewriting everything. Instead, only the outlying interfaces to the virtual machine have to be implemented on top of the physical machine, and the majority of code can be left untouched.
Obviously, the choice of virtual machine semantics is of primary importance to how easily the system can be ported, and how efficient the resulting operating system is. Implementing a virtual machine that is at too high a level results in much architecture-specific code, while a virtual machine that specifies too much may not map efficiently onto some specific hardware.
The Linux kernel virtual machine  
While the concept of a virtual machine to handle portability issues cleanly is very  
useful and widely used, it does add a layer of abstraction to the kernel. In fact,  
depending on how this is done it can add two layers of abstraction: the layer between  
the kernel and user space, and the layer between the kernel and the hardware.  
In many cases, the layer between the kernel and user space can be ignored,  
because the raw kernel interfaces are often exported more-or-less directly as system  
calls to the user processes. This is the historical approach taken by most operating  
systems, simply because the operating system was made for the hardware, and no  
previous standard for the user level interface existed.  
However, in the case of Linux, previous user level standards often exist on ar-  
chitectures Linux is ported to, and to be optimally useful Linux needs to conform  
to those standards in order to leverage off existing applications. Thus, both the
hardware and the user-level abstraction layers are generally required.  
The problem with abstraction layers is that they add overhead when concepts  
have to be translated from layer to layer. This has generally been one reason hard-  
ware abstraction layers have been shunned: they add extra processing that could be  
avoided by programming directly on the bare hardware or software interface. This  
has to be taken into account, especially as one of the goals of Linux in the first place  
was to be efficient.
2 Linux — the Design and Implementation  
Unlike most modern research operating systems, the Linux approach to operating  
systems is very pragmatic, and rather than concentrating on the design of a new  
kernel the focal point has been on a solid and efficient implementation. Instead of
throwing away proven concepts they have been refined in new ways. [Tor93]  
The Linux design is based on three main design issues that have directly influ-  
enced the implementation. The main issues are:  
Simplicity. The obviousness of this sometimes means that it is overlooked, but  
an operating system kernel is a complex entity that has to be able to work  
in an uncontrolled and potentially even hostile environment. Programming  
errors in the operating system are much less acceptable than in most normal  
programs, and the security issues are paramount. A complex design is harder  
to verify against either errors or security issues, so simplicity of the basic  
services is required.  
Efficiency. The kernel is involved with almost all activity in the machine, and as such the kernel must be efficient enough to never be seen as a performance constraint.
Compatibility. While the basic operations of a kernel are of supreme interest  
to researchers in the operating system area, most people do not want to know  
what is going on as long as their programs work. As such, one of the most  
important features of an operating system is the lack of surprises it offers to the user, be that user a normal end-user or a programmer. Even new features should be offered as a superset of, rather than instead of, the functionality that the user is accustomed to.
These design issues, coupled with a very pragmatic approach to programming, have led to a system that shares features of both the traditional monolithic kernel design and of the newer research projects into modern microkernels. The microkernel design itself was discarded here due to doubts about the efficiency and simplicity of the approach, but during the development many of the features usually associated with microkernels have been implemented on top of a more traditional monolithic kernel.
The design has also been strongly influenced by the general availability of Linux  
source code to anybody with access to the Internet. That has not only led to a very  
dynamic development community that is physically spread all over the world, it has  
also directly impacted the design itself.  
Rather than concentrate on a concept of a kernel binary that interfaces to the  
rest of the world (be it hardware or software), the Linux approach has been to have  
sources available that can be used to create the binary we want. This approach  
allows the actual interfaces to be specified at compile time rather than at run-time.  
The more traditional link-time and run-time interfaces are naturally also avail-  
able, and give access to more dynamic feature sets. By using run-time dynamic  
linking of modules into the kernel, the user can add and remove features as needed,  
but the availability of sources always adds the possibility of increased performance  
by specifying the configuration at compile time and letting the compiler handle  
interfacing issues.  
2.1 The design — compatibility  
The three main design issues – simplicity, efficiency and compatibility – have naturally formed the basis for the kernel implementation. The compatibility issue essentially defines the external shape of the system while the issues of simplicity and efficiency define the internal implementation of it.
The wish to be compatible with other UNIX-type operating systems more or less  
defines the kernel interface that is exported to system and user applications. Note  
that compatibility extends to outside the machine itself: not only does the compat-  
ibility issue require a compatible programming and user interface to the kernel, it  
also involves the integration of the system with other external requirements. These  
requirements include following standard protocols on external networks, but also  
being able to live together on the same machine with other operating systems. This  
can involve supporting a common disk partitioning scheme, or a particular boot-up  
sequence.  
As being compatible so clearly restricts the external interfaces of the system,  
many new operating system projects choose to be less compatible in order to be  
able to fully express the wishes of the designer. Instead of being constrained by  
what others have done, such a project can soar to new heights of design beauty, at  
least in the eyes of the designer.  
Sadly, those projects invariably tend to fail in real life, getting only a niche mar-  
ket (if that) due to their lack of common applications or interfaces. As mentioned  
earlier, the worth of an OS lies not in what the OS can do, but in what the OS  
allows others to do. Therefore a more pragmatic approach is often required.  
While compatibility clearly restricts the design of the system, it at the same  
time makes the system easier to integrate with others, and also potentially allows  
a large base of existing applications to be used directly on top of the new kernel.  
In the case of Linux, being compatible with generic UNIX meant that a plethora  
of freely available UNIX applications were directly useful on top of Linux. So while  
the collar of compatibility may feel chafing at times, it also allows the system to be  
built on top of the work done by others.  
However, while being compatible is good for the usability of the system, it tends  
also to be boring. A grey mass of compatible operating systems is not what makes  
the heart of system designers beat more quickly, nor does it allow the systems to  
show much innovation other than at an implementation layer.  
Because the Linux project has been done non-commercially by people all over the  
world connected by the Internet, a boring system would simply not work: lacking  
most of the money-related incentives Linux depends on being vital and interesting  
to attract developers. As a result, Linux inevitably has accumulated interesting  
extensions to the basic UNIX compatibility, and one of the challenges has been to  
add these extensions in a manner that cleanly integrates with other code.  
2.2 Implementation  
While the basic shape of the system is constrained by the compatibility issues, the  
designer is free to decide how that shape is in fact implemented.  
While most modern operating system research tends to favor a microkernel and  
object oriented design, Linux has leaned towards a much more traditional design.  
Like the UNIX that it tries to be compatible with, Linux uses a very traditional  
monolithic kernel mostly written in portable ANSI C.  
Even though portable C is generally well suited for systems programming (that was one of the original uses of the language, after all), the Linux kernel additionally uses some features offered by the GNU C compiler that make it easier to implement certain features efficiently. The two notable extensions to C used are inline functions and inline assembly language code embedded within C.
In addition to the code written in C, some low-level routines have been writ-  
ten directly in assembly language either due to performance reasons or due to cir-  
cumstances where the C semantics are unable to cover the exact code generation  
requirements (usually requirements imposed by the hardware itself, such as fault  
handling routines).  
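As a concrete illustration of these two extensions, the following sketch shows an inline function and an inline assembly statement in the style used in the kernel sources. The function names are invented for this example, and the cli instruction shown is specific to the x86; the sketch is illustrative rather than actual kernel code.

    /* GNU C extension 1: an inline function, expanded at the call site
     * with no function-call overhead. */
    static inline int max_int(int a, int b)
    {
            return a > b ? a : b;
    }

    /* GNU C extension 2: inline assembly embedded in C.  This example
     * disables interrupts on an x86 CPU; the "memory" clobber tells the
     * compiler not to reorder memory accesses across the instruction. */
    static inline void disable_interrupts(void)
    {
            __asm__ __volatile__("cli" : : : "memory");
    }

Because both constructs are resolved at compile time, they give the readability of ordinary C functions without the cost of a real call.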
The traditional design was chosen because of serious doubts about the perfor-  
mance and simplicity of microkernel-based operating systems. While much work has  
been done on the performance front of microkernels (see for example [BGG+91]),  
the very fact that such extensive work is needed in the first place indicates that the  
issues are by no means obvious.  
Also, while simplicity has been used as one of the arguments for a microkernel-  
based design, that argument seems dubious at best considering the complex inter-  
actions required between the different modules in a microkernel design. This matter
is generally made worse by the complex performance issues, and the basic simplicity  
of microkernels is often overwhelmed by the complexity of these issues.  
While this thesis by no means tries to argue against microkernels per se, the  
author feels that the advantages of microkernels have been overstated in the liter-  
ature, and that a simple and straightforward traditional approach may serve the  
needs of users better. Microkernels tend to be best suited for distributed systems  
rather than traditional UNIX type operating conditions. Even then the question  
remains at what level the distribution should be done2.  
That said, the very basic UNIX design to some degree favors a microkernel ap-  
proach to user level services. Even a traditional monolithic UNIX kernel can be  
seen as only a small microkernel when considering all the services done by funda-  
mental non-kernel UNIX daemons. While many other operating systems tend to  
handle user verification and session handling in the kernel (and in some cases even  
graphics services!), that mentality is against the very philosophy of UNIX, where  
those services are provided through user-level daemons.  
2.3 Kernel organization  
The basic Linux kernel is directly organized around the primary services it provides:  
process handling, memory management, file system management, network access  
and the drivers for the hardware. These areas correspond to the kernel source  
directories kernel, mm, fs, net and drivers respectively. Additionally, these areas  
are often sub-divided into specific services (see Figure 3).  
The basic portability tenet in Linux is the single source concept, i.e. all ar-  
chitectures share the same basic source tree. The architecture-dependent code is  
2See Chapter 4.4.2 on Linux implementations of distributed environments  
available in the arch directory which has sub-directories for each architecture sup-  
ported. While there are no absolute rules about what an architecture has to offer,
common concerns have generally resulted in similar organizations of the architecture  
sub-directory, and most of them follow a setup with separate directories for process  
handling (kernel), memory management (mm) and library routines optimized for  
the specific architecture (lib).  
This hierarchical subdivision of the sources not only makes it easier to get an  
overview of the system, it also acts to modularize the kernel. While nothing prevents code from being written that ignores this conceptual layout, maintainability requires that the developers keep clean interfaces between different parts of the system. The modularization not only keeps the sources more understandable, it is also a requirement when many developers are working on the same project at the same time. Different developers can work on their specific area of the kernel without generally having to worry about what happens in other areas.
The modular nature of the kernel has also led to a limited kind of object oriented design. While the C language used by the kernel does not inherently support this kind of object orientation, the effects can be similar through the use of function pointers and common data structures. This is especially notable in the file system code, where the virtual file system layer that implements the general UNIX file system semantics will see the specific low-level implementations through this kind of object interface3.

[Figure 3: Kernel organization. The linux source tree, showing among others the directories kernel, mm, fs (with ext2, nfs and proc below it) and arch (with alpha, i386, m68k and mips below it).]
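As an illustration of this function-pointer style, the following sketch shows how a generic file system layer can call file-system-specific code through a table of operations. The structure and member names are simplified inventions for this example and do not correspond to the actual kernel definitions.

    #include <stddef.h>

    /* Hypothetical, simplified "object interface": each low-level file
     * system fills in a table of function pointers that the generic
     * virtual file system code calls. */
    struct inode;

    struct inode_operations {
            int  (*lookup)(struct inode *dir, const char *name,
                           struct inode **res);
            int  (*readdir)(struct inode *dir, void *buf, size_t len);
            void (*release)(struct inode *inode);
    };

    struct inode {
            unsigned long                  i_ino;  /* inode number             */
            const struct inode_operations *i_op;   /* file-system-specific ops */
    };

    /* Generic code: it neither knows nor cares which file system it talks to. */
    static int vfs_lookup(struct inode *dir, const char *name, struct inode **res)
    {
            if (!dir->i_op || !dir->i_op->lookup)
                    return -1;                      /* operation not supported */
            return dir->i_op->lookup(dir, name, res);
    }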
2.4 The virtual machine implementation

While abstraction layers required by a virtual machine approach can generally be a performance problem, Linux has a big advantage: source availability and a monolithic kernel.

Rather than having a binary-level abstraction layer, the abstraction layer can be moved up to the source level. That way many of the transformations required can be handled by the compiler at kernel compile-time rather than at run-time, and will as such not impact performance adversely.

3The use of C++ would have allowed these kinds of object interface features to be described on a language level. This was tried early in the Linux development process, but at that time the compiler technology was not stable enough, and there were concerns that a too high-level interface might hide details that are critical in a kernel.
As it is a monolithic kernel, there are no external entities and fixed protocols that  
have to be taken into account, so the compiler is free to optimize away operations  
that are not needed on specific architectures. More specifically, because nothing  
else depends on the internal data structures those data structures can be tailored at  
a source level to match the user or hardware requirements as closely as necessary.  
The way the Linux kernel virtual machine is implemented is through the creation  
of architecture-specific header files and a common API between the generic kernel  
code and the architecture-specific files. These header files create the image of the  
virtual machine that the common code is running on.  
The virtual machine header files define the machine by using the C language pre-  
processor (#define directives) and by using various type defines and inline functions  
to implement the virtual machine on top of the physical hardware. While inline  
functions are not part of standard C, they are implemented by most C compilers,  
and the support for them by the normal Linux C compiler (GNU C compiler, also  
known as “gcc”) is extensive. As the GNU C compiler is itself very portable, the  
use of C extensions is not a problem.  
The use of gcc also allows the architecture header files to use inline assembly routines to implement code that cannot be expressed efficiently with portable C. This, together with inline functions, allows the compiler to efficiently map the virtual machine semantics on top of any specific hardware, and means that the virtual machine abstraction results in little or no performance loss.
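The general shape of such an architecture header is sketched below. The file, macro and function names are invented for illustration; an actual architecture would provide its own values and implementations under the agreed-upon names.

    /* Hypothetical <asm/example.h> for one architecture; another
     * architecture would provide the same names with a different
     * implementation. */
    #ifndef _ASM_EXAMPLE_H
    #define _ASM_EXAMPLE_H

    #define EXAMPLE_PAGE_SHIFT 13                    /* 8 kB pages on this CPU */
    #define EXAMPLE_PAGE_SIZE  (1UL << EXAMPLE_PAGE_SHIFT)

    /* An inline function lets the common code use one name for an
     * operation whose implementation differs per CPU; the compiler
     * inlines it, so the abstraction costs nothing at run-time. */
    static inline unsigned long example_align_to_page(unsigned long addr)
    {
            return (addr + EXAMPLE_PAGE_SIZE - 1) & ~(EXAMPLE_PAGE_SIZE - 1);
    }

    #endif /* _ASM_EXAMPLE_H */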
In addition to the architecture-specific header files, each architecture has its own  
subdirectory in the arch directory in the system sources. This directory contains the  
set of rules to build the kernel sources for each architecture; what files to use, what  
special options the compiler needs for this architecture and so on. This approach  
allows a maximum of flexibility with regard to architecture-specific code, and does  
not impose any restrictions on what the code does.  
Even in the architecture-specific parts of the kernel, common concerns have re-  
sulted in a basic layout of the architecture-specific kernel sources. Most architectures  
tend to have the subdirectories kernel, mm and lib available. These subdirecto-  
ries contain code for generic kernel functionality (e.g., system call interface, trap  
handling), low-level memory management code (e.g., page table initialization and  
page fault handling) and optimized architecture-specific library code (e.g., memory  
copying, network checksum calculations) respectively.  
Additionally, many architectures have software floating point math emulation  
for hardware that does not have floating point available, or for hardware which  
needs software support for special cases like underflow and overflow situations.  
3 Software Interface Portability Issues  
One area of concern with regard to operating system portability that is often over-  
looked is the software interface side. The issue has traditionally been ignored,  
because the operating system designer was also in charge of designing the interface  
to the system and application software, and as such the operating system kernel  
could freely implement any interface it wanted.  
However, the computer industry  
has matured, and most modern op-  
erating systems have to live with  
the fact that interface standards al-  
ready exist, and cannot be ignored.  
Sometimes the standards are on a  
source level, allowing much freedom  
in actually implementing the stan-  
dard. In the UNIX world, the most  
well-known example of this is the  
POSIX (Portable Operating System  
Interface) standard [Ins96].  
[Figure 4: Multiple personalities. Processes P1, P2 and P3 use Personality Server 1 and Personality Server 2, which run on top of a common microkernel.]

More often, the standard is a binary compatibility standard that is required in order to run legacy applications designed for a previous operating system. In this case, the designer must very closely follow the exact interface details of the binary standard so that no legacy applications break when the system is upgraded to a new operating system kernel.
If no previous binary standard exists, the operating system designer is generally  
free to create any binary interface to the kernel, and in that sense a lack of a standard  
can be very liberating. However, even then the designer is generally limited by the  
hardware protection mechanisms in selecting the exact interface for the kernel, and  
as such the architecture to some degree always limits the interaction between user  
mode and the kernel.  
Another issue that is sometimes faced is the existence of multiple standards on the same platform, with subtly different4 and possibly even contradictory requirements. While this is fortunately rare, it may be one of the primary concerns for the system designers.
The issue of multiple interfaces to the operating system is usually handled by  
having so-called OS personalities. One personality handles the interfacing issues of  
one interface standard, and all personalities share the same common core operating  
system routines.  
In one sense, the concept of OS personality is very similar to the issue of porta-  
bility to the hardware, and the same virtual machine abstraction can be used to  
handle the case of multiple interfaces. The idealized kernel does not directly im-  
plement any of the interfaces, and the software portability issue is then a matter of  
mapping the wanted user interface on top of the idealized kernel image.  
One large difference between OS personalities and hardware platforms is gener-
ally that the kernel only runs on one physical platform at a time, yet it potentially  
has to handle multiple personalities concurrently. Microkernel operating systems  
generally handle this by the addition of a personality server for each personality  
(see Figure 4).  
In this kind of personality server approach, the overhead of such personality  
handling is quite high, though, and is one of the primary reasons for work on  
migrating threads (see for example [FL94]) and moving system services into the  
kernel address space (see [LHFL93]). More importantly, while the concept of a  
personality server allows a personality-independent view of the system, most real  
system usage tends to favor one primary personality. The other personalities are  
then mainly used for backwards compatibility or emulation of non-native systems.  
3.1 The implementation of the software interface  
Because the primary goal of Linux was never to act as a platform for emulating  
other systems, the Linux approach to the software interface personality problems is  
straightforward.  
Rather than supporting the notion of multiple personalities of equal importance,  
the Linux kernel interfaces tend to have just one primary personality for which the  
system is optimized. Having a primary personality allows the use of compile-time  
optimizations for the common case, and means that the internal structures can be  
4Sometimes the differences are not so very subtle.
optimized for the expected use of the system. The integration between personalities  
and the generic code is much tighter (Figure 5).  
Such a tight binding between the common code and the primary software inter-  
face does not preclude the use of secondary personalities, and should be seen not as  
a limitation but an optimization. Also, it should be noted that the primary person-  
ality can be adapted to the platform and is not inherent to the virtual kernel itself.  
On the Alpha platform, for example, the primary personality is the Digital UNIX  
compatible interface, while Linux on the Sparc has a primary personality that looks  
like the original Sun operating systems on the same hardware.  
In addition to the primary personality, any number of secondary personalities can exist. These secondary personalities can be dynamically loaded into the kernel at run-time as needed, as shown by the iBCS2-personality5 on the original Intel platform [Int91].

[Figure 5: Primary personality. Personalities P1, P2 and P3 sit on top of the generic code, with one of them being the tightly integrated primary personality.]

Note that these secondary personalities have the same access to all kernel resources as the primary personality, and the difference is not so much a conceptual one as a question of optimization: for which personality the internal data structures have been optimized. While the primary personality generally needs to do only minimal translation from internal kernel data structures for user mode requests, secondary personalities may have to translate user requests into a format suitable for the kernel.
In addition to the secondary personalities, there can also be tertiary personalities:  
environments that are not directly supported by the operating system kernel, but  
that can be emulated in user space, often with at least limited kernel support.  
This is how Linux/x86 is able to run DOS and Windows programs — through an  
external emulator and minimal support from the kernel to expose the hardware  
features needed for efficient emulation.
Similar emulators exist for running Macintosh, Commodore 64 and other binary  
formats. Another example of this is the ability to run Java programs through a  
Java interpreter or just-in-time (JIT) compiler [GM95]. In the case of Java (and  
5iBCS2 stands for intel Binary Compatibility Specification 2, and is the binary format generally  
used by non-Linux UNIX-like operating systems on the i386 platform.  
potentially other binary formats), the kernel will automatically notice a Java binary  
and start up the appropriate emulation technology for seamless integration with  
native programs.  
3.2 Personalities  
In the design of Linux, the basic approach has been to adhere to the standard UNIX  
semantics as closely as possible. However, as many UNIX haters will tell you, the  
concept of standard UNIX is to some degree a contradiction in terms.  
One of the problems with UNIX is that while the basic approach to how things  
work is the same, different vendors have done versions of UNIX that differ in the details. And through the quarter-century history of UNIX, quite a lot of differences have evolved.
While the UNIX community has started to address this very issue through vari-  
ous standards, and while the current state of UNIX is by no means as splintered as it  
was only five years ago, the situation is by no means perfect. As some form of UNIX  
is available for just about any computer hardware in existence, the UNIX standards  
obviously tend to be source standards rather than binary interface standards.  
In addition to varied hardware, various vendors have their own value-added  
features that they have specialized in and that they use to differentiate themselves
inside the basic UNIX community. That, together with 25 years worth of backwards  
compatibility for legacy applications tends to make the whole term “UNIX” a very  
fuzzy issue.  
In order to be really compatible with what people perceive as “UNIX”, it is  
thus not enough just to follow the standards. The standards are generally much  
more limited than any real UNIX is, and while getting POSIX-branded has been a favorite pastime of many operating systems (not all of them even remotely UNIX-like), the fact that they follow just one standard by no means indicates that the system behaves as the user expects it to behave.
As such, the Linux compatibility issues go far beyond just the paper standards,  
and rather than be a subset of all UNIXes the idea is to be a superset of the behavior  
people expect.  
While being a true superset of all UNIX behavior is not technically feasible (or  
necessarily even a good idea), Linux tries to distill the basic essence of UNIX, and  
implement all the relevant features people expect from such a system. Adherence  
to standards generally makes it easy to port programs that have been written with
the standards in mind, but Linux goes further than that.  
Now, being a superset of what people expect should in theory be simple. Sadly,  
that is not the case. The main problem is that while the basic UNIX operations  
are similar, the difference in details can often result in two systems doing the same thing in different and contradictory ways.
For example, while the standard interface for reading the contents of a directory is through the readdir() interface on a C library level, the actual implementation of that interface at a lower level can be very different on different UNIX operating systems. And at a binary compatibility level Linux would need to implement all of those different interfaces.

Considering at the same time that the current default Linux sources contain no less than 17 (sic!) different file systems, the duplication of code would be horrendous if all these file systems had to know about all the possible interfaces to read a directory. Obviously the act of reading the directory contents has to be handled with a virtual interface that can do the appropriate translations for all the different systems.
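As an illustration of such a translation layer, the sketch below lets the generic directory-reading code hand each entry to a personality-specific callback that packs it into whatever layout that personality expects. The types and names are invented for this example and are not the actual kernel interface.

    #include <stddef.h>

    /* Hypothetical sketch: one generic "emit a directory entry" callback
     * type, filled in differently for each personality / binary interface. */
    typedef int (*fill_dirent_fn)(void *user_buf, const char *name,
                                  size_t name_len, unsigned long inode_nr);

    struct dir_context {
            fill_dirent_fn fill;      /* personality-specific packing routine */
            void          *user_buf;  /* destination buffer in the caller     */
    };

    /* Every file system only ever calls this one helper per entry, so
     * adding a new binary interface means adding one callback rather
     * than touching each of the many file systems. */
    static int emit_entry(struct dir_context *ctx, const char *name,
                          size_t name_len, unsigned long inode_nr)
    {
            return ctx->fill(ctx->user_buf, name, name_len, inode_nr);
    }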
Another problem in compatibility is the different layouts of data structures and the different sizes of various operating system types that different UNIX operating systems support. This indicates that the basic kernel routines should be type-independent, in order to be able to handle different layouts. While this typing can be done at compile time (allowing for optimal code to be produced for the primary personality), the sources should still be able to accommodate any type layout without becoming a mess of conditional compilation.

Finally, the issue of various environment constants has to be handled. Error numbers change from system to system, as do the command numbers used by various system calls (notably ioctl() and fcntl()). All this has to be cleanly separated into architecture-specific header files that allow the kernel to share the basic sources yet differ in details.
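To make the point about environment constants concrete, the following is a sketch of what such per-architecture header fragments might look like. The values are illustrative only, although they follow the general pattern that Linux/Alpha adopts the native Digital UNIX style numbers while Linux/x86 keeps the traditional ones.

    /* Hypothetical <asm-i386/errno.h> fragment */
    #define EPERM   1   /* Operation not permitted */
    #define EAGAIN 11   /* Try again               */

    /* Hypothetical <asm-alpha/errno.h> fragment: same names, different
     * values, chosen to match the native binary interface:
     *
     * #define EPERM   1
     * #define EAGAIN 35
     */

    /* Common kernel code only ever uses the symbolic names, so it
     * compiles unchanged on either architecture. */
    static int example_check(int have_resource)
    {
            return have_resource ? 0 : -EAGAIN;
    }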
4 Hardware Portability Issues  
While software interface portability issues are important for any operating system,  
it is generally the hardware portability issues that most people tend to consider first  
when talking about portability. These hardware portability issues can roughly be  
divided into a few main areas.  
The first concern, and an area that is also seen in normal code outside the kernel,  
is the issue of the data representation of the CPU. Most of the CPU internal issues  
are hidden by the compiler: the programmer generally need not worry about the
number of registers of the CPU or the actual code generation. But other issues often  
need to be taken into account: the byte order and alignment of values in memory  
and the width of internal registers, for example.  
Most other portability concerns are problems that generally do not impact user-level programs: differences in the memory management and caches in a system, and possibly multiprocessing details. Similarly, the differences in I/O architectures should never be seen by the user, but they directly impact the operating system.
4.1 Data representation portability issues  
An obvious problem with any portability project is the CPU architecture of the  
hardware being ported to. In fact, so obvious is this problem that other problems  
like the I/O architecture are often overlooked completely.  
Much of the CPU portability is handled by using a portable language and having a compiler that is able to translate the code to various different architectures. That kind of portability is more or less required by any large project, and the real portability work is then in making sure that semantics that are not guaranteed by the language will port correctly across different platforms.
With a complete enough language the compiler could handle all the issues in  
CPU portability, and porting would be a matter of just recompiling everything to  
the new architecture. However, currently no such language exists that is suited for  
systems-level programming like creating an operating system.  
Linux, like most operating systems these days, is written in C, which is reason-  
ably portable, while allowing a close interaction with the architecture. That close  
interaction allows the programmer to program on a very low level if required, but  
the very fact that C allows that kind of low-level programming obviously also ex-  
poses the machine more than some other programming languages. As such, badly  
written C code by no means automatically works the way you would expect on  
another machine.  
During the porting of Linux to other architectures, there have also been situa-  
tions where a compiler error had to be fixed before work could proceed. This was  
true mainly on the Linux port to the Alpha, because the Alpha port of gcc was itself reasonably new when the Linux port was started.
4.1.1 Different data sizes
One of the more obvious differences between different CPU's is the size of the internal data registers and the virtual address space. Thus, for example, the Intel 80x86 has 32-bit registers and address space, while the registers and address space on an Alpha are 64 bits wide. This results in the standard data types in C being different on different architectures.
This difference in the basic data types is often something that is brought up as the main problem in portability. However, at least in the case of the Linux kernel itself, the sizes of the data types were mostly irrelevant, and while there was much nervousness about moving from a 32-bit architecture to a 64-bit architecture like the Alpha, it turned out not to be much of a problem in real life.
In normal portable code, the size of the underlying hardware data types generally  
does not matter. In many cases it is totally irrelevant whether a “long” in C is 32  
bits or 64, and most of the kernel code will use whatever size happens to be the native size for some particular machine.
However, in some cases data size does matter, usually because the size is defined  
by some external entity — this is usually the case in file systems where the data  
layout on the physical disk has a certain machine-independent standard associated  
with it, for example. Similarly, most of the networking code depends on being able  
to work with addresses of a certain size, and hardware interfaces to device controllers  
generally have a very rigid type structure independently of the hosting CPU type.  
Data sizes — implementation  
In Linux, the architecture-specific header files export a small set of specific types to  
be used when a piece of the operating system needs to use some specific type-size.  
Including the file <asm/types.h> makes the C types u8, u16 and u32 accessible  
when something specifically needs an 8-, 16- or 32-bit unsigned integer.
The corresponding signed data types (s8, s16 and s32) are also available, but  
use of them should be minimized as any object that needs to have a certain number  
of bits is generally not a normal integer.  
Due to standard C name space rules, these types are only available when compiling kernel code. If a specific type needs to be exposed to a user-space implementation, there exist versions of the same types with two underscores prepended to the type name (__u8, __u16 etc). These are meant to be used mainly in public header files where the normal names for these types cannot be exposed.
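As an illustration of how these fixed-size types are used, the fragment below sketches an on-disk structure whose layout must not depend on the host CPU. The structure itself is invented for this example and is not taken from any actual Linux file system.

    /* In kernel code these types come from <asm/types.h>, as described
     * above; they are repeated here only to keep the sketch self-contained. */
    typedef unsigned char  u8;
    typedef unsigned short u16;
    typedef unsigned int   u32;

    /* Hypothetical on-disk directory entry: the field widths are dictated
     * by the (imaginary) disk format, not by the native word size of the
     * CPU, so the fixed-size types are used instead of int or long. */
    struct example_disk_dirent {
            u32 inode;        /* inode number, always 32 bits on disk */
            u16 record_len;   /* length of this record in bytes       */
            u8  name_len;     /* length of the name that follows      */
            u8  file_type;    /* regular file, directory, ...         */
            /* the name itself follows directly after this header */
    };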
Unresolved issues  
While the current Linux kernel has proved to be portable to a wide variety of hardware, including different word sizes and endianness, the issue of data type portability is by no means completely handled.
One data size problem that is often overlooked is the size of the character set.  
While an 8-bit character is currently the standard, it totally ignores the issues of  
wide characters used mainly in Asia. The Linux kernel sidesteps the issue completely  
by not doing any character translation in the kernel itself, and leaving all such issues  
to user space.  
As far as the kernel is concerned, all data is a stream of 8-bit bytes, and the  
interpretation of those bytes (possibly by combining two or more bytes into a wider  
character) is left to the user programs. The standard console driver supports various  
character translations through the use of UNICODE and loadable font-sets, but the  
policy of setting up the fonts and translation tables is handled at a user level. The  
tools to do this both under a graphical user environment and on the system console  
already exist, so in that sense the issues are already solved.  
Another type-related issue is the problem of machines with word-sizes that are  
not powers of two or machines that are not byte-addressable. While those kinds  
of machines have existed in the past and are still in use in places, it is considered  
unlikely that such machines will be created in the foreseeable future (at least for a  
larger market where a Linux port would be interesting).  
For similar reasons the problems with integer bit-level representation have not been considered, and it is likely that machines that implement other integer representations than two's complement would require some portability work. The current portability efforts have not worried about such issues.
4.1.2 The impact of alignment restrictions and byte order
Much more complex than the size of the data are the memory access details of data on some specific CPU. Even when two CPU's share the same data types, the representation of that data in memory can be different and many CPU's restrict memory accesses to only aligned addresses.
For example, on the original Linux platform, the Intel 80x86, the CPU will allow a 32-bit word to be loaded and stored at any memory address regardless of alignment. The CPU is byte-addressable and little-endian, i.e. it stores the least significant byte of a multi-byte entity at the low address and the more significant bytes at higher addresses in memory (Figure 7).

[Figure 6: Alignment. Aligned accesses fall on the natural word boundaries (0, 4, 8, 12), while unaligned accesses straddle them.]

In contrast, the Sun Sparc line of computers is big-endian and memory accesses must be aligned: if a program tries to store a 32-bit word at a byte address that is not evenly divisible by four the CPU will trap (Figure 6). Most other modern architectures are also likely to trap on unaligned accesses.

These kinds of memory access details turned out to be much more of a problem than the size of the data itself. The size of registers generally does not matter for most code, and the cases in which it does matter are well-defined and generally not problematic to find.

[Figure 7: Endianness. The value 0x12345678 in memory at byte addresses 0 to 3 is stored as 0x12 0x34 0x56 0x78 on a big-endian machine and as 0x78 0x56 0x34 0x12 on a little-endian machine.]

With alignment, the problem is that while the compiler can make sure that all internal kernel data is correctly aligned, the kernel cannot assume alignment for external events, notably user space argument pointers.

For byte order, the problem is that the effects of byte-order are generally much more prevalent and harder to find than just the size of data. The size of any data structure can often easily be changed by just changing the type of the C structure that is used by the kernel to access the data. However, when it comes to byte order there is no equivalent change we can make on the C language level, so any byte order changes have to be handled by hand by the programmer rather than letting the compiler sort it out.
Alignment and byte order — implementation  
As with type sizes, both alignment restrictions and byte order issues have their own  
header files that allow the kernel virtual machine to use any alignment or byte order.  
The virtual machine is assumed to always trap on unaligned memory accesses, and  
thus if an access is known to be unaligned at compile time it can be handled by a  
simple inline function that does the extra work to load an unaligned datum as two  
separate aligned loads6.  
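A minimal sketch of such an inline helper is shown below, assuming a little-endian machine with 32-bit words; it is illustrative only and not the actual kernel implementation.

    /* Hypothetical helper: load a 32-bit value from a possibly unaligned
     * address using only aligned word loads (little-endian assumed). */
    static inline unsigned int load_u32_unaligned(unsigned long addr)
    {
            unsigned long aligned = addr & ~3UL;           /* round down to word */
            unsigned int shift = (unsigned int)(addr & 3) * 8;
            unsigned int lo = *(unsigned int *)aligned;
            unsigned int hi;

            if (shift == 0)
                    return lo;                             /* it was aligned */

            /* Combine the two aligned words that the datum straddles. */
            hi = *(unsigned int *)(aligned + 4);
            return (lo >> shift) | (hi << (32 - shift));
    }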
However, the alignment of many accesses cannot be known at compile time.  
To be safe, all these accesses would need to be done with the potentially complex  
unaligned access function, and that would impact performance negatively especially  
in cases where unaligned accesses are rare. This is especially true with many user  
level accesses, where the user-supplied pointer is almost always aligned, but can  
potentially be unaligned in some special cases.  
To handle those kinds of rarely unaligned cases efficiently, the kernel virtual ma-
chine assumes accesses are aligned and if an unaligned trap occurs, the trap handler  
will fix up the occasional unaligned case. Trapping will be much more expensive  
than handling the unaligned cases explicitly, but if the unaligned accesses are rare  
this works out well. To make sure the user is aware of the potential performance  
degradation the system will print out a warning when an unaligned fault has to be  
handled. Under normal circumstances these messages are never seen.  
Note that some code, notably the networking code, will actively try to align  
all the data structures it uses but will not guarantee that the structures are al-  
ways aligned. This is a prime example of code where the optimistic unaligned trap  
handling works efficiently.
When it comes to byte order, the virtual machine makes no assumptions about  
the native byte order of the machine. Indeed, in most cases byte order simply does  
not matter. However, in those cases where byte order matters (on-disk layout with a  
certain specified byte order, or networking code, for example), the functions exist to  
convert the native CPU byte order to and from both little-endian and big-endian7.  
6On hardware where alignment does not matter, the inline function obviously need not do any extra work, so there is no overhead for architectures that handle unaligned accesses in hardware.
7Little-endian byte order is used mainly by various file systems and device drivers that use the  
standard PC byte order, while big-endian byte order is used by the Internet networking protocols.  
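As an illustration of what such conversion helpers look like in use, the sketch below shows a possible little-endian CPU variant. The helper names follow the style of later kernel byte order headers and are given here only as an example of the idea, not as the exact interface.

    /* Hypothetical conversion helpers for a little-endian CPU; on a
     * big-endian CPU the le32 pair would do the byte swapping and the
     * be32 pair would be a no-op instead. */
    static inline unsigned int swab32(unsigned int x)
    {
            return (x >> 24) | ((x >> 8) & 0x0000ff00U) |
                   ((x << 8) & 0x00ff0000U) | (x << 24);
    }

    #define le32_to_cpu(x) (x)          /* on-disk PC byte order: no-op here */
    #define be32_to_cpu(x) swab32(x)    /* network byte order: swap needed   */

    /* Example use: read a 32-bit length field with a specified byte order. */
    static inline unsigned int disk_length(unsigned int raw_le32)
    {
            return le32_to_cpu(raw_le32);
    }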
Unresolved issues  
Especially when it comes to alignment all the issues are still not resolved completely.  
While handling the uncommon unaligned cases through trapping is efficient and
guarantees the correct behavior even in the presence of potentially unaligned data,  
this particular design decision may have to be re-thought.  
The problem with the current virtual machine design is that while it maps very
efficiently onto most hardware, there exist various hardware implementations that
do not conform to the hardware model expected by this virtual machine design.  
Most notably, the design requires that the hardware supports unaligned accesses  
through trapping, and that may not always be the case.  
While most hardware conforms to the expectations of this virtual machine model,  
there are cases where the CPU may handle unaligned memory accesses by simply  
ignoring the low address bits rather than trapping on the access. This is one example  
of how choosing the wrong virtual machine concept can result in problems that may  
require a redesign in the future.  
The other unresolved issue with regard to unaligned data is the state of the  
compiler support for the concept of unaligned accesses. While the current virtual  
machine depends on explicit programmer action to support unaligned accesses, a  
future version might depend on compiler directives to let the compiler automatically  
create the necessary code to handle the unaligned accesses. Such compiler support  
already exists to some degree, and can be used to hide the issue from the programmer
the same way the issues of type size have been hidden.  
4.2 Kernel memory management  
One of the most complex areas in the kernel in general, and with regard to portability  
in particular, is the virtual memory management.  
Memory is one of the most fundamental resources in the system, and as such  
the performance of the memory management layer is critical to the system. Making  
memory management efficient is thus of primary importance: not only do the rou-
tines have to be fast, they have to be clever too, sharing physical pages aggressively  
in order to get the most out of a system.  
The memory management code is also fraught with race conditions and dead-
locks, and sharing as much of the complex memory management code as possible
across architectures is desirable, so that multiple separate copies of this complexity
need not be maintained.
However, to make matters even worse, memory management is typically one of  
the areas where there are absolutely no hardware standards, and different CPU's
use very different means of mapping virtual addresses into physical memory pages.
As such, memory management is one area where traditionally most of the code  
has been very architecture-dependent, and only very little high-level code has been  
shared across architectures even though we would like to share a lot more.  
While the generic virtual to physical memory mapping can be seen as any function
that maps a virtual address into a physical address, extreme performance
requirements mean that the function has to be reasonably simple. In fact, on a low
level all current CPU's use an on-chip virtual memory mapping cache usually called
a TLB (Translation Lookaside Buffer), and in the end all virtual memory mapping
schemes translate into filling this cache with the appropriate translation information.
Figure 8: Two-level page table tree (the virtual address is split into an L1 index,
an L2 index and an offset in page)
While any mapping strategy is possible, current CPU’s tend to handle the virtual  
memory translations in three different ways: with page table trees, with hash tables
or with a pure software-fill TLB. But even when the basic approach is similar in
two architectures, the low-level details are often very different.
Depending on the memory management unit, the page tables may contain extra  
information aside from the necessary protection and translation information. Some  
architectures support special dirty and accessed bits, to be used by the VM routines  
to determine whether a user has written to a page or not, or whether the page has  
been recently accessed. Other architectures expect the operating system to keep  
track of this information by hand.  
For example, the original Linux platform, the Intel 80x86, has a two-level page  
table tree (see Figure 8), and implements both dirty and accessed bits in hardware.  
In contrast, while the Digital Alpha from a system software viewpoint also has a  
normal page table tree, on the Alpha the depth of the tree is three due to the larger  
virtual address space, and the Alpha lacks hardware support for dirty and accessed  
bits.  
On the other hand the PowerPC and some of the Sparc CPU's do not have a real
page table at all: they have a hash table that is used to look up the physical page
that corresponds to a virtual address (Figure 9). If the page cannot be found in the
hash table they trap to software. [Int93b]
Because the hashed memory mappings generally cannot fully describe the virtual
memory setup, they are more appropriately called in-memory extensions to the
on-chip TLB so as not to confuse them with full page tables.
Figure 9: Hashed in-memory TLB extension (the page index part of the virtual
address is run through a hash function to find the mapping)
Finally, MIPS CPU’s and the newest UltraSparcs from Sun do not have any  
architecture-specified page tables at all: they only have the on-chip TLB, and any
miss in the TLB will result in a software trap to refill the TLB8 (or handle a page
fault if no physical page is available for the offending virtual address). [Int93a]
4.2.1 Memory management through virtual page tables
As seen above, no standard way of mapping virtual memory exists. Indeed, the  
Sparc line of CPU's has used all three different mapping strategies described above
in different versions of the architecture. And yet, despite these fundamental
differences in physical hardware we would like to use as much common code as possible.
The way this is accomplished is by having a common virtual mapping scheme  
in the Linux kernel virtual machine (see chapter 1), and mapping that common  
memory management scheme onto the physical hardware. This allows us to share all  
memory management code over all supported architectures, and any improvements  
to the memory management are automatically supported on all platforms. The only
thing that the architecture-specific code needs to know about is the mapping from
the virtual machine onto the physical hardware.
8 This is also the case with the Digital Alpha architecture, but the Alpha architecture also
specifies a low-level software layer called PAL-code that makes it appear as if the hardware had
a three-level page table [Dig92, pp. 3–2].
While the principle of the virtual machine approach is simple to grasp, the details
are not as obvious. What mapping scheme should be used in order to make the
translation to the hardware as efficient as possible, yet be generic enough that the
scheme is useful as a superset of any realistic real hardware? If the virtual machine
is too limited, it cannot take advantage of large address spaces or special hardware
features.
Figure 10: Three-level page table tree (the virtual address is split into L1, L2 and
L3 indices and an offset in page)
There are also secondary concerns: the virtual machine memory mappings must be
memory-efficient, so that the mapping information does not take up a lot of physical memory
that could be used to better advantage for file system caching or running user pro-  
grams. Remember that not only does the kernel have to keep the page tables of  
the virtual machine in memory, the page tables of the physical machine also take  
up space.  
With all the requirements placed on the page tables of the virtual machine, the  
choice in the end is not difficult. It turns out that a multi-level page table tree is
an approach that can easily be expanded to match large virtual memory spaces by
just adding levels. It is flexible, simple, and reasonably efficient.
Not only is a multi-level page table a good generic answer to the page table  
problems, it can also often easily be made to map closely to the actual hardware,  
so that the mapping of the page tables from the virtual machine to the physical  
machine is easy to do. In fact, by choosing the right virtual machine page table  
setup, the same page tables can be used by both the kernel virtual machine and  
the physical memory management unit. In those cases the mapping overhead is  
obviously zero.  
Architectures which natively use multi-level page tables include the Intel 80x86,  
the Digital Alpha and the Motorola 68k family. The Intel hardware uses two-level  
page tables [CG87, pp. 465], while the 68k and the Alpha use three-level page tables  
[Dig92, pp. 3–1]. As three levels is enough to map on the order of 40–45 bits of  
virtual address space, and no current hardware supports more, that was the choice  
for the Linux kernel virtual machine (see Figure 10).  
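A schematic walk of such a three-level tree is sketched below. The index widths (10
bits per level with a 13-bit page offset, i.e. 43 bits of virtual address) and the C
types are assumptions chosen only for the example; the real kernel hides these
details behind inline functions and macros in the architecture-specific headers.

#include <stdint.h>
#include <stddef.h>
#include <stdio.h>

#define OFFSET_BITS 13                        /* assumed 8 KB pages          */
#define INDEX_BITS  10                        /* assumed 1024 entries/level  */
#define INDEX_MASK  ((1UL << INDEX_BITS) - 1)

typedef uint64_t pte_t;                       /* leaf translation entry      */

/* Walk the tree: top level -> middle level -> pte.  The two upper
 * levels hold pointers to the next level down; NULL means unmapped. */
static pte_t *lookup_pte(void ***top, uint64_t vaddr)
{
        size_t i1 = (vaddr >> (OFFSET_BITS + 2 * INDEX_BITS)) & INDEX_MASK;
        size_t i2 = (vaddr >> (OFFSET_BITS + INDEX_BITS)) & INDEX_MASK;
        size_t i3 = (vaddr >> OFFSET_BITS) & INDEX_MASK;

        void **mid = top[i1];
        if (mid == NULL)
                return NULL;
        pte_t *ptes = mid[i2];
        if (ptes == NULL)
                return NULL;
        return &ptes[i3];
}

int main(void)
{
        static pte_t  ptes[1 << INDEX_BITS];
        static void  *mid[1 << INDEX_BITS];
        static void **top[1 << INDEX_BITS];

        uint64_t vaddr = ((uint64_t)3 << 33) | ((uint64_t)5 << 23) |
                         ((uint64_t)7 << 13) | 42;
        ptes[7] = 0x1234;                     /* pretend translation entry   */
        mid[5]  = ptes;
        top[3]  = mid;

        printf("pte = %#llx\n", (unsigned long long)*lookup_pte(top, vaddr));
        return 0;
}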
It is worth noting that a virtual machine page table tree of level n can easily be
mapped to a physical page table of level n−1 by simply "hiding" one level inside
another (Figure 11). This is actually used by Linux/x86 to create two-level physical
page tables from the kernel virtual three-level page tables. The hiding approach
also allows the future expansion of the kernel virtual page tables without breaking
any current two- or three-level page table machines.
Figure 11: Hidden mid page table level (the virtual address uses only the L1 and
L3 indices plus the offset in page; the middle level is folded into another level)
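The folding trick itself fits in a few lines: on a two-level architecture the middle
level is declared to have a single entry, and the middle-level lookup simply
reinterprets the top-level entry. The sketch below follows the spirit of the kernel's
pgd/pmd naming, but the details are simplified for illustration.

#include <stdio.h>

typedef unsigned long pgd_t;
typedef unsigned long pmd_t;

/* The middle level is "hidden": it has exactly one entry per top-level
 * entry, so generic three-level code collapses into two-level accesses
 * at compile time. */
#define PTRS_PER_PMD 1

static inline pmd_t *pmd_offset(pgd_t *pgd_entry, unsigned long address)
{
        (void)address;                    /* only one pmd entry, index unused */
        return (pmd_t *)pgd_entry;        /* the pgd entry doubles as the pmd */
}

int main(void)
{
        pgd_t pgd_entry = 0xabc;
        pmd_t *pmd = pmd_offset(&pgd_entry, 0);
        printf("%lx\n", *pmd);            /* prints abc: same storage */
        return 0;
}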
Mapping a multi-level page ta-  
ble tree onto a pure TLB architec-  
ture like the MIPS or the UltraSparc is similarly trivial. The TLB refill code can  
just follow the page tables by hand, and because of the simplicity of the data struc-  
tures this refill code can generally be written in low-level assembly directly in the  
TLB miss trap handler for best performance.  
In the case of hashed in-memory TLB extensions like the PowerPC's, the mapping
is no longer as trivial, but there is a generic approach that works for these cases or
any other memory management setup. The algorithm is as follows:
1. On-chip TLB misses.  
2. Hardware does a hash table lookup, if mapping found: fill in  
TLB and restart the access.  
3. Mapping not found: hardware invokes kernel page fault han-  
dler.  
4. Kernel page fault handler does a kernel page table walk, and  
inserts the page found into the hash tables and returns.  
5. The memory access is re-tried, and this time the mapping is found  
in the hash table in stage 2.  
The above algorithm can clearly be used on any hardware, and is the fall-back  
position if a simpler mapping cannot be done directly.  
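A toy model of this fall-back path is sketched below; the hash function, the table
layout and the helper names are invented for the example, and a real implementation
naturally has to deal with hardware-defined entry formats and replacement.

#include <stdint.h>
#include <stdio.h>

#define HASH_SIZE 1024

struct hash_entry {
        uint64_t vpn;                 /* virtual page number               */
        uint64_t pfn;                 /* physical frame number, 0 == empty */
};

static struct hash_entry hash_table[HASH_SIZE];

static unsigned hash(uint64_t vpn)
{
        return (unsigned)((vpn ^ (vpn >> 10)) % HASH_SIZE);  /* arbitrary */
}

/* Stage 2: the lookup the hardware performs after an on-chip TLB miss. */
static int hash_lookup(uint64_t vpn, uint64_t *pfn)
{
        struct hash_entry *e = &hash_table[hash(vpn)];
        if (e->pfn != 0 && e->vpn == vpn) {
                *pfn = e->pfn;
                return 1;
        }
        return 0;
}

/* Stages 3-4: the kernel's fault handler walks its own page tables
 * (modelled here by a caller-supplied answer) and inserts the result
 * into the hash table, after which the access is retried. */
static void refill_from_page_tables(uint64_t vpn, uint64_t pfn)
{
        struct hash_entry *e = &hash_table[hash(vpn)];
        e->vpn = vpn;
        e->pfn = pfn;
}

int main(void)
{
        uint64_t pfn;
        if (!hash_lookup(42, &pfn))                    /* miss: fault      */
                refill_from_page_tables(42, 1234);
        if (hash_lookup(42, &pfn))                     /* retry: now a hit */
                printf("vpn 42 -> pfn %llu\n", (unsigned long long)pfn);
        return 0;
}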
4.2.2 Page table mapping coherence
While having separate virtual and physical page tables9 is good for portability, it
obviously also adds a new problem: the problem of keeping the two page tables
coherent. Some way of invalidating the hardware page tables when the virtual page
tables have changed is clearly necessary.
Even when the virtual page tables map directly to the physical page tables and  
there is just one copy of the page tables, the same coherency issue makes itself
felt in the form of TLB coherency. The internal CPU on-chip TLB generally still  
needs to be invalidated even though the external in-memory page table is updated  
correctly in synchronization with the virtual page tables.  
Because the on-chip TLB invalidation is an issue regardless of the actual layout  
of the physical page tables, we can generalize the TLB invalidate concept to cover  
the case of a separate hardware page table as well. Thus, as far as the virtual kernel  
is concerned, any hardware page table is considered only an extension of the TLB.  
This concept allows us to create an algorithm for changing the virtual page tables  
as follows:  
1. Change the virtual page tables.  
2. Invalidate the CPU on-chip TLB information.  
3. If the hardware page tables are separate from the virtual page  
tables, invalidate the hardware page tables.  
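In code, the protocol looks roughly like the sketch below; the helper names are
hypothetical stubs, and in a real port they would map onto the architecture's own
invalidation primitives.

#include <stdint.h>
#include <stdio.h>

typedef uint64_t pte_t;

static void flush_tlb_page(unsigned long address)
{
        /* placeholder for the CPU's TLB-invalidate instruction */
        printf("invalidate on-chip TLB entry for %#lx\n", address);
}

static void flush_hw_page_table(unsigned long address)
{
        /* placeholder; only needed when the hardware page tables (or a
         * hashed in-memory TLB extension) are a separate copy */
        printf("invalidate hardware page table entry for %#lx\n", address);
}

static void change_mapping(pte_t *virtual_pte, pte_t new_value,
                           unsigned long address)
{
        *virtual_pte = new_value;       /* 1. change the virtual page tables */
        flush_tlb_page(address);        /* 2. invalidate the on-chip TLB     */
        flush_hw_page_table(address);   /* 3. invalidate separate hw tables  */
}

int main(void)
{
        pte_t pte = 0;
        change_mapping(&pte, 0x2000 | 1, 0x400000UL);
        return 0;
}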
Again, this algorithm will work correctly with any hardware page tables. The
kernel needs to be careful about race conditions, making sure that the hardware page
table information is never used after the virtual page tables have been changed, but
as the changes to the page tables are well localized this is generally not a problem.
9 Note that even when the virtual and physical page tables can be mapped to use the same
layout, they should be considered conceptually totally separate.
The main problem is to find the right level of invalidation granularity: invali-  
dating too much hardware state can be very expensive especially with in-memory  
extensions of the TLB. This issue is closely linked with the issue of hardware support  
for multiple address spaces: when we change the virtual mapping of one process  
we should not necessarily have to invalidate all the hardware state associated with  
other running processes.  
All of the information that describes the kernel virtual page tables can be found  
in the architecture-dependent header file <asm/pgtable.h>. This header file de-  
scribes the actions that need to be taken when invalidating and modifying the virtual
page tables, and is the main architecture-specific file in the whole memory manage-  
ment tree apart from the initial setup code and the low-level trap handling.  
To give the reader some idea of the success of the virtual page table approach  
it could be noted that most of the current architectures can distill the required  
architecture-dependent page table information into a <asm/pgtable.h> file of only  
around 500 lines, much of which is comments or trivial one-line inline functions  
describing the hardware page tables.  
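To give a flavour of what such a file contains, a few helpers in the style of
<asm/pgtable.h> are sketched below. The names follow the kernel convention, but
the bit layout is invented for the example and differs from architecture to architecture.

#include <stdint.h>

typedef uint64_t pte_t;

/* Invented bit layout, for illustration only */
#define _PAGE_PRESENT  (1UL << 0)
#define _PAGE_DIRTY    (1UL << 1)
#define _PAGE_ACCESSED (1UL << 2)

static inline int   pte_present(pte_t pte) { return (pte & _PAGE_PRESENT) != 0; }
static inline int   pte_dirty(pte_t pte)   { return (pte & _PAGE_DIRTY) != 0; }
static inline pte_t pte_mkdirty(pte_t pte) { return pte | _PAGE_DIRTY; }
static inline pte_t pte_mkold(pte_t pte)   { return pte & ~_PAGE_ACCESSED; }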
4.3 Ensuring cache coherency and atomic operations  
One of the issues that the original Linux for the Intel line of computers did not have  
to concern itself with was CPU cache coherency. Traditionally computers based on  
the x86 line of CPUs have always done cache coherency in hardware, and neither  
the operating system nor the programs running on the CPU have needed to care  
about cache consistency.  
However, the Intel CPUs are more the exception than the norm in this regard,  
and on most other architectures the operating system will have to be aware of the  
caches on the system, and make sure that coherency is maintained in software.  
4.3.1 Instruction cache coherency
The simplest form of cache coherency problem arises from the common use of so-
called Harvard architectures [HP96, p. 55], where the instruction and data caches  
are separate as in Figure 12.  
While a separate instruction and data cache can result in less efficient cache use
due to imbalances in the usage patterns for caches, it also allows the chip designer  
to de-couple the instruction stream and the data stream, and thus allows for a  
simpler and more efficient implementation. Also, while a unified cache architecture
can theoretically be more efficient in cache utilization, a Harvard style split cache
does guarantee that the instruction cache does not get completely flushed by data  
accesses (and vice versa) and can as such work as a balancing factor.  
However, while the introduction of separate instruction and data caches is generally
beneficial for performance, the split caches also introduce the problem of coherency
between the two separate caches. The issue of instruction cache coherency is a
matter of how to keep the instruction cache up-to-date with respect to changes
done to the data that it caches.
Figure 12: Harvard caches (separate data and instruction caches, each with its own
TLB, sit between the ALU and the bus behind a common control and cache unit)
The most obvious solution is to make sure that the hardware maintains cache coherency,
and every time any memory data is changed the change is noted and any relevant  
instruction cache contents are updated. This is what the original Linux platform  
did, and is what Linux was originally designed for.  
While hardware support for instruction cache coherency makes life simple for the  
programmer and was required for compatibility reasons on the Intel 80x86 platform,  
it is not the solution generally preferred by engineers. As a result, most other modern  
platforms do not support this kind of hardware coherency.  
The main reason for not supporting hardware instruction cache coherency is
that the whole point of split instruction and data caches is to be able
to de-couple the instruction stream logic from the data stream logic, and having to
maintain coherency disturbs this de-coupling. In order to maintain coherency the
instruction cache essentially has to snoop all data traffic anyway10, and thus incurs
more complexity than necessary.
More importantly, modern programming languages and paradigms eschew the  
use of so-called self-modifying code, and as such at least in theory it should never  
happen that the instruction stream is dynamically altered. The engineering stand-  
point on the issue thus boils down to the question of whether to add complexity to
support something that should never happen in the first place. The answer becomes  
self-evident.  
10Or, more usually, the instruction and data caches have to maintain some exclusion policy that  
guarantees that the same data cannot be in both the data and the instruction cache.  
Instruction caches under Linux  
If modern programming languages and paradigms shun self-modifying code, why is  
the issue of instruction cache coherency a problem in the first place? Why cannot the  
operating system simply ignore the issue, and just refrain from using self-modifying  
code at all?  
The reason for some form of instruction cache coherency becomes obvious after  
some thought. Even though the kernel need not modify itself, the kernel will obvi-  
ously have to cause any running programs to be modified. Each time a new process  
is started and the image of the executable is loaded into memory the in-memory  
instruction image is modified, and some form of invalidation is required.  
As such, every time a new program is run, the operating system has to support  
a limited kind of code modifications, even though it obviously no longer is a matter  
of self-modifying code. Similarly, each time the kernel changes the virtual mapping  
of a process it has to invalidate the old instruction caches if there is any possibility  
of the new mapping being tainted by old cached data that is no longer valid.  
In addition to any mapping changes of a process, the normal way of implementing  
user-mode callbacks (signals, in UNIX terms) under Linux is by writing a short  
trampoline function on the stack of the process that is used for the return to kernel  
mode. While this is not the only possible implementation of user-mode callbacks, it  
is a very flexible one, and results in the kernel modifying the user process instruction  
stream and thus requires the instruction cache to be made coherent with the memory  
image.  
Finally, while self-modifying code is frowned upon, no operating system should  
enforce its own set of morals upon the user programs unless that enforcement is  
required by security issues. As such, self-modifying code in user space should at  
least be possible, and may require kernel support. It should be noted that the kernel  
in general cannot be expected to know when processes modify their own instruction  
stream, and as such this support should not be automatic, but the user should have  
some way of requesting kernel help if the hardware does not in itself support any  
user-level interface to support instruction cache coherency.  
However, despite all these concerns, the issue of instruction cache coherency is  
not overly problematic in the end. There are two reasons why instruction cache  
coherency can easily and efficiently be handled by software. The first reason is
inherent in the way instruction caches work: the instruction cache never modifies  
the instruction stream. Because the instruction cache is never modified, maintain-  
ing coherency is a simple matter of invalidating the current instruction cache and  
reading in the new modified copy. There are no conflicting write-back issues or any  
need for exclusive access to the instruction cache.  
The other reason that instruction cache coherency is reasonably simple to handle  
is the fact that the circumstances under which an instruction cache invalidate is  
needed are well-defined and easy to pinpoint. Also, while an invalidate operation
can be costly, the circumstances under which it is needed are rare enough that it is
seldom a major performance issue.
Because the occasions where instruction cache invalidates are needed coincide  
with virtual memory management changes, the Linux kernel itself need never con-  
cern itself with the issues of flushing the instruction cache. If the architecture re-  
quires special instruction flush sequences, they can be hidden in the TLB invalidate  
code itself, and as such the kernel virtual machine can mostly ignore the issue.  
Similarly, the case where instruction cache flushes may be needed due to user-  
level callbacks can be handled within the callback code itself, as that is by nature  
architecture-specific in any case. So while the architecture-specific code needs to be  
aware of the instruction cache issues, they do not need to be exported to the virtual  
machine level.  
4.3.2 Data cache coherency
While instruction cache coherency can be reasonably easily handled, the issues of  
coherency within the data cache itself can often be very important. While the  
coherency issues with regard to instruction caches stemmed from the fact that they  
were de-coupled from the normal data cache, the coherency issues with data caches  
can be the result of two different circumstances: multiple users of the data and
cache aliases. In both cases the same data exists in multiple places and thus data  
coherency issues arise.  
Even though the coherency issues are basically the same as with separate in-  
struction caches, the matter is much more complicated due to the fact that data
caches are not static entities: their contents change. As a result a simple invalidate
of the stale data is no longer acceptable — the coherency issue is complicated by  
the need to have sane semantics for the case where there are multiple writers to the  
same location.  
Additionally, while instruction caches and data caches are not supposed to over-  
lap under normal circumstances, the same is by no means true when it comes to data  
cache duplication. There are no guidelines that allow us to assume that data cache  
duplication would be an exceptional and rare occurrence, which further complicates  
the issue.  
The normal case of data cache coherency problems arises from having two sepa-  
rate entities that both access the data. If one or more of the entities cache data for  
performance reasons, the fact that the data can exist in multiple places introduces  
the problem of coherency when the data is later modified.  
It should be noted that having multiple entities that access the same data does  
not necessarily imply multiprocessing per se. A common concern even in uni-  
processor environments is the cache coherency issues that are introduced by the  
use of I/O devices that use DMA (Direct Memory Access) to read or write the data  
directly from memory.  
While most systems support cache coherency in hardware, there are again ex-  
ceptions to the rule. The architectures that do not support hardware coherency  
require that any common data always be accessed through the main memory sub-  
system, and the operating system has to make sure that all caches are up-to-date  
and flushed to memory before any DMA activity occurs.  
4.3.3 The case against virtual data caches
In contrast with the multiple user case, a cache alias occurs when the same location  
is found in multiple different areas in the same cache, leading to the same coherency
issues as if there were multiple entities.  
How do such cache aliases occur? A caching algorithm that places the same  
data in different locations can be considered an utterly broken caching algorithm,
because not only does it result in the aliasing problem, it can also obviously result  
in bad cache utilization if the same area is cached over and over again.  
While such a caching strategy can with good reason be considered stupid in  
the extreme, it is something that does exist in current hardware, and as such is  
something that the kernel needs to know about. The reason for these cache aliases  
is generally a virtually indexed data cache.  
One of the most basic issues in caching is how the cache is indexed, i.e. how the  
cache lookup is done. The two main cache types are virtually and physically indexed  
caches. As the names imply, the difference is whether the virtual or the physical
address of the cached data is used for cache indexing. [HP96, pp. 423]  
In a physically indexed cache, the cached location is looked up using the physical  
address of the memory location, and as such one physical memory address will  
always be found through the same index and no aliasing can occur.  
In contrast, a virtually indexed cache will use the virtual address as seen by  
the running program as the cache lookup index. This has some obvious benefits  
for CPU design: the physical address need not even be computed before the cache  
indexing can take place, and cache lookups can be made faster and can happen in  
parallel with the virtual to physical address translation.  
However, the use of virtual caches directly leads to aliasing problems, as different
virtual addresses can point to the same physical memory location. As a result, the
different virtual addresses can result in different cache locations being used to cache
the same memory, resulting in the above-mentioned aliasing problem [IKWS92].
It should be noted that virtual indexing and the resulting aliasing is not a prob-  
lem for instruction caches: while the aliasing may result in slightly lower utilization  
of the cache, this is often balanced by the simpler lookup, and because instruc-  
tion caches do not modify their contents the worst effects of the aliasing problem
never materialize. In fact, virtual instruction caches can often make the issue of  
instruction cache invalidation (see 4.3.1) simpler for the operating system.  
However, in the case of data caches, the aliasing issues either require extra anti-
aliasing hardware or have to be handled by proper support in the operating system.
As the reason for the virtual indexing in the first place was simplicity of hardware,  
it is generally left to the operating system to clean up the mess.  
The most trivial example of a dual virtual mapping is the case when a page  
is available both in kernel space and in user space. As the kernel address space  
generally contains all physical memory, this happens with any user page. Happily,  
only in a few cases do these mappings actually conflict in the cache.  
The reason that the double kernel-user mapping normally does not lead to alias-  
ing problems is that in most cases a user process gets full control of any page it uses  
and the kernel never reads or writes to that page through the kernel address. So  
while the mapping exists both in user space and kernel space, this aliasing is not a  
problem. Care must be taken that the kernel always uses the user virtual address  
for any data copies, but as that is the normal mode of operation anyway the issue  
is moot (see later about memory management issues in Chapter 6).  
Even though most common memory mappings do not result in aliasing problems,  
there are a few cases where this is not true, notably with memory mapped files.  
When a user process asks to map a file into user memory, the user-accessible page  
is usually also accessed by the kernel through the page cache that is maintained for  
the mapped file.  
As all user programs are loaded into user space as a memory mapping, this dual  
mapping of the same physical page onto multiple virtual addresses is extremely  
common. However, the overwhelming majority of all mappings are what in UNIX  
parlance is known as private mappings, i.e. the mappings are read-only mappings  
with regard to the common shared page, and if a process writes to the page, the  
virtual memory subsystem will force a trap and a new copy of the page to be  
allocated (so-called COW, or copy-on-write behavior).  
With private mappings, there is still only one entity (the kernel) that can change  
the physical page associated with an aliased mapping, and UNIX memory mapping  
semantics do not require such private mappings to be kept 100% synchronized. The  
aliasing concern still requires that the kernel be careful the first time the kernel copy  
of the page has been read, but that is a simple matter of making sure that the data  
exists in main memory, and not only in the kernel virtual cache.  
In short, for private mappings the data cache aliasing concerns with regard to  
cache coherency are not overly problematic. Because of the potential duplication  
of the same physical location in multiple cache locations, the virtual cache should  
still be larger than an equivalent physically indexed cache for the same performance  
[WHL93].  
What makes virtually indexed data caches really bad for UNIX (and thus Linux)  
are the semantics of shared memory mappings which allow different processes to map
in the same physical page in multiple locations, and allow full read and write accesses  
to the page. This is mainly used for fast sharing of data between applications, and  
is used by the X server with the MIT shared memory extensions and by processes  
that want to concurrently access and change the same data in memory, usually  
databases.  
Because processes are supposed to be able to directly access the shared memory  
without the intervention of the operating system, these shared memory mapping  
accesses are thus not under any direct control from the OS itself, and the op-
erating system generally cannot enforce cache coherency on these kinds of shared
mappings. The approach normally taken by Linux is to disable caching for such
pages altogether, which incurs a very noticeable performance overhead when using
shared mappings.
A possible approach to allow logical shared memory while avoiding disabling the  
cache is to use the page tables to serialize the access to the shared pages. The kernel  
can protect any shared pages and allow only one process at a time to modify them.  
This way the kernel controls any shared resources and can make sure no aliasing  
ever happens by flushing the caches as appropriate when a new process needs to  
access the shared page. Thus the memory is not really ever physically shared, but  
the operating system makes it appear that way to user processes.  
Using the virtual memory subsystem to serialize access potentially also allows  
distributed shared memory machines with this kind of software-imposed cache co-  
herency mechanism (so-called shared virtual memory, [AMMR92]). However, the  
coherency overhead is huge compared to hardware coherency support, and per-  
formance is generally not acceptable for many traditional uses of shared memory  
(inter-thread lock handling, for example). In non-distributed systems it is generally  
better to disable all caching, as performance in that case will at least be predictable  
under dierent circumstances. For this reason no current Linux port uses page table  
shared memory serialization.  
In general, virtual caches for data should be avoided as far as possible. The  
problems with shared memory areas make such caches inherently slower to handle,  
and while most applications do not need coherent shared memory, those that do need  
it get penalized. Note that because of cache aliases, virtual caches are also likely to  
incur more cache misses than physical caches — even without any coherency issues  
caused by writes.  
It should be noted that most architectures that have virtual caches have only  
small virtual caches close to the CPU, where the latency of the address translation  
makes more of a difference. They then mostly have larger second- or third-level
caches that are physically indexed.  
What makes the approach of using virtual data caches even more suspect is  
that the address translation latency of physical caches can often be hidden with  
judicious cache design. One approach used by many physically indexed caches is  
to have a first-level cache that is indexed through the low-order bits of the virtual  
address that do not change during the virtual address translation — the page offset.
This approach limits the size of the cache11, but because the index does not change  
during the address translation, the cache lookup and address translation can be  
done in parallel [HP96, p. 425].  
11 Maximum cache size being limited to pagesize × associativity, i.e. the CPU's page size times
the associativity of the cache.
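As a concrete (hypothetical) illustration of that limit: with 4 KB pages and a 4-way
set-associative cache, such a design can index at most 4 KB × 4 = 16 KB of
first-level cache.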
4.4 Multiprocessor issues  
While multiprocessing used to be the exclusive domain of supercomputing and large  
servers, advances in technology have meant that computers with more than one  
processor are becoming increasingly common. That has made these multiprocessor  
machines much more attractive to the Linux developers. At the same time, Linux  
itself has gained wider acceptance, which in turn makes Linux more attractive even  
to people with very high-end multiprocessor hardware.  
As a result, Linux has during the last two years been actively developed for these
multiprocessor machines, and the issue of "portability" has not only implied changing
the underlying CPU architecture, but also the possibility to run Linux on different
CPU configurations.
Figure 13: Centralized shared-memory multiprocessor (several CPU's, each with its
own cache, share a common memory and I/O subsystem over one bus)
There are generally two separate types of modern multiprocessing environments,
the decisive issue being how the main memory is accessed. The memory bus is
generally the main limit in any multiprocessing environment (after all, it is often
the limit even in uniprocessor environments), and the architecture of the memory
subsystem is thus the major design issue.
The more common architecture is the centralized memory multiprocessor, where
all processors share the same memory equally as in Figure 13. This is generally
done with a common memory and I/O bus, and because all the processors have
direct and equal access to all of the resources of the machine it is generally called
SMP (Symmetric Multi-Processing).
Figure 14: Distributed memory multiprocessor (each node has its own CPU, memory
and I/O, and the nodes are connected to each other by an interconnect)
While the shared bus approach is simple and thus cheap, it has the problem of not
scaling very well to a large
number of CPU’s. With many CPU’s on the same bus, the bus is quickly saturated  
and the approach generally does not scale to more than a dozen CPU’s.  
Because of the scaling concerns, the other main type of multiprocessor has a  
distributed main memory, with each CPU having a local memory (see Figure 14).  
The memory bus is no longer a common bottle-neck, and the CPU’s are more  
independent of each other. As a result, the performance of such a machine scales to
a much higher number of processors. The common term for these machines is MPP
(Massively Parallel Processing).  
It should be noted that physical distribution of memory does not necessarily  
imply that all the processors have separate logical address spaces. Many MPP-type  
machines provide for inter-processor communication through a globally addressable  
memory space. These kinds of distributed shared-memory machines are often called  
NUMAs for Non-Uniform Memory Access; all the memory is globally accessible, but  
depending on how far away the CPU is from the memory the access time differs. In
that context SMP machines are generally called UMA (Uniform Memory Access).  
Because the two approaches to multiprocessing result in fundamentally different
behavior in memory accesses, the Linux kernel has two totally separate strategies  
for handling these machines.  
4.4.1 Symmetric multi-processing
While porting an operating system from one CPU architecture to another at first  
glance appears totally different from the issue of running the operating system on
multiple CPU's, the issues concerned are really basically the same. In both cases
it is a matter of making the operating system run efficiently on a new hardware
configuration.  
To some degree even the problems are similar: the issues of initial bootstrapping  
of the system and the hardware details with regard to inter-processor signalling are  
in some respects similar to a port to another architecture. Similarly, the cache co-  
herency issues in an SMP environment could be seen as just an extension of the  
cache coherency issues that can be seen with virtual caches and DMA on unipro-  
cessor systems.  
But even though there are similarities between porting an operating system  
to a new CPU architecture and a “port” from a single CPU to multiple CPU’s,  
multiprocessing does add a few unique problems. The main issues are  
• Data structure consistency; the need to protect internal kernel data structures
from simultaneous accesses by multiple CPU's, generally through some
synchronizing primitive.
• Allowing application programs to take full advantage of the presence of multiple
processors.
In addition to the above, symmetric multiprocessing also exposes some other  
caches that are internal to the CPU, notably the virtual memory address translation  
cache (commonly known as a Translation Lookaside Buffer or TLB). Keeping the
TLB synchronized with the current virtual memory mapping is trivial on a single-  
CPU system, because there is only one CPU that changes the translation, and it  
does so under very controlled circumstances.  
With multiple CPU’s, the virtual memory mapping changes have to be broadcast  
to any other CPU that may be using that mapping, and the CPU’s have to invalidate  
the mapping before the original CPU can be sure that it can safely drop the old  
map.  
Finally, with multiple CPU’s sharing the same memory, the ordering of memory  
accesses is important. With caches between the CPU’s and the system memory, the  
order in which any update from one CPU arrives at another CPU is by no means  
obvious. Especially in high-speed implementations where writes can be delayed on  
one CPU and the reading CPU may do aggressive pre-fetching of memory data are  
ordering issues very important.  
Implementing an SMP kernel  
The most obvious problem with any multiprocessing environment is the need to  
protect internal data structures from being changed by multiple different processors
in an unpredictable manner. What makes the problem subtler is the fact that  
while unprotected data structures can be a major problem, the actual symptoms  
are generally rarely seen and hard to reproduce.  
The basic problem is that updating non-trivial data structures cannot generally  
be done atomically, and some way of protecting other processors so that they do  
not see partially done updates is needed. While in some cases the problem can be  
minimized through using appropriate data structures where the final update can  
be done with one atomic update (so-called self-locking or lock-free data structures  
[PLJ94]), the generic answer is to protect the update through the use of locking.  
However, using locks to avoid race conditions on data structures has its own set  
of problems — deadlocks and performance. Fine granularity locking can noticeably  
impact performance if the operating system spends much time acquiring the nec-  
essary locks, and care must be taken that there is no possibility for circular lock  
dependencies resulting in deadlocks.  
To make matters worse the issues of deadlocks and race conditions are often not  
well understood even by long-time programmers. Current programming languages  
and paradigms have resulted in most programmers having a very linear view of what  
happens in any given program, and concurrency is generally not well understood.  
As a result, any synchronization method must be simple enough to understand, so
that subtle pitfalls do not make kernel programming too hard.
The current Linux kernel approach to kernel locks has been to adopt the simplis-
tic approach of just one kernel lock. This approach is similar to the non-reentrancy
rule of traditional uniprocessor UNIX — where a process running in kernel mode  
is never pre-empted by another process. The single-lock approach simply extends  
this requirement to multiple CPU’s. Each time a kernel service is requested, the  
kernel lock must be acquired before entry into the kernel, and as a result internal  
data structure integrity is trivially ensured.  
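The idea can be modelled in user space with a single mutex, as in the sketch below;
the lock_kernel()/unlock_kernel() names mirror the kernel's, but the system call
dispatcher is purely illustrative.

#include <pthread.h>
#include <stdio.h>

static pthread_mutex_t kernel_lock = PTHREAD_MUTEX_INITIALIZER;

static void lock_kernel(void)   { pthread_mutex_lock(&kernel_lock); }
static void unlock_kernel(void) { pthread_mutex_unlock(&kernel_lock); }

static long do_syscall(long nr)           /* stand-in for the real work */
{
        return nr;
}

/* Every "system call" takes the one big lock on entry, so no two CPUs
 * ever execute kernel code at the same time and internal data
 * structures are trivially protected. */
static long system_call(long nr)
{
        long ret;
        lock_kernel();
        ret = do_syscall(nr);
        unlock_kernel();
        return ret;
}

int main(void)
{
        printf("%ld\n", system_call(42));
        return 0;
}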
While the one lock approach is sufficient for ensuring data structure integrity
and also does not lead to deadlock situations, it does have some obvious scalability  
problems. The approach is fine for a limited number of CPU’s with computing loads  
that spend most of their time in user space with only very occasional excursions  
into kernel mode, but it scales very badly and has bad behavior for kernel-intensive  
jobs.  
As a result, the current Linux kernel is not expected to be truly useful in SMP  
environments of more than 2-4 CPU’s, and one of the current concerns is what the  
best balance of locking is in the kernel. Work is in progress to split the one lock  
first into a few smaller locks, and later pinpoint possible contention spots in those.  
It is hoped that reasonable scalability can be achieved without excessive numbers  
of locks, especially as the current goal for good scalability is for reasonably few  
processors. While bus-based hardware can scale to more processors, cost-efficiency
seems to be much higher with a small number of processors, and four processors
currently seems to be a good goal. For really high-end hardware the solution is generally not SMP in
any case (see chapter 4.4.2).  
4.4.2 Massively parallel systems
While small-scale SMP shared uniform memory machines are becoming common,  
they do not offer the scalability required for really high performance computing.
With tens or hundreds of processors, massively parallel processing systems fill that  
niche, and Linux is increasingly attractive for these kinds of systems too.  
However, the same hardware scaling considerations that make a uniform global  
memory access machine impractical for many processors also make the software  
locking approach to SMP impractical. While software could in theory scale better,  
practical concerns about deadlocks and lock granularity (see the previous chapter)  
make it questionable whether such an approach is viable.  
The software solution to the scaling problem is exactly the same as the hardware  
solution was: de-coupling the nodes from each other. Instead of running the same  
operating system image on each and every node, each node gets a local copy of the  
operating system the same way each node has its own memory store. The different
operating system kernels communicate with each other over the high-performance  
internal network, but can at the same time be seen as independent entities.  
Note that SMP and MPP are in no way mutually exclusive approaches, and  
indeed one of the most interesting approaches is to combine the two into a hybrid  
system. Such a hybrid system would consist of multiple independent nodes con-  
nected together with a high-performance network, where each node would contain  
a small number of processors in an SMP configuration.  
While such a hybrid system running Linux does not yet exist, there actually  
are two separate high-performance MPP projects using Linux as the base operating  
system. The first project was Beowulf [BSS+95].  
The Beowulf project at NASA CESDIS (Center of Excellence in Space Data  
& Information Sciences) research center aims to create cheap high-performance  
computing hardware through the use of common off-the-shelf hardware, and using
Linux as the operating system. The nodes are built up from normal PC hardware,  
connected together with 10 and 100Mbps ethernets.  
The second project is being developed by the CAP research project at the Aus-  
tralian National University in Canberra [TMSW96]. In contrast to the Beowulf  
project, this AP1000+ system is a decidedly non-off-the-shelf MPP made by Fu-
jitsu, using 50MHz SuperSparc CPU's. Rather than using off-the-shelf networking,
the AP1000+ has a powerful internal network with very low latencies and high  
bandwidth. While the maximum configuration is able to incorporate 1024 of these  
CPU’s in one machine, Linux has so far been run on only 16, 32 and 64-cell machines.  
Despite the very different hardware base and the differing aims of the projects,
the basic approach has been similar in both projects. While some support for the  
MPP hardware has been added to the basic kernel, most of the distribution is done  
in user space through standard parallel programming interfaces (Parallel Virtual  
Machine — PVM and Message Passing Interface — MPI). This way most of the  
real complexity of a distributed environment can be handled at an appropriate level,  
leaving the kernel itself reasonably simple.  
Note that this approach is fundamentally different from distributed systems
based on microkernels. Rather than handling the complexity of distribution at  
a low level in the kernel, the distribution is done almost entirely in user space  
with only minimal kernel support. This allows for much greater flexibility, as the  
different nodes need not necessarily run the same operating system or even have
similar hardware.  
4.5 Device drivers: accessing the I/O bus  
We have so far only concentrated our attention on the CPU subsystem of the porta-  
bility issue. To some degree the CPU subsystem is the most visible part of any ar-  
chitecture, and the I/O characteristics of the hardware are often overlooked. Thus,  
for example, we talk about Linux/Alpha and Linux/x86 — corresponding to the  
CPU architectures Linux has been ported to — but never about Linux/sbus and  
Linux/PCI and so on.  
One reason that the I/O architecture is often overlooked is that it should never  
be seen directly by any application programs. While people are used to having to  
have separate versions of their binaries for different CPU architectures, the same is
not true when it comes to I/O. The I/O details are expected to be totally hidden by
the operating system, and while differences in the I/O architecture are expected to
make a difference in performance, they are not expected to show up in any other way.
Yet, despite this lack of user attention, the I/O details can easily be the largest
part of any portability effort. While compilers and other virtual machine ab-
stractions can hide the CPU details quite effectively, there are generally no "I/O-
compilers” available to hide the details of various devices. As a result, much of the  
device driver portability work has to be done by hand, from scratch.  
What often makes the issue even thornier is the lack of documentation on the  
details of the I/O architectures. While all the relevant details of a new CPU archi-  
tecture are generally readily available in just one or two architecture manuals from  
the manufacturer of the CPU, the same is seldom true of the I/O details. Often the  
person doing the port is forced to gather the information from many separate (and  
often incomplete) sources.  
4.5.1 Proprietary buses
Making the issue of I/O architecture portability harder is not so much the general  
complexity of the architecture, as the sheer overwhelming volume of details per-  
taining to the I/O subsystem. Only a few years ago, every major manufacturer had  
their own I/O bus standards (and many had several different standards), the same
way they had their own CPU architecture.  
But while the CPU is one well-defined entity, the I/O architecture is not only  
defined by the details of actually driving the physical bus, it is also defined by the  
devices connected to the bus. Often, potentially thousands of different devices with
no common interface have to be supported. And unlike the CPU, most of those
devices tend to have failure modes that have to be handled and recovered from.  
In fact, so overwhelming is the amount of device driver code that almost exactly  
half of the current Linux kernel sources is composed of just device drivers. While  
few of the drivers on their own are overly complex, they are numerous, and a large
portion of the porting effort has been involved with device driver work.
Making all the problems worse is the fact that common code shared between  
drivers is usually very scant. There exist various common driver layers, but much
of the common driver layer is not so much concerned with the driver itself as with  
the common interface all drivers have to show to the rest of the world.  
One example of a generic driver layer is the UNIX tty layer, the layer concerned  
with serial input devices like a keyboard or a character terminal. This layer contains  
generic code that is required for any tty device — character translation, session  
handling and actions to be taken on exceptional events like a hang-up of the line.  
However, all of this common code is concerned with issues that are needed for
generic UNIX compatibility, not so much the low-level details of the actual device.  
Happily, some real standards do exist in the jungle of I/O handling. The main  
such standard is the SCSI (Small Computer Systems Interface), which offers a com-
mon hardware bus standard used by various devices ranging from magnetic and  
optical disks to scanners and tape drives. While the SCSI standard does not spec-  
ify the actual programming interface as seen by the operating system, the common  
standard allows for common code to be shared across dierent SCSI adapter drivers.  
4.5.2 PCI — the emerging standard
While the situation with respect to I/O architectures is very complex, and very  
little common code can generally be shared between different I/O subsystems, the
last few years have introduced a new standard that may put an end to at least some  
of the chaos. That new standard is the PCI (Peripheral Component Interconnect)  
bus standard introduced by Intel and other computer companies.  
What is interesting about PCI is not the technology itself, but rather the po-  
tential for a reasonably common standard. Already, PCI has been revolutionary in  
that it has been able to attract many different vendors to use the same standard
rather than continue to use proprietary standards not shared with anybody else.  
Though the PCI standard initially was embraced mainly by PC manufacturers, the  
large body of available PCI-compliant hardware that it resulted in has made PCI  
an extremely attractive proposition to other vendors too.  
Currently, PCI is used by ix86-, PowerPC- and Alpha-based systems, and
lately Sun has announced that PCI-based Sparc workstations are going to be avail-  
able in early 1997. What is so impressive about this list is that these four major  
architectures not only happen to be the primary Linux platforms, they also account  
for a large portion of the modern CPU market. As such, PCI may reduce the  
amount of I/O architecture portability work very noticeably.  
It should be noted that while a move to PCI makes the portability issues of the  
actual I/O bus interface easier, it is by no means the answer to all problems. Most  
notably the very large PC hardware market not only made PCI a viable standard  
in the first place, it has also resulted in unprecedented numbers of dierent adapter  
cards being available for PCI. As such, while there is a common bus that avoids  
one level of portability problems, it certainly has not taken away the wide variety  
of adapters, and thus the need for device driver development in the future.  
5 Example Architectures  
While Linux currently supports seven architectures to some degree or other (Digital  
Alpha, Intel 80x86, Motorola 68k, MIPS, PowerPC, Sparc, Acorn ARM), three  
of those architectures stand out by having had a fundamental impact on Linux  
portability.  
The original Intel platform was obviously the base of any Linux work: it offers
cheap and ubiquitous hardware and is still by far the most popular Linux platform  
and the one most development is done on.  
At the same time, the Digital Alpha platform stands out as a modern 64-bit  
hardware base, and being the first target of a Linux port. Together with the Sparc  
architecture, these two architectures have defined most of the Linux portability work  
so far.  
5.1 The Alpha  
The Digital Alpha architecture is one of the most recent RISC designs in the in-  
dustry, and, thanks to extensive support from Digital Corporation, was the first  
non-Intel architecture supported portably by Linux.  
What makes the Alpha such an intriguing platform for Linux is not only the  
high performance of Alpha-based computers, but also the fact that an Alpha port  
was the ideal second platform from a portability standpoint. Not only is the Alpha  
architecture fully 64-bit both from an arithmetic and a memory management stand-
point, it is a very aggressive RISC architecture that looks very different from the
original platform for Linux, the Intel x86 [Dig94].  
What made the Alpha an even better porting target was the fact that while  
having a markedly different CPU architecture, Digital was one of the first to embrace
PCI as the standard for their I/O bus hardware even for non-PC markets. As a  
result the initial port did not have to overly concern itself with device driver issues,  
allowing the project to concentrate on the design of the CPU architecture itself.  
The Linux/Alpha porting project actually started out as two separate and in-  
dependent projects: one pilot project inside Digital itself to prove that Linux could  
be run on Alpha hardware, while at the same time the author was given access to  
Digital Alpha machines for another port.  
While that early porting eort inside Digital was concerned with making Linux  
run on the Alpha (somewhat akin to the original Motorola 68k Linux project), work  
at the Department of Computer Science of the University of Helsinki concentrated  
on creating the basis of a truly portable operating system. As such, the Digital  
project was working on a 32-bit port on the assumption that it would be easier,  
while the author worked on making Linux port cleanly to a full 64-bit architecture  
and take full advantage of the hardware provided.  
As it turned out, the 32-bit Linux/Alpha port was bootable at an earlier date,  
but the portable 64-bit clean rewrite was not much behind and quickly proved itself  
to be more versatile. While the Digital Linux efforts have not stopped, they have
abandoned the early 32-bit pilot project and are now working on making Linux
available on different Alpha hardware and providing the required environment for
the kernel. An SMP effort is also underway to take advantage of the multiprocessing
capabilities of the Alpha.
The main changes introduced by the Alpha port to Linux have been:
• Re-organization of the kernel into architecture-independent and architecture-
specific parts. While the original kernel had used inline assembly statements
embedded in the sources, portability required a clean and well-defined interface
for code that needed to access specific hardware features.
• Virtual memory management re-write. The wider address space required a
deeper kernel page table tree, and together with generic portability concerns
resulted in the current memory management implementation.
• Kernel type cleanliness. The wider data types required major cleaning up
of the kernel internal data structures. While the original Linux kernel had
worked with the C types "int" and "long" interchangeably and used both of
them for 32-bit values, the Alpha required a clean separation of the different
data types used in the kernel.
• Strict alignment of kernel data structures.
In addition, the Alpha port also exposed some unportable assumptions in the  
kernel itself. For example, the Linux kernel originally assumed that byte accesses  
were atomic, and in some cases used byte values for flagging asynchronous events.  
On the Alpha, the smallest atomic update is to a 32-bit word, and all byte accesses  
are done with multiple instructions. As a result, code that depended on atomic  
updates broke subtly.  
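The following constructed example (not code taken from the kernel) shows the kind of pattern that broke: because each byte store on the Alpha is compiled into a load of the surrounding word, a byte insert and a store back, two flags packed into neighbouring bytes can overwrite each other's updates when one is set from interrupt context:

    /* Constructed example of the unsafe pattern on the Alpha. */
    struct event_flags {
            volatile char rx_pending;   /* set from an interrupt handler */
            volatile char tx_pending;   /* set from process context      */
    };

    /*
     * Each assignment below is a read-modify-write of the word that
     * contains both bytes, so one of the two updates can be lost if
     * they race.  Declaring the flags as int-sized values (or using a
     * real atomic primitive) avoids the problem.
     */
    void mark_rx(struct event_flags *f) { f->rx_pending = 1; }
    void mark_tx(struct event_flags *f) { f->tx_pending = 1; }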
There were other subtle issues involved with the Linux port to the Alpha. For example, the absence of byte access instructions is not visible at the C language level, as the compiler will automatically generate the correct instruction sequences, but for performance reasons contiguous byte accesses should be shunned in favor of optimized 64-bit code. Rewriting some basic library functionality as optimized assembly made a noticeable difference in speed for some uses.
The Linux/Alpha port also introduced the issue of native binary compatibility  
with another operating system. Being binary compatible not only gave Linux/Alpha  
access to Digital UNIX binaries, it also helped speed up the system bootstrap process  
by being able to test out early versions of the operating system against binaries that  
were known to work.  
The issues of different data sizes on the Alpha due to the 64-bit architecture,
together with the binary compatibility requirements, made it important that all the  
kernel interfaces were able to handle different type sizes transparently. One obvious
example of this was the kernel access to user mode data.  
Before the Linux/Alpha port, writing a data item to user mode was done with  
the functions put_fs_byte(), put_fs_word() and put_fs_long() — for 8, 16 and
32-bit entities respectively. This mapped well to the original platform data types.  
The strange naming of these functions came from the use of the fs segment register  
on the Intel platform to access user mode memory.  
On the Alpha, this obviously did not work very well. Not only was the naming  
non-intuitive on the Alpha as there is no fs segment register, but a data type that  
was 16 or 32-bit on the Intel platform might be 32 or 64-bit on an Alpha. As such,  
the whole concept of fixed size data had to be scrapped, and the interface to access  
user mode was renamed as just put_user(). This macro automatically determines
the right data size at compile time and uses the appropriate method for the access.  
These kinds of changes made it possible to use the appropriate types on different
systems without having to worry about how the actual access happens.  
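A minimal sketch of how such a size-generic accessor can be built follows; this illustrates the idea only, and the __put_user_1/2/4/8 helper names are assumptions made for the example, not the kernel's actual internals. The sizeof of the target selects the right low-level routine at compile time:

    /* Illustrative per-size helpers (names invented for this sketch). */
    extern int __put_user_1(unsigned char  val, void *addr);
    extern int __put_user_2(unsigned short val, void *addr);
    extern int __put_user_4(unsigned int   val, void *addr);
    extern int __put_user_8(unsigned long  val, void *addr);

    /* The width of *ptr picks the helper; the comparison is a compile-
     * time constant, so the unused branches disappear entirely.        */
    #define put_user(x, ptr)                                            \
            (sizeof(*(ptr)) == 1 ? __put_user_1((x), (ptr)) :           \
             sizeof(*(ptr)) == 2 ? __put_user_2((x), (ptr)) :           \
             sizeof(*(ptr)) == 4 ? __put_user_4((x), (ptr)) :           \
                                   __put_user_8((x), (ptr)))

A caller simply writes put_user(value, &user_struct->field) and gets the correct access width, whether the field is 16 bits wide on one architecture or 64 bits wide on another.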
Linux/Alpha, being one of the older ports of Linux, is a very stable platform, and  
is used widely in various environments that need superior floating point performance  
or need access to 64-bit hardware. The port has been entirely merged with the  
standard Linux sources.  
5.2 The Sparc line of computers  
While the Alpha architecture laid the groundwork for Linux kernel portability, what cemented it was the port to the Sparc line of computers led by David Miller.
Two things make the Sparc architecture so interesting in this context. First,  
the hardware is readily available: especially older Sparc hardware can generally be  
found in various dark corners of any university. In that sense the availability of  
hardware was to some degree similar to the availability of standard PC hardware.  
The other thing that makes the Sparc an interesting platform is that the Sparc  
architecture is one of the oldest RISC architectures, and has gone through several  
generations at the hands of several different manufacturers. And while the different versions of the architecture look similar to user mode programs, the differences on a system programming level are startling. As such the Sparc port in one sense
contained several “micro-ports”, and brought up many issues that had not been  
problematic on the previous two architectures [SPA94a, SPA94b, Ros93].  
The Sparc, together with the Motorola 680x0 port, also introduced a different byte order for Linux. While both the Intel 80x86 and the Digital Alpha architectures are little-endian, the Sparc and the m68k are big-endian. Also, not only have different versions of Sparc CPUs used a variety of virtual memory management techniques, they have also used both virtual and physical caches, as well as mixtures of the two.
While this wild variety makes porting difficult, it also makes the Sparc an excellent testing environment for the kernel virtual machine. The fact that the Linux/Sparc port not only worked, but was actually very successful, is a good indication that the Linux virtual machine is generic enough to handle any reasonable hardware implementation.
The biggest changes introduced by the Sparc port to Linux were in memory  
management and byte order. The byte order changes were mostly related to file  
system code — notably the native ext2 file system and the MS-DOS file system.  
The memory management changes were mainly caused by performance and cache  
invalidation issues. Both the Intel 80x86 platform and the Alpha handle data cache  
coherency issues in hardware, making the Sparc port the first to tackle this issue in  
software.  
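As a small illustration of the byte order work (a sketch that assumes little-endian on-disk fields, matching the PC origin of the ext2 and MS-DOS file systems; the helper name is invented here), file system code on a big-endian Sparc must convert every multi-byte on-disk field explicitly rather than reading it as a native integer:

    /* Convert a 32-bit little-endian on-disk field to host byte order.
     * Assembled byte by byte, so the same code is correct on little-
     * endian i386/Alpha and big-endian Sparc/m68k alike (assumes a
     * 32-bit unsigned int).                                            */
    static unsigned int le32_to_host(const unsigned char *p)
    {
            return  (unsigned int)p[0]        |
                   ((unsigned int)p[1] << 8)  |
                   ((unsigned int)p[2] << 16) |
                   ((unsigned int)p[3] << 24);
    }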
Also, while the virtual machine three-level page tables turned out to be a very  
good design, performance issues required that the page table TLB invalidation be  
made much more granular for the Sparc. The original Linux invalidated the whole  
TLB whenever any invalidation was needed, and that was later extended to inval-  
idate single page translations in some circumstances. However, on the Sparc this  
two-level granularity was inadequate.  
The reason for needing finer granularity invalidates on the Sparc is the size and  
complexity of the TLB. On the Intel 80x86 and the Alpha, the TLB consists of only  
the on-chip TLB, which can be invalidated cheaply and re-filled quickly.
As such, while some care is needed, TLB invalidation generally is not much of a  
performance issue on these architectures. In contrast, on some Sparc hardware the  
CPU uses a hash table in memory to expand the on-chip TLB, and invalidating this  
is a much more costly operation.  
As a result, an in-memory TLB extension as on the Sparc and the PowerPC  
requires much finer granularity control over TLB invalidates. The in-memory TLB  
is much larger — thousands or even millions of entries as compared to an on-chip
TLB of hundreds of entries at the most.  
Similar granularity concerns faced the cache coherency code, resulting in the  
following interfaces for flushing caches and TLB entries:  
- flush_cache_all(), flush_tlb_all(): flush all entries under exceptional circumstances (at boot-up and when the kernel mapping itself changes).

- flush_cache_mm(), flush_tlb_mm(): flush a whole process cache and TLB when the process undergoes radical change like executing a new binary image.

- flush_cache_range(), flush_tlb_range(): flush a range of pages in a process, for use when a part of the process map changes due to a new mapping.

- flush_cache_page(), flush_tlb_page(): flush a single page.
On platforms where cache coherency is handled by the hardware, the cache flush operations will be no-ops and will be removed by the compiler. Similarly, on most CPUs with just an on-chip TLB, the TLB range invalidate will generally invalidate the whole process TLB — the more complex partial invalidation simply is not needed (nor supported by the hardware).
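As a sketch of how the interface stays free on hardware that needs none of it (illustrative definitions, not the actual architecture headers), a port whose data caches are kept coherent by hardware can define the cache flushes away entirely, while a simple port can map the finer-grained TLB calls onto a full flush:

    /* Illustrative definitions for a port with hardware-coherent data
     * caches: the cache flushes expand to nothing and are optimized
     * away, so generic code can call them unconditionally.             */
    #define flush_cache_all()                  do { } while (0)
    #define flush_cache_mm(mm)                 do { } while (0)
    #define flush_cache_range(mm, start, end)  do { } while (0)
    #define flush_cache_page(vma, addr)        do { } while (0)

    /* The TLB flushes still do real work; a port with only a small
     * on-chip TLB can fall back to flushing everything.                */
    extern void flush_tlb_all(void);
    #define flush_tlb_mm(mm)                   flush_tlb_all()
    #define flush_tlb_range(mm, start, end)    flush_tlb_all()
    #define flush_tlb_page(vma, addr)          flush_tlb_all()

Generic kernel code can then call the full interface unconditionally, and only the Sparc-style implementations pay for the extra granularity.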
The Linux/Sparc port is now, like the Alpha port, one of the most stable platforms for Linux, and is used in various real-world situations12.
12As of November 1996 most of the Linux-specific mailing list traffic has been handled by a Sparc sun4m machine running Linux at vger.rutgers.edu.
6 Future Work  
While the portability phase of Linux development has slowly been calming down, and most of the issues have been resolved, not all of them are by any means finalized.
The basic Linux kernel has been clearly shown to be very portable, as proven by multiple ports to varied platforms. The different platforms supported have different CPUs, memory interfaces and I/O buses, and often have their own software binary standard requirements as well. All of these issues have generally been resolved already, and in that sense the porting effort has clearly been a complete success.
However, there still exist implementation issues within the Linux kernel that  
directly impact portability concerns. These are often cross-cutting issues where  
performance requirements in other areas require new hardware interfaces and thus  
extensions to the virtual machine.  
One issue in particular is the use of the virtual memory management unit to do  
data copying in high-bandwidth applications. With memory speeds often being the  
bottle-neck for high performance computing, the use of virtual memory re-mapping  
can be a way to avoid physically copying data.  
While the current Linux memory management has proved to be an efficient and
portable design, optimization issues like the above lead to new concerns. When  
is such an optimization worthwhile, and when should it be discarded in favor of  
a traditional physical memory copy? These issues depend not only on how much  
of a bottle-neck the memory copy itself is, but also on how expensive the virtual  
mapping change can be. That in turn depends on the virtual memory hardware,  
but also on the design of the cache subsystem where a virtual cache can result in  
new cache coherency issues that have so far not been a concern.  
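To make the trade-off concrete, the following constructed sketch (all names and cost parameters are assumptions for illustration, not measurements or kernel code) expresses the break-even condition: remapping wins only when the per-page cost of changing the mapping, flushing the TLB and keeping the caches coherent is smaller than the cost of copying the bytes of that page:

    /* Constructed cost model; the numbers would have to be measured
     * per architecture.                                                */
    struct copy_costs {
            unsigned long cycles_per_byte_copied;  /* physical memcpy   */
            unsigned long cycles_per_page_remap;   /* PTE update + TLB
                                                      and cache flushes */
    };

    static int remap_is_worthwhile(const struct copy_costs *c,
                                   unsigned long bytes,
                                   unsigned long page_size)
    {
            /* Only whole, page-aligned buffers can be remapped at all. */
            if (bytes % page_size != 0)
                    return 0;

            return (bytes / page_size) * c->cycles_per_page_remap <
                   bytes * c->cycles_per_byte_copied;
    }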
Another area that needs work is the hardware abstraction of multiprocessing in  
the Linux kernel. The current SMP effort only supports the Intel 80x86 and the Sparc platforms, and further work in this area is required. What is the impact of
hardware with no memory coherency on SMP? What is the proper locking gran-  
ularity, and what is the impact of hardware with a larger overhead for SMP-safe  
locks?  
Also, while the basic design has been proven to be very portable, we expect to spend much of next year on low-level expansion of the ports, and on getting the different architectures up to the same level of support that is currently offered on the i386, Alpha and Sparc platforms. The portability work is by no means done.
Finally, it should be noted that the portability work has in no way adversely affected the original Linux/Intel version. Not only is the current portable Linux totally compatible with the original Linux, but the porting effort has also forced a cleanup of the kernel code. As such the result is a cleaner kernel with much better abstraction of the hardware side.
Also, the good design of the virtual machine has not only avoided a negative performance impact, but performance has actually improved. This is mainly due to other performance-related efforts done in parallel with the portability work, but the better abstractions required by portability have also resulted in better interfaces that have been easier to optimize.
The original three design goals — simplicity, efficiency, and compatibility — have been joined by a fourth one, portability.
References  
[AMMR92] R. Ananthanarayanan, Sathis Menon, Ajay Mohindra, and Umakishore  
Ramachandran. Experiences in integrating distributed shared memory  
with virtual memory management. In ACM Operating Systems Review,  
volume 26, pages 4–26, July 1992.  
[App96] Apple Computer, Inc. MkLinux: Linux for Power Macintosh, 1996. http://www.mklinux.apple.com/, checked January 24, 1997.
[Bäc96] Ralf Bächle. The Linux/MIPS FAQ, 1996. http://www.unix-ag.uni-siegen.de/~nils/mips/linux-mips.html, checked January 26, 1997.
[BGG+91] Allan Bricker, Michel Gien, Marc Guillemont, Jim Lipkis, Douglas  
Orr, and Marc Rozier. A new look at microkernel-based UNIX op-  
erating systems: Lessons in performance and compatibility. In Pro-  
ceedings of the EurOpen Spring 1991 Conference, Norway, May 1991.  
ftp://ftp.chorus.fr/pub/reports/CS-TR-91-7.ps.Z, checked Decem-  
ber 31, 1996.  
[BSS+95] Donald J. Becker, Thomas Sterling, Daniel Savarese, John E. Dorband, Udaya A. Ranawak, and Charles V. Packer. Beowulf: a parallel workstation for scientific computation. In Proceedings, International Conference on Parallel Processing, USA, 1995. checked December 31, 1996.
[CG87] John H. Crawford and Patrick P. Gelsinger. Programming the 80386. SYBEX, USA, 1987.
[Dig92] Digital Equipment Corporation, Maynard, Massachusetts. Alpha Architecture Reference Manual, Part III: DEC OSF/1 Alpha Software, 1992.
[Dig94] Digital Equipment Corporation, Maynard, Massachusetts. Alpha AXP Architecture Handbook, 1994.
[Dou96] Cort Dougan. Linux for PowerPC, 1996. http://www.cs.nmt.edu/~cort/linuxppc/, checked January 26, 1997.
[FL94] Bryan Ford and Jay Lepreau. Evolving Mach 3.0 to a migrating thread model. In Proceedings of the Winter 1994 USENIX Conference, pages 97–114, USA, January 1994. http://www.usenix.org/publications/library/proceedings/sf94/ford.html, checked December 31, 1996.
[GC94] Berny Goodheart and James Cox. The Magic Garden Explained: the Internals of UNIX System V Release 4. Prentice Hall, Australia, 1994.
[GM95] James Gosling and Henry McGilton. The Java Language Environment. Sun Microsystems Computer Company, Mountain View, CA, USA, October 1995. ftp://ftp.javasoft.com/docs/whitepaper.A4.ps.tar.Z, checked January 24, 1997.
[Hoh96] Michael Hohmuth. Linux-Emulation auf einem Mikrokern [Linux emulation on a microkernel], 1996. Checked January 24, 1997.
[HP96] John L. Hennessy and David A. Patterson. Computer Architecture: A Quantitative Approach. Morgan Kaufmann Publishers, Inc., San Francisco, California, second edition, 1996.
[IKWS92] Jon Inouye, Ravindranath Konuru, Jonathan Walpole, and Bart Sears. The effects of virtually addressed caches on virtual memory design and performance. In ACM Operating Systems Review, volume 26, pages 14–29, October 1992.
[Ins96] The Institute of Electrical and Electronic Engineers, Inc., USA. Information technology—Portable Operating System Interface (POSIX)—Part 1: System Application Program Interface (API) [C Language], 1996.
[Int91] Intel Corporation, USA. Intel386 Family Binary Compatibility Specification 2, 1991.
[Int93a] Integrated Device Technology, Inc., Santa Clara, California. IDT79R4600 ORION Processor Hardware User’s Manual version 1.0, October 1993.
[Int93b] International Business Machines Corporation, USA. PowerPC Architecture, first edition, May 1993.
[Kin96] Russell King. ARM Linux, 1996. http://whirligig.ecs.soton.ac.uk/~rmk92/armlinux.html, checked January 26, 1997.
[Law97] Chris Lawrence. The Linux/m68k Home Pages, 1997. http:…, checked January 24, 1997.
[LHFL93] Jay Lepreau, Mike Hibler, Bryan Ford, and Jeffrey Law. In-kernel servers on Mach 3.0: Implementation and performance. In Proceedings of the Third USENIX Mach Symposium, pages 39–55, USA, April 1993. …mach3/lepreau.html, checked December 31, 1996.
[PLJ94] Sundeep Prakash, Yann Hang Lee, and Theodore Johnson. A non-blocking algorithm for shared queues using compare-and-swap. In IEEE Transactions on Computers, volume 43, pages 548–559, May 1994.
[Ros93] Ross Technology, Inc., Austin, Texas. SPARC RISC User’s Guide HyperSparc Edition, 1993.
[SPA94a] SPARC Technology Business, USA. microSPARC User’s Manual, 1994.
[SPA94b] SPARC Technology Business, USA. SuperSPARC and MultiCache Controller User’s Manual, 1994.
[Tan92] Andrew S. Tanenbaum. Modern Operating Systems. Prentice-Hall International, Inc., Englewood Cliffs, New Jersey, 1992.
[TMSW96] Andrew Tridgell, Paul Mackerras, David Sitsky, and David Walsh.  
AP/Linux — a modern OS for the AP1000+. In 6th Parallel Com-  
puting Workshop, pages p2–c–1 – p2–c–9, Japan, November 1996.  
checked December 31, 1996.  
[Tor93]  
Linus Torvalds. Linux—a case study in free software development. In  
Proceedings of the NLUUG autumn conference 1993, pages 133–140,  
Ede, The Netherlands, November 1993.  
[WHL93] C. Eric Wu, Yarsun Hsu, and Yew-Huey Liu. A quantitative evaluation of cache types for high-performance computer systems. In IEEE Transactions on Computers, volume 42, pages 1154–1162, November 1993.