Welcome to ShortScience.org!
[link]
The Fast File System (FFS) improved the read and write throughput of the original Unix file system by 10x by

1. increasing the block size,
2. dividing blocks into fragments, and
3. performing smarter allocation.

The original Unix file system, dubbed "the old file system", divided disk drives into partitions and loaded a file system onto each partition. The file system included a superblock containing metadata, a linked list of free data blocks known as the free list, and an inode for every file. Notably, the file system was composed of 512-byte blocks; no more than 512 bytes could be transferred from the disk at once. Moreover, the file system had poor data locality: files were often sprayed across the disk, requiring lots of random disk accesses. The "new file system" improved performance by increasing the block size to any power of two at least as large as 4096 bytes. To handle small files efficiently and avoid the wasted space of internal fragmentation, blocks were further divided into fragments at least as large as the disk sector size.
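As a rough illustration of why fragments matter, the sketch below computes the space wasted when a small file is rounded up to a whole allocation unit. The block and fragment sizes are chosen for the example, not taken from the paper's hardware:

```python
# Sketch of internal fragmentation with and without fragments.
BLOCK_SIZE = 4096      # bytes per block (assumed for illustration)
FRAG_SIZE = 1024       # bytes per fragment (>= disk sector size)

def wasted_bytes(file_size, alloc_unit):
    """Bytes wasted when file_size is rounded up to a multiple of alloc_unit."""
    return (-file_size) % alloc_unit

# A 500-byte file wastes most of a 4096-byte block...
assert wasted_bytes(500, BLOCK_SIZE) == 3596
# ...but only part of a single 1024-byte fragment.
assert wasted_bytes(500, FRAG_SIZE) == 524
```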
[link]
Edgar Codd proposed the relational model in 1970. As opposed to the navigational data models that came before it, the relational model boasted data independence: the ability for data storage and access methods to change independently of applications. Some worried that data independence necessitated poor performance. System R was one of the first relational databases and proved that the relational model could be implemented efficiently. System R development proceeded in three phases.

Phase 0 (1974-1975) was a single-user PL/I interpreter prototype that processed a subset of SQL (e.g. no joins) using the XRM access method. The Phase 0 prototype was always intended to be thrown away. Instead, the focus was on tuning the user interface, SQL. User studies and interviews were performed to increase the usability and understandability of SQL. Every tuple in the database was labelled with a TID which contained a page number. Each tuple contained pointers into separate domains, and inversions existed to map domain values to TIDs. The Phase 0 query optimizer aimed to minimize the number of fetched tuples and would perform tricks like TID intersection to evaluate conjunctions. The prototype also introduced the design that the system catalog should be stored as relations. Phase 0 brought about the following lessons:

1. The optimizer should consider more than the cost of fetching tuples. It should also take into account the costs of TID manipulation, data fetching, etc.
2. The number of I/Os would have been a better cost metric than the number of tuples fetched. This would have also exposed the deficiency of the XRM access method.
3. The Phase 0 optimizer was CPU bound! This encouraged the later optimizer to minimize a weighted cost of CPU and I/O.
4. SQL joins are very important.
5. The query optimizer was complicated; more care should be given to simpler and more common queries.

Phase 1 (1976-1977) included the implementation of a full-blown multi-user relational database.
Phase 1 was divided into two pieces:

1. The Relational Data System (RDS) was an optimizing SQL processor responsible for query optimization.
2. The Research Storage System (RSS) was the access method that replaced XRM and was responsible for things like locking and logging.

Users could query System R using interactive queries or by embedding SQL queries in PL/I or Cobol. A preprocessor would compile the embedded SQL queries into an access module using a repository of hand-compiled fragments. Of course, the compiled query plan could be invalidated over time; for example, the plan could use an index which is later dropped. Thus, each query's dependencies were recorded in the system catalog, and queries were recompiled when their dependencies were invalidated. Unlike the XRM, the RSS stored data directly in the tuples. This meant that certain column values were stored redundantly, but an entire row could be read in a single I/O. The RSS also supported B+ tree indexes, tuple links, index scans, full table scans, link scans, tuple nested loop joins, index nested loop joins, and sort merge joins. The query optimizer minimized a weighted sum of RSS calls and I/Os using a dynamic programming approach. It avoided some of the TID list intersection tricks that the Phase 0 optimizer used. Views were stored as parse trees and merged back into the SQL queries that used them. Updates were only allowed on single-table views. Views were the atomic unit of authorization, using a grant/revoke mechanism. System R used a combination of logging and shadow pages to implement recovery. During recovery, pages were restored to their old shadow pages, and the log was processed backwards. Since Phase 1 was a multi-user database, it introduced multiple-granularity locking in the form of intention locks.
Originally, System R used predicate locking, but this was abandoned because (1) it was difficult to check predicates for disjointness, (2) predicates were sometimes falsely marked as overlapping, and (3) predicate locking broke the abstraction barrier of the RSS.

Phase 2 was a two-year period in which System R was evaluated. Users generally enjoyed the uniformity of SQL, and their recommendations led to the introduction of EXISTS, LIKE, prepared statements, and outer joins. The query optimizer was evaluated assuming that data was uniformly distributed and that all columns were independent. Shadow pages led to poor locality, extra bookkeeping, and semi-expensive page swapping. System R provided read uncommitted, read committed, and fully serializable transactions. Read uncommitted wasn't implemented as fast as it should have been. Read committed had more overhead than expected. Serializable transactions ended up being the most commonly used.

#### Commentary

System R introduced a bevy of influential and perennial ideas in the field of databases. Unix introduced a bevy of influential and perennial ideas in the field of operating systems. It's no coincidence that there are a striking number of system design principles that System R and Unix---as presented in The Unix Time-Sharing System---share:

1. Unified Abstractions. Unix unified the file and I/O device abstractions into a single interface. System R unified the table and catalog/metadata APIs into a single interface (i.e. everything is a relation). System R also unified SQL as the query language used for ad-hoc queries, program-embedded queries, and view definitions. System R's decision to use relations to represent the catalog can also be seen as a form of dogfooding.
2. Simple is Better. Unix started as Ken Thompson's pet project, an effort to make development simpler and more enjoyable. Unix's simplicity stuck and was one of its selling points. Similarly, System R spent a considerable amount of effort simplifying the SQL interface to make it as easy to use as possible. If a system's interface is too complicated, nobody will use it.
3. Performance isn't Everything. Thompson and Ritchie implemented Unix in C instead of assembly, despite the fact that the kernel size increased by one third, because C greatly improved the readability and maintainability of the system. Similarly, the System R paper comments that the relational model may never exceed the performance of a hand-optimized navigational database, but the abstraction it provides is worth the cost. Somewhat comically, today's compilers and query optimizers are so good that compiled C is likely smaller than hand-written assembly and optimizer-chosen plans are likely faster than hand-optimized ones. This is an example of another systems principle: favor higher-level declarative APIs which leave room for optimization.
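To make the optimizer discussion above concrete, here is a toy sketch of a weighted cost metric in the spirit of System R's (the weight and the plan statistics are invented for illustration, not taken from the paper):

```python
# Toy version of a weighted cost metric: page fetches plus a weighted
# count of RSS calls. W is an assumed weight, not System R's actual value.
W = 0.5

def cost(rss_calls, page_fetches, w=W):
    """Weighted cost of a plan: I/Os plus weighted CPU (RSS call) work."""
    return page_fetches + w * rss_calls

# A full table scan touches every page but is cheap per page; an index
# scan fetches far fewer pages at the price of extra calls per probe.
full_scan = cost(rss_calls=10_000, page_fetches=1_000)
index_scan = cost(rss_calls=200, page_fetches=120)
assert index_scan < full_scan
```

A metric like this captures lesson 3 from Phase 0: counting only fetched tuples (or only I/Os) would miss plans that are CPU bound.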
[link]
This paper introduces the B-link tree: a variant of a B+ tree (the paper says B* tree, but they mean B+ tree) which allows for concurrent searches, insertions, and deletions. Operations on a B-link tree lock at most three nodes at any given time, and searches are completely lock free.
#### Storage Model
We assume that every node of a B+ tree or B-link tree is stored on disk. Threads read nodes from disk into memory, modify them in memory, and then write them back to disk. Threads can also lock a specific node which will block other threads trying to acquire a lock on the same node. However, we'll also see that threads performing a search will not acquire locks at all and may read a node which is locked by another thread.
#### Concurrent B+ Trees
Let's see what goes wrong when we concurrently search and insert into a B+ tree without any form of concurrency control. Consider a fragment of a B+ tree shown below with two nodes x and y. Imagine a thread is searching for the value 2 and reads the pointer to y from x.
```
+-------+
x | 5 | |
+-------+
/ |
+-------+ ...
y | 1 | 2 |
+-------+
```
Then, another thread inserts the value 3 which reorganizes the tree like this:
```
+-------+
x | 2 | 5 |
+-------+
/ | \
+-------+ +-------+ ...
y | 1 | | | 2 | 3 |
+-------+ +-------+
```
Next, the searching thread reads from y but cannot find the value 2!
Clearly, concurrently searching and inserting into a B+ tree requires some sort of locking. There are already a number of locking protocols:
- The simplest protocol requires searches and insertions to lock every node along their path from root to leaf. This protocol is correct, but limits concurrency.
- Smarter protocols have insertions place write intention locks along a path and upgrade those locks to exclusive locks when performing a write. Searches can read nodes with write intention locks on them but not with exclusive locks on them.
- Even smarter protocols lock a subsection of the tree and bubble this subsection upwards through the tree. B-link trees will do something similar but will guarantee that at most three nodes are locked at any given point in time.
#### B-link Trees
- Typically, an internal node in a B+ tree with $n$ keys has $n + 1$ pointers. For example, if an internal node has keys $(5, 10, 15)$, then it has four pointers for values in the ranges $(-\infty, 5)$, $[5, 10)$, $[10, 15)$, and $[15, \infty)$. Internal nodes in a B-link tree with $n$ keys have only $n$ pointers, where the last key is known as the **high key**. For example, if an internal node has keys $(5, 10, 15)$, then it has three pointers for values in the ranges $(-\infty, 5)$, $[5, 10)$, and $[10, 15)$.
- In a B+ tree, leaves are linked together, but internal nodes are not. In a B-link tree, all sibling nodes (internal nodes and leaves) are linked together left to right.
#### Search Algorithm
The search algorithm for a B-link tree is very similar to the search algorithm for a B+ tree. To search for a key $k$, we traverse the tree from root to leaf. At every internal node, we compare $k$ against the internal node's keys to determine which child to visit next.
However, unlike with a B+ tree, we might have to walk rightward along the B-link tree to find the correct child pointer. For example, imagine we are searching for the key $20$ at an internal node $(5, 10, 15)$. Because $20 \gt 15$, we have to walk rightward to the next internal node which might have keys $(22, 27, 35)$. We do something similar at leaves as well to find the correct value.
Note that searching does not acquire any locks.
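The search above can be sketched in code. This is a simplified model rather than the paper's algorithm verbatim: the high key is stored as a separate field, and internal nodes keep the usual $n+1$-pointer B+ tree layout.

```python
from bisect import bisect_right

class Node:
    def __init__(self, keys, children=None, right=None, high=None):
        self.keys = keys          # internal: separator keys; leaf: stored keys
        self.children = children  # len(keys) + 1 children; None for leaves
        self.right = right        # link to right sibling
        self.high = high          # exclusive upper bound of this node's range
                                  # (None for the rightmost node on a level)

def search(node, k):
    while True:
        # Walk rightward past siblings whose key range ends at or below k.
        while node.high is not None and k >= node.high:
            node = node.right
        if node.children is None:                  # leaf: membership test
            return k in node.keys
        node = node.children[bisect_right(node.keys, k)]

# Two internal siblings over four linked leaves.
a, b = Node([1, 2], high=5), Node([5, 7], high=10)
c, d = Node([15, 18], high=22), Node([22, 30])
a.right, b.right, c.right = b, c, d
left = Node([5], children=[a, b], high=10)
right = Node([22], children=[c, d])
left.right = right
root = Node([10], children=[left, right])

assert search(root, 7) is True
assert search(root, 20) is False
# A searcher holding a stale pointer to `left` (say, read just before a
# concurrent split moved 15 rightward) still finds 15 via the right link:
assert search(left, 15) is True
```

The last assertion is the whole point of the right links: a reader that descended through stale data recovers by moving right, without ever taking a lock.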
#### Insertion Algorithm
To insert a key $k$ into a B-link tree, we begin by traversing the tree from root to leaf in exactly the same way as we did for the search algorithm. We walk downwards and rightwards and do not acquire locks. One difference is that we maintain a stack of the rightmost visited node in each level of the tree. Later, we'll use the stack to walk backwards up the tree.
Once we reach a leaf node, we acquire a lock on it and crab rightward until we reach the correct leaf node $a$. If $a$ is not full, then we insert $k$ and unlock $a$. If $a$ is full, then we split it into $a'$ (previously $a$) and $b'$ (freshly allocated). We flush $b'$ to disk and then flush $a'$ to disk. Next, we have to adjust the parent of $a'$ (formerly the parent of $a$). We acquire a lock on the parent node and then crab rightward until we reach the correct parent node. At this point, we repeat our procedure upwards through the tree.
At worst, we hold three locks at a time.
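The "crab rightward" step can be sketched as lock coupling along the sibling links: lock the right sibling before releasing the current node. The node layout here is an assumption for illustration, not the paper's exact structure.

```python
import threading

class Node:
    def __init__(self, keys, right=None):
        self.keys = keys          # sorted; keys[-1] acts as the high key
        self.right = right        # link to right sibling
        self.lock = threading.Lock()

def move_right(node, k):
    """Lock `node`, then crab rightward: while k belongs to a sibling
    further right, lock that sibling BEFORE releasing the current node,
    so this step never holds more than two locks at once."""
    node.lock.acquire()
    while node.right is not None and k >= node.keys[-1]:
        nxt = node.right
        nxt.lock.acquire()    # lock coupling: acquire the right sibling first...
        node.lock.release()   # ...then release the node we came from
        node = nxt
    return node               # returned locked; caller must release

sibling = Node(keys=[22, 27, 35])
start = Node(keys=[5, 10, 15], right=sibling)
target = move_right(start, 20)   # 20 >= 15, so we crab to the sibling
assert target is sibling
target.lock.release()
```

Because locks are always acquired left to right (and bottom to top when ascending to the parent), this ordering is also what makes the deadlock-freedom argument below go through.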
#### Correctness Proof
To prove that the B-link tree works as we intend, we have to prove three things:
- First, we have to prove that multiple threads operating on a B-link tree cannot deadlock. This is straightforward. If we totally order nodes bottom-to-top and left-to-right, then threads always acquire locks according to this total order.
- Second, we have to prove that whenever a thread modifies a B-link tree, the B-link tree still appears to be a valid tree to all threads except the modifying thread. Again, this is straightforward. The insertion procedure only makes three kinds of writes to disk:
- Writing to a node that is not full clearly does not invalidate the tree.
- Writing a newly allocated $b'$ node does not affect the tree because there are not any pointers to it.
- Writing $a'$ atomically transitions the tree from one legal state to another.
- Finally, we have to ensure that concurrently executing operations do not interfere with one another. See paper for proof (it's complicated).
#### Deletion
B-link trees punt on deletion. They simply let leaf nodes get underfull and periodically lock the whole tree and reorganize it if things get too sparse.
[link]
In 1980, synchronization primitives like semaphores, monitors, and condition variables had been well studied in the literature, but there weren't any large systems implementing them. Mesa was a programming language that was being developed to write the Pilot operating system at Xerox. Due to the inherent concurrency of an operating system, Mesa was designed to ease the development of concurrent programs. The Mesa designers considered implementing a message passing interface, but deemed it too complex. They considered semaphores, but found them too undisciplined. They considered cooperative multi-threading but came upon a number of serious disadvantages:
- Cooperative multithreading cannot take advantage of multiple cores.
- Preemption is already required to service time-sensitive I/O devices.
- Cooperation is at odds with modularity. Critical sections have no principled way of knowing if they are calling a function which yields control.
Eventually, Mesa settled on implementing monitors and condition variables and exposed a number of previously undiscussed issues:
- What is the interface for dynamically spawning threads and waiting for them to terminate?
- What is the interface for dynamically constructing monitors?
- How are threads scheduled when waiting and notifying each other?
- What are the semantics of wait when one monitor calls into another monitor which also calls wait?
- How are exceptions and I/O devices handled?
Mesa allowed an arbitrary function call to be forked and run by a separate thread and eventually joined:
![](https://i.imgur.com/McaDktY.png)
Moreover, if a forked thread was not intended to be joined, it could instead be detached via detach[p]. This fork/join style process management had a number of advantages---(i) processes were first class values, (ii) thread forking was type checked, and (iii) any procedure could be forked---but also introduced lots of opportunities for dangling references.
Monitors are objects that only allow a single thread to be executing one of their functions at any given time. They unify data, synchronization of the data, and access of the data into one lexical bundle. Mesa monitors included public entry procedures and private internal procedures that operated with the monitor locked, as well as public external procedures that operated without locking the monitor. Monitors, in conjunction with condition variables, were used to maintain an invariant about an object that was true upon entering and exiting any of the monitor's methods. Monitors also led to potential deadlocks:
- Two monitor methods could wait for one another.
- Two different monitors could enter each other.
- A monitor M could enter a monitor N, then wait on a condition that could only be enabled by another thread entering M through N.
Special care also had to be taken to avoid priority inversion.
Mesa also introduced Mesa semantics, best explained with this code snippet:
```
"""
Condition variables typically obey one of two semantics:
1. Under *Hoare semantics* [1], when a thread calls `notify` on a condition
variable, the execution of one of the threads waiting on the condition
variable is immediately resumed. Thus, a thread that calls `wait` can assume
very strong invariants are held when it is awoken.
2. Under *Mesa semantics* [2], a call to `notify` is nothing more than a hint.
Threads calling `wait` can be woken up at any time for any reason.
Understanding the distinction between Hoare and Mesa semantics can be
solidified by way of example. This program implements a concurrent queue (aka
pipe) to which data can be written and from which data can be read. It spawns a
single thread which iteratively writes data into the pipe, and it spawns
`NUM_CONSUMERS` threads that read from the pipe. The producer produces the same
number of items that the consumers consume, so if all goes well, the program
will run and terminate successfully.
Run the program; not all goes well:
Exception in thread Thread-3:
Traceback (most recent call last):
File "/usr/lib/python2.7/threading.py", line 810, in __bootstrap_inner
self.run()
File "/usr/lib/python2.7/threading.py", line 763, in run
self.__target(*self.__args, **self.__kwargs)
File "hoare_mesa.py", line 66, in consume
pipe.pop()
File "hoare_mesa.py", line 52, in pop
return self.xs.pop(0)
IndexError: pop from empty list
Why? The pipe is implemented assuming Python condition variables obey Hoare
semantics. They do not. Modify the pipe's implementation assuming Mesa
semantics and re-run the program. Everything should run smoothly!
[1]: https://scholar.google.com/scholar?cluster=16665458100449755173&hl=en&as_sdt=0,5
[2]: https://scholar.google.com/scholar?cluster=492255216248422903&hl=en&as_sdt=0,5
"""
import threading

# The number of objects read from and written to the pipe.
NUM_OBJECTS = 10000

# The number of threads consuming from the pipe.
NUM_CONSUMERS = 2

# An asynchronous queue (a.k.a. pipe) that assumes (erroneously) that Python
# condition variables follow Hoare semantics.
class HoarePipe(object):
    def __init__(self):
        self.xs = []
        self.lock = threading.Lock()
        self.data_available = threading.Condition(self.lock)

    # Pop the first element from the pipe, blocking if necessary until data is
    # available.
    def pop(self):
        with self.lock:
            # This code is incorrect because Python condition variables follow
            # Mesa, not Hoare, semantics. To correct the code, simply replace
            # the `if` with a `while`.
            if len(self.xs) == 0:
                self.data_available.wait()
            return self.xs.pop(0)

    # Push a value to the pipe.
    def push(self, x):
        with self.lock:
            self.xs.append(x)
            self.data_available.notify()

def produce(pipe):
    for i in range(NUM_OBJECTS):
        pipe.push(i)

def consume(pipe):
    assert NUM_OBJECTS % NUM_CONSUMERS == 0
    for i in range(NUM_OBJECTS // NUM_CONSUMERS):
        pipe.pop()

def main():
    pipe = HoarePipe()
    producer = threading.Thread(target=produce, args=(pipe,))
    consumers = [threading.Thread(target=consume, args=(pipe,))
                 for i in range(NUM_CONSUMERS)]

    producer.start()
    for consumer in consumers:
        consumer.start()

    producer.join()
    for consumer in consumers:
        consumer.join()

if __name__ == "__main__":
    main()
```
Threads waiting on a condition variable could also be awoken by a timeout, an abort, or a broadcast (e.g. notify_all).
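These variants map onto Python's `threading.Condition`, where `wait(timeout=...)` returns `False` if the timeout elapsed and `notify_all` is a broadcast that wakes every waiter. A minimal sketch:

```python
import threading

cond = threading.Condition()
ready = False
results = {}

def waiter(i):
    with cond:
        while not ready:                      # Mesa semantics: recheck on wakeup
            if not cond.wait(timeout=5.0):    # timed wait; False means timeout
                return                        # gave up waiting
        results[i] = True

threads = [threading.Thread(target=waiter, args=(i,)) for i in range(3)]
for t in threads:
    t.start()

with cond:
    ready = True
    cond.notify_all()    # broadcast: wake all three waiters at once

for t in threads:
    t.join()
assert results == {0: True, 1: True, 2: True}
```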
Mesa's implementation was divided between the processor, a runtime, and the compiler. The processor was responsible for process management and scheduling. Each process was on a ready queue, monitor lock queue, condition variable queue, or fault queue. The runtime was responsible for providing the fork/join interface. The compiler performed code generation and a few static sanity checks.
Mesa was evaluated by Pilot (an OS), Violet (a distributed calendar), and Gateway (a router).
[link]
Lauer and Needham explain the duality in expressiveness and performance between

- message-oriented concurrency models, in which a small number of fixed tasks communicate explicitly, and
- process-oriented concurrency models, in which a larger number of dynamic processes share memory.

Message-oriented systems can be characterized by the following hallmarks, consequences, and provided facilities:

| Hallmark | Consequences | Facilities |
|---|---|---|
| Long-standing communication channels are typically created during program initialization | Synchronization is implicitly performed in the queues connecting processes | Messages and message ids |
| There are a fewer number of long-lasting processes | Shared structures are passed by reference; only processes with a reference to a structure can act on it | Message channels and ports that provide the ability to Send, WaitForReply, WaitForMessage, or SendReply |
| Processes don't share memory | Peripheral devices are treated like processes and communicated with | Process creation (but no deletion) |
| Processes read a small number of messages at a time | | |

Process-oriented systems can be similarly characterized:

| Hallmark | Consequences | Facilities |
|---|---|---|
| Global data can be protected and accessed via interfaces | Synchronization is performed in locks | Procedures |
| Process creation and deletion is a lightweight task | Data is shared directly, with small portions being locked | fork/join procedure invocation |
| Processes typically have a single job | Peripheral interaction typically involves locking and sharing memory | Modules and monitors |
| | | Module instantiation |
| | | Condition variables |

There is a duality between the two concurrency models: any program in one model has a corresponding program in the other. Lauer and Needham demonstrate the duality not by simulating one model's primitives using the other, but by drawing correspondences between the two models' components:

| Message-oriented | Process-oriented |
|---|---|
| Processes, CreateProcess | Monitors, NEW/START |
| Message channels | External procedure identifiers |
| Message ports | ENTRY procedure identifiers |
| SendMessage; AwaitReply | simple procedure call |
| SendMessage; ... AwaitReply | FORK; ... JOIN |
| SendReply | RETURN |
| WaitForMessage loop with case statement | monitor lock, ENTRY attribute |
| arms of case statement | ENTRY procedure declarations |
| selective waiting for messages | condition variables, WAIT, SIGNAL |

This correspondence can be used to directly rewrite a canonical program from one model into the other. In fact, the differences between the two models become simply a matter of keyword choice. Moreover, if both models are implemented with identical blocking and scheduling mechanisms, then the two models yield identical performance as well. Since the choice of model affects neither the user nor the implementor, the decision of which model to use should be based on the architecture of the underlying hardware.
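The duality can be made concrete with a toy example: the same counter written once in the message-oriented style (a long-lived server process draining a message queue) and once in the process-oriented style (a monitor-like object guarded by a lock). All names here are illustrative, not from the paper.

```python
import queue
import threading

# Message-oriented: one long-lived process; clients communicate via messages.
def counter_server(inbox):
    count = 0
    while True:
        op, reply = inbox.get()      # WaitForMessage
        if op == "stop":
            break
        if op == "incr":
            count += 1
        reply.put(count)             # SendReply

# Process-oriented: shared state protected by a monitor-style lock.
class CounterMonitor:
    def __init__(self):
        self._count = 0
        self._lock = threading.Lock()

    def incr(self):
        with self._lock:             # ENTRY procedure
            self._count += 1
            return self._count

# Drive the message-oriented version.
inbox = queue.Queue()
server = threading.Thread(target=counter_server, args=(inbox,))
server.start()
reply = queue.Queue()
for _ in range(3):
    inbox.put(("incr", reply))       # SendMessage...
    msg_result = reply.get()         # ...AwaitReply
inbox.put(("stop", None))
server.join()

# Drive the process-oriented version.
monitor = CounterMonitor()
for _ in range(3):
    proc_result = monitor.incr()

assert msg_result == proc_result == 3
```

Note how each construct on the left of the correspondence table has a direct analogue on the right: the server's receive loop plays the role of the monitor lock, and the `incr` message arm corresponds to the `incr` ENTRY procedure.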