Practical Course: Database Implementation
Content
Short version: In this practical course you will implement a main-memory database system from scratch.
Relational database systems are ubiquitous. Despite being designed decades ago, the fundamental architecture of these general-purpose systems has been virtually unchanged. In particular, these systems have been optimized for minimizing the number of disk I/O operations. In recent times the increasing amount of RAM and fundamental changes to the hardware landscape like the proliferation of multi-core CPUs have changed the underlying performance tradeoffs. The standard query evaluation model ("iterator model" based on interpretation), for example, incurs very large overheads when data mostly resides in RAM. While modern column stores, which enable very fast query processing but are not suitable for online transaction processing, have mostly converged on a common design, this is not the case for hybrid main-memory OLAP/OLTP systems. In this project we instead focus on such hybrid systems, where the architecture differs more between the different proposals.
As a consequence to these trends, core database systems research focusing on how to optimize database systems for modern hardware has proliferated. To design a high-performance database system for modern hardware, two fundamental high-level approaches can are possible. Some researchers start with traditional systems, try to find the bottlenecks, and optimize them one-by-one. The advantage of this approach is that one has a fully-featured system as a basis (e.g., the open source PostgreSQL system or the storage manager Shore). The disadvantage is that one is still constrained by the old architecture risking to get stuck in a "local minimum".
In this course we take the second, opposite approach. We to start with a very fast, but limited, kernel of database functionality and gradually extend the functionality to make it more "database-like". Specifically, we start by hand-coding the core parts of TPC-C, a common transaction processing (OLTP) benchmark, in C++. This enables very high performance from the start and allows to focus attention on fundamental design decision like how to store and index data most efficiently. It also allows for easy experimentation with the tradeoffs between row and column stores. We then focus on efficient analytical query processing. To this end, we again manually implement queries based on the previously designed physical database storage format and again without being constrained without being constrained by architectural design decisions.
In the second phase, we use compilation as our main technique to automatically generate the code that previously hand-coded. Instead of using the iterator model, this approach compiles queries directly to fast code that consists of tight loops and directly operates on the data representation. We use compilation both to generate the database storage data structures and access methods as well as code for individual queries. To compile complex, analytical multi-operator queries, which are generally represented as trees of algebra operators, we implement a generic query compiler that compiles a tree of algebra operators to machine code. For fast prototyping, we use C++ as the intermediate language, which is compiled at runtime by calling a C++ compiler.
Prerequisites
- Lecture Fundamentals of Databases (IN0008) or similar course
- Systems programming experience in C/C++
- Beneficial: lecture Database Systems on Modern CPU Architectures (IN2118)
- The course will be in English.
Organization
- In order to take this course, you have to complete this simple task, which is meant to test if you have the prerequisites for the course. Send the solution to [email protected] as early as possible but before 10 October. Once we have received the solution, we will confirm that you have a spot. Note that you have to complete this task regardless whether you were assigned a spot for this course by the matching system or not.
- GitLab Group
- Mattermost Team