Bigtable — Amos Hebb

Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber [Cha+08]

Bottom Line Up Front

Summary

Introduction

Distributed storage system. PB scale. Very applicable, scalable, performant, and available. 60 projects using it. Bigtable is like a database but not relational. Everything is a string, clients need to cope with it.

Data Model

Sparse, distributed, persistent multi-dimensional sorted map. Indexed by (row: string, column: string, time: int64) -> string. Useful for somebody (like Google perhaps?) who wants to keep a copy of a large collection of web pages and related information. Use URLs as row keys, aspects of pages as column keys, and dump entire web page in.

Arbitrary length on keys. (they pick 64 KB max, in practice, < 100 bytes typical). Every read or write under a row is atomic.

Column keys are grouped into column families. Unbounded number of cols in < 1000 families. Keys use syntax family:qualifier. Example here is language, storing multi-lingual websites in a row by URL and language:english, language:french columns.

Timestamps are just now() sent by the client. Applications deal with collisions.

Bigtable does provide some maid services for last

𝑛

versions, drop after

𝑛

days. Again, useful for caching webpages.

API

Clients can write, delete, read, read-range, atomic read-modify-write, batching writes, auto-incrementers, running client-supplied scripts in address space of servers. Plays nice with Sawzall [Pik+05], MapReduce [DG04].

Building Blocks

SSTable is the file format. Persistent, ordered immutable map from keys to values. Internally, SSTable has 64 KB blocks. Block index is loaded into memory. Lookups are 1 disk seek. Binary search of in-memory index, go straight to block. SSTable can also be in memory.

Bigtable also relies on Chubby [Bur06] (log that never dies). Leases, appointing master, bootstrapping location of data, discover tablet servers, finalize tablet deaths, schema info, access control lists. Chubby dies 0.0047% of the time, 0.0326% worst example.

Master assigns tablets to tablet servers. Takes attendance, balances, garbage collects, schema changes, column family creations. Data does not go through master. Clients don’t often talk to master, master isn’t bottleneck.

Chubby contains root tablet, root tablet contains (never split) METADATA tablets, pointing to tables. 3 levels is plenty.

Tablet Assignment

Each tablet is assigned to one tablet server at a time. Master keeps track of live tablet servers. Master sends tablet load requests to tablet servers. New tablet servers sign up at Chubby, master watches this list. Tablet servers kill themselves if they get confused. Master detects if tablet server is no longer serving, and just switches to another. Master kills itself if Chubby stops responding, master failures do not change assignment of tablets to tablet servers.

Master startup

New tablets

Master can still learn about it even if commit is lost when it asks the tablet server what it has.

Tablet Serving

Bibliography

[Cha+08] F. Chang et al., “Bigtable: A Distributed Storage System for Structured Data,” ACM Transactions on Computer Systems, vol. 26, no. 2, p. 4:1–4:26, 2008, doi: 10.1145/1365815.1365816.
[Pik+05] R. Pike, S. Dorward, R. Griesemer, and S. Quinlan, “Interpreting the Data: Parallel Analysis with Sawzall,” Scientific Programming, vol. 13, no. 4, pp. 277–298, 2005.
[DG04] J. Dean and S. Ghemawat, “MapReduce: Simplified Data Processing on Large Clusters,” in Proceedings of the 6th Symposium on Operating System Design and Implementation, San Francisco, CA, 2004, pp. 137–150.
[Bur06] M. Burrows, “The Chubby Lock Service for Loosely-Coupled Distributed Systems,” in Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation, Seattle, WA, 2006, pp. 335–350.