2023-03-01
Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber [Cha+08]
I rate this
Distributed storage system. PB scale. Very applicable, scalable, performant, and available. 60 projects using it. Bigtable is like a database but not relational. Everything is a string, clients need to cope with it.
Sparse, distributed, persistent multi-dimensional sorted map. Indexed by (row: string, column: string, time: int64) -> string. Useful for somebody (like Google perhaps?) who wants to keep a copy of a large collection of web pages and related information. Use URLs as row keys, aspects of pages as column keys, and dump entire web page in.
Arbitrary length on keys. (they pick 64 KB max, in practice, < 100 bytes typical). Every read or write under a row is atomic.
Column keys are grouped into column families. Unbounded number of cols in < 1000 families. Keys use syntax family:qualifier. Example here is language, storing multi-lingual websites in a row by URL and language:english, language:french columns.
Timestamps are just now() sent by the client. Applications deal with collisions.
Bigtable does provide some maid services for last versions, drop after days. Again, useful for caching webpages.
Clients can write, delete, read, read-range, atomic read-modify-write, batching writes, auto-incrementers, running client-supplied scripts in address space of servers. Plays nice with Sawzall [Pik+05], MapReduce [DG04].
SSTable is the file format. Persistent, ordered immutable map from keys to values. Internally, SSTable has 64 KB blocks. Block index is loaded into memory. Lookups are 1 disk seek. Binary search of in-memory index, go straight to block. SSTable can also be in memory.
Bigtable also relies on Chubby [Bur06] (log that never dies). Leases, appointing master, bootstrapping location of data, discover tablet servers, finalize tablet deaths, schema info, access control lists. Chubby dies 0.0047% of the time, 0.0326% worst example.
Master assigns tablets to tablet servers. Takes attendance, balances, garbage collects, schema changes, column family creations. Data does not go through master. Clients don’t often talk to master, master isn’t bottleneck.
As a table grows, it is automatically split into 100–200 MB tablets.
Chubby contains root tablet, root tablet contains (never split) METADATA tablets, pointing to tables. 3 levels is plenty.
Each tablet is assigned to one tablet server at a time. Master keeps track of live tablet servers. Master sends tablet load requests to tablet servers. New tablet servers sign up at Chubby, master watches this list. Tablet servers kill themselves if they get confused. Master detects if tablet server is no longer serving, and just switches to another. Master kills itself if Chubby stops responding, master failures do not change assignment of tablets to tablet servers.
METADATA table to learn about all tablets, populates unassignedMETADATA tablets because it has read root tablet.Created, deleted, two tablets merged to create one, or a split.
Splits are special because tablet servers commit to METADATA themselves.
Master can still learn about it even if commit is lost when it asks the tablet server what it has.