BigTable, Google File System / Colossus
GFS and BigTable, two of the earliest papers to describe a large-scale distributed file system and storage system, come from Google’s internal storage systems serving indexing and many other applications.
Google File System adopts a straightforward master-server-to-chunk-server design and is implemented in user space. It partitions files into 64-MB chunks and stores copies of each chunk on several chunk servers. The master node doesn’t store the actual chunks, but rather all the metadata associated with them, e.g. the tables mapping chunk labels to chunk servers and the locations of copies. In GFS, read requests are forwarded to the corresponding chunk servers, which reply to the client directly for scalability. Write requests are a little more complicated because the master node has to give one chunk server permission to modify the chunk. This is handled by a “lease protocol”, in which the master grants permission to a process for a finite period of time during which no other process will be granted that permission.
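To make the metadata/data split concrete, here is a minimal sketch of how a client might translate a byte offset into a chunk lookup against the master, assuming the 64-MB chunk size from the paper. All names (paths, handles, server ids) are invented for illustration; the real protocol is RPC-based.

```python
# Hypothetical sketch of GFS-style chunk lookup (illustrative names only).
CHUNK_SIZE = 64 * 1024 * 1024  # 64-MB chunks, as described in the paper

# Master's metadata: (file, chunk index) -> (chunk handle, replica locations).
# The master holds only this mapping, never the chunk data itself.
chunk_table = {
    ("/logs/web.log", 0): ("handle-17", ["cs-a", "cs-b", "cs-c"]),
    ("/logs/web.log", 1): ("handle-42", ["cs-b", "cs-d", "cs-e"]),
}

def locate(path, byte_offset):
    """Translate a byte offset into a chunk handle plus replica list."""
    index = byte_offset // CHUNK_SIZE
    return chunk_table[(path, index)]

# Reading at offset 70 MB falls into chunk 1; the client then fetches the
# bytes from a chunk server directly, keeping the master off the data path.
handle, replicas = locate("/logs/web.log", 70 * 1024 * 1024)
```

The point of the sketch is the division of labor: one cheap metadata lookup at the master, then all bulk data flows client-to-chunk-server.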
GFS achieves great scalability and throughput: 100 MB/s or more for read requests. However, it adopts a single serialized master-node design, which could become a bottleneck for the whole system. This issue is addressed in later papers, as I recall. After all, as the authors claimed, keeping things simple is the most important lesson they learned.
BigTable is another Google internal product, built on GFS and Chubby to provide a compressed, high-performance, proprietary data storage system. It adopts a master-server-to-tablet-server design, where each tablet server takes care of 10 to 1,000 tablets (like pages) and the master server maintains the metadata of tablet locations and handles (forwards) client requests. Different from GFS’s single-master design, BigTable leverages Chubby, itself a group of 5 nodes, to maintain the master server. When a client request comes in, the master node looks up the corresponding tablet server in one disk operation and forwards the request to that tablet server. In addition, BigTable is able to recover from crashes by reconstructing tablets from logs stored in GFS.
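The tablet lookup itself is simple once you have the metadata: a table is split into contiguous row ranges, and the first tablet whose end key is at or past the row key owns that row. A minimal sketch, with invented server names and toy row keys:

```python
import bisect

# Hypothetical sketch of tablet-location lookup. A table is partitioned
# into row ranges ("tablets"); each range is served by one tablet server.
tablet_ends = ["g", "p", "~"]              # sorted end-row keys of the tablets
tablet_servers = ["ts-1", "ts-2", "ts-3"]  # server responsible for each range

def find_tablet_server(row_key):
    # The first tablet whose end key is >= the row key owns that row.
    i = bisect.bisect_left(tablet_ends, row_key)
    return tablet_servers[i]
```

So `find_tablet_server("kitten")` lands in the `(g, p]` range and routes to `ts-2`; the same binary-search-over-sorted-ranges idea underlies the real location hierarchy.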
Hmm, a pretty good and brief design for large scale, distributed tables.
“In 2012, Google moved its search to a revamped file system known as Colossus. Whereas GFS was built for batch operations (i.e. happened in background before actually applied to a live website), Colossus is built for realtime services.” Another interesting thing here is that “Google has also dropped MapReduce but uses a new platform called Caffeine that operates more like a database”. One change in Colossus is that they’re now using multiple master nodes instead of the single one in GFS! Here are some internal notes on Colossus.
- Next-generation cluster-level file system
- Automatically sharded metadata layer
- Data typically written using Reed-Solomon (1.5x)
- Client-driven replication and encoding
- Metadata space has enabled availability analyses
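The “Reed-Solomon (1.5x)” line deserves a quick bit of arithmetic. The notes don’t state the code parameters, but the 1.5x figure is consistent with, say, an RS(6,3) layout: 6 data blocks plus 3 parity blocks. The numbers below are therefore illustrative, not confirmed Colossus parameters.

```python
# Storage overhead of erasure coding vs. plain replication.
# RS(6,3) is an assumed example consistent with the "1.5x" in the notes.
data_blocks, parity_blocks = 6, 3

rs_overhead = (data_blocks + parity_blocks) / data_blocks  # bytes stored per byte of data
replication_overhead = 3.0                                 # classic GFS-style 3 full copies

# RS(6,3) tolerates the loss of any 3 blocks, yet stores half as many
# bytes as 3x replication -- the appeal of client-driven encoding.
```

That halving of storage cost, while keeping comparable failure tolerance, is the usual motivation for moving from replication to erasure coding.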
Google has developed a bunch of newer storage and computation systems but published papers on none of them. From this document, we can see how quickly the requirements change.
“HBase is an open source, non-relational, distributed database modeled after Google’s BigTable and is written in Java. It is developed as part of Apache Software Foundation‘s Apache Hadoop project and runs on top of HDFS (Hadoop Distributed Filesystem), providing BigTable-like capabilities for Hadoop. That is, it provides a fault-tolerant way of storing large quantities of sparse data.” – wiki
Tables in HBase serve as input and output for MapReduce jobs run in Hadoop and may be accessed via the Java API. HBase is not a direct replacement for a SQL database because of performance. However, it now serves data-driven websites including Facebook’s Messaging Platform.
Facebook moved from Cassandra to HBase+MySQL in 2010, but why does that seem to be a better way to serve Facebook’s large data-driven website? Instant messaging itself is unlikely to require a large-scale storage system, but the message history is. Those could be quite simple structured queries like “SELECT message WHERE username=Alex”. In this scenario, Cassandra’s hybrid design won’t help much.
“In my opinion, these differing histories have resulted in HBase being more suitable for data warehousing, and large scale data processing and analysis (for example, such as that involved when indexing the Web) and Cassandra being more suitable for real time transaction processing and the serving of interactive data.” from Dominic’s Blog.
In my understanding, HBase provides better throughput, fault tolerance, compression rates and locality, whereas Cassandra does better on latency. The former advantages can be achieved by leveraging a huge number of data servers and better data-partitioning and replication strategies. To reduce latency, you need a better caching strategy and fewer round trips between master nodes and data servers.
Dynamo is a highly available key-value store service used in Amazon.com’s ecommerce operations. One of Dynamo’s authors, Avinash Lakshman, joined Facebook after the project and started the Cassandra project.
Different from previous key-value stores and databases, Dynamo sacrifices isolation and weakens consistency to guarantee high availability even in the event of failures, network partitions and errors. In addition, Dynamo provides an “always writable” service with extensive object versioning, and leaves conflict resolution to read queries.
To achieve those goals, Dynamo uses a distributed hash table for the underlying architecture and guarantees eventual consistency instead of strong consistency. Instead of resolving conflicts during writes, Dynamo creates a new object (actually a new version of the object) for any write. Each object instance is associated with a vector clock which records where it came from. During a read operation, the node merges all versions in *its* current knowledge base through *user-defined* functions. Object information is propagated through the p2p network to achieve eventual consistency.
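A minimal vector-clock sketch makes the versioning story concrete. A clock maps node id to a counter; one version supersedes another if its clock is greater than or equal to the other component-wise, and otherwise the two versions conflict and must be reconciled at read time. This is an illustrative toy, not Dynamo’s actual implementation.

```python
# Toy vector clocks: illustrative only, node ids invented.
def increment(clock, node):
    """Return a new clock with `node`'s counter bumped by one."""
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    """True if the version with clock `a` supersedes the one with clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

v1 = increment({}, "sx")   # {"sx": 1}
v2 = increment(v1, "sx")   # {"sx": 2}            -- descends from v1
v3 = increment(v1, "sy")   # {"sx": 1, "sy": 1}   -- concurrent with v2

# Neither v2 nor v3 supersedes the other, so a read returning both
# must merge them (in Dynamo, via a user-defined reconciliation function).
conflict = not descends(v2, v3) and not descends(v3, v2)
```

The `conflict` case is exactly where Dynamo hands control back to the application at read time.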
Another interesting design in Dynamo is its “ring” partition schema. Each Dynamo instance has a “ring” space, where the storage hosts are randomly distributed on the ring (a storage host can map to multiple virtual nodes on the ring). Any data item is hashed to a position on the ring and stored at the first virtual node whose position is equal to or larger than the hash value. The “ring” schema allows fast node insertion/deletion and failure handling. In addition, it’s very flexible.
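The ring walk described above is just a binary search over sorted hash positions. A minimal consistent-hashing sketch, with invented host and key names (and MD5 as an arbitrary stand-in hash):

```python
import bisect
import hashlib

RING_BITS = 32  # size of the toy ring space

def ring_pos(key):
    """Hash a string to a position on the ring (MD5 chosen arbitrarily)."""
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % (2 ** RING_BITS)

class Ring:
    def __init__(self, vnodes):
        # vnodes: virtual-node names, typically several per physical host
        self.positions = sorted((ring_pos(v), v) for v in vnodes)

    def owner(self, key):
        # First virtual node clockwise from the key's hash, wrapping around.
        i = bisect.bisect_left(self.positions, (ring_pos(key), ""))
        return self.positions[i % len(self.positions)][1]

vnodes = [f"host-{h}-vn{v}" for h in "abc" for v in range(3)]
ring = Ring(vnodes)
owner = ring.owner("user:alex")
```

Adding or removing a virtual node only moves the keys in its arc of the ring, which is why membership changes and failures are cheap to handle.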
However, the random “ring” schema is intuitively not a stable design. It’s possible that all replicas of a file end up in the same data center; in the event of an entire-datacenter outage, you could lose every copy of that file. On the other hand, if the underlying storage system guarantees replication across datacenters, this kind of data loss could be avoided.
As an Apache Software Foundation top-level project initially developed by Facebook, Apache Cassandra was described as a BigTable data model running on an *Amazon Dynamo-like infrastructure*. Facebook used it to power their Inbox Search feature until 2010. Nevertheless, Cassandra continues to be used by Twitter, though not for Tweets.
It’s a structured key-value store with the following features.
- It provides tunable consistency, which basically means the client application is able to determine the level of consistency for any given read/write operation.
Cassandra allows different levels of consistency as a tradeoff between efficiency and correctness. This can be a good approach, but it leaves the per-query choice to application developers, who are more likely to commit incorrect code. In addition, it’s hard to predict how queries with different consistency levels interact on the same piece of data.
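The standard way to reason about these per-query choices is quorum arithmetic: with N replicas, a write acknowledged by W nodes and a read contacting R nodes are guaranteed to overlap whenever R + W > N. A small sketch of that rule (the function name is mine, the inequality is the textbook one):

```python
# Quorum overlap rule behind tunable consistency.
def read_sees_latest_write(n, r, w):
    """With n replicas, does an r-node read overlap a w-node write?"""
    return r + w > n

# N=3: QUORUM reads paired with QUORUM writes always overlap, so reads
# see the latest write; ONE/ONE is faster but only eventually consistent.
strong = read_sees_latest_write(3, 2, 2)    # True
eventual = read_sees_latest_write(3, 1, 1)  # False
```

This is exactly why mixing consistency levels on the same data is risky: a QUORUM read only helps if the writes it races against were also acknowledged by enough replicas.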
- Keys map to multiple values, which are grouped into column families. Column families are fixed when a Cassandra database is created, but columns can be added to a family at any time. In addition, columns are added to specific keys, so different keys can have different numbers of columns in any given family. Values from a column family for each key are stored together. This is somewhat similar to a row-oriented store.
Let’s think about how this feature can make a difference. If a query requires at most one column from each column family, Cassandra performs much like a traditional key-value store. But if a query requires multiple columns from a column family, e.g. joining two columns, Cassandra can accelerate the computation because of better locality.
On the other hand, this data-management strategy makes no strong assumptions about the number of columns, so developers can more easily misuse columns without any warning.
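The layout described above can be modeled as a two-level map: family → key → its own set of columns. All names below are invented; this is just a toy to show the sparse, per-key column sets.

```python
# Toy model of a column-family store (illustrative names only).
# Families are fixed up front; each key carries its own columns.
table = {
    "users": {                           # a column family
        "alex":  {"name": "Alex", "city": "NYC"},
        "blake": {"name": "Blake"},      # different columns per key is fine
    },
}

def add_column(family, key, column, value):
    """Attach a new column to one key; no schema change required."""
    table[family].setdefault(key, {})[column] = value

add_column("users", "blake", "city", "SF")  # columns can appear at any time

# Because all of a key's columns in a family are stored together, reading
# both "name" and "city" for "blake" touches a single locality unit.
```

The same flexibility that makes this convenient is what lets column sets drift apart silently across keys, which is the misuse risk noted above.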
- Additional features include: the BigTable way of modeling, eventual consistency, the Gossip protocol, and a master-master way of serving read and write requests inspired by Amazon’s Dynamo.