Notes on Fowler's NoSQL Distilled and the SJSU CMPE 281 Course

For the Cloud Technology course, Fall 2018

Fowler's NoSQL Distilled is a good tutorial for anyone who wants to step into the new "polyglot persistence" world. Before this course, I had focused only on relational databases (RDBs), such as MS SQL Server, MySQL, and Oracle. I did not actually attend the course lectures in class; instead, I followed the syllabus and labs to complete the course. Here I combine my book notes and my course lab notes into a brief NoSQL tutorial.

Notes on the Book

Let's get back to the book. Below is my understanding of the book and the related labs from the course.

The book is split into two parts: the first covers the concepts behind NoSQL and its rise; the second goes into a little detail on each type of NoSQL database, such as Redis, Cassandra, MongoDB, and Neo4j.

  1. The rise of NoSQL

I like the first part's introduction to NoSQL. The web and big data created the call for the rise of open-source NoSQL, which derives from Google's BigTable and Amazon's Dynamo. Such gigantic companies had the money and developers to turn these non-relational concepts into products.

The world is not formed only of relationships; much of it is aggregation. For example, an order is an aggregate containing a customer profile, a shipping address, order items, payment info, and so on. Each sub-item (such as an order item) maps naturally to relational data, but sometimes we need to handle the whole thing as a single unit. That is the transaction concept: to save all those sub-items into RDB tables, you need a transaction that saves either everything or nothing. But if we treat the whole aggregate as one unit in a NoSQL database, such as Cassandra, we do not need a transaction at all. That is one reason for the rise of NoSQL.
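To make the contrast concrete, here is a minimal sketch of the same order modeled both ways; all field names and values are illustrative, not from any real schema.

```python
# Aggregate style: everything about the order travels together, so a
# document database can save or load it as a single unit.
order = {
    "order_id": "ORD-1001",
    "customer": {"name": "Alice", "email": "alice@example.com"},
    "shipping_address": {"street": "1 Main St", "city": "San Jose"},
    "items": [
        {"sku": "BOOK-1", "title": "NoSQL Distilled", "qty": 1, "price": 30.0},
        {"sku": "PEN-2", "qty": 3, "price": 1.5},
    ],
    "payment": {"method": "card", "last4": "4242"},
}

# Relational style: the same data split into rows for separate tables,
# which only stay mutually consistent if written inside one transaction.
orders_row = ("ORD-1001", "alice@example.com")
order_items_rows = [
    ("ORD-1001", "BOOK-1", 1, 30.0),
    ("ORD-1001", "PEN-2", 3, 1.5),
]

# Because the aggregate is self-contained, computing over it needs no join.
total = sum(i["qty"] * i["price"] for i in order["items"])
print(total)  # 34.5
```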

As the web matured, we accumulated lots of data: not only from web pages, but also from devices (the Internet of Things), social networks, and so on. Our applications must handle this data under huge numbers of concurrent requests. Before the internet era, we used RDBs to store such data, which raises a question for the RDB: how can it read and write data quickly enough? You can either scale up or scale out. Scaling up is unwise, because the data keeps growing until a single server can no longer handle it, and a single server is also a big failure risk. That leaves the scale-out strategy: the cluster. An RDB cluster has a per-server license cost, and the data storage typically remains centralized, so the single-storage failure risk remains, and transaction read/write locks can hurt performance. That is the second reason for the rise of NoSQL. A NoSQL cluster does not use transaction locks to guarantee the RDB's ACID properties (Atomic, Consistent, Isolated, Durable). Instead it uses a BASE strategy (Basically Available, Soft state, Eventual consistency) for data consistency. This leads to the next note, on consistency.

  1. The NoSQL CAP theorem (Consistency, Availability, Partition tolerance)

The CAP theorem states, and it has been proved, that no distributed database can provide all three guarantees at once. Consistency means updated data can be read without ever seeing stale data. Availability means that if you can send a request to a server, you get a read/write response from it. Partition tolerance means that when the cluster splits into two or more groups that cannot communicate, you can still get service.

By this definition, MongoDB is CP: when a partition happens, you can still read from a slave, but you cannot write while no master is elected. The data is always consistent, because writes go only to the master and are then synced to the slaves.

Cassandra is AP: its network is a peer-to-peer ring with no master/slave concept. With a simple node configuration, you can always read and write through any available peer, but the data you read may be stale, because replicas are not immediately consistent. Of course, you can use stricter read/write configurations, such as quorum/majority reads and writes, to make sure no stale data is read.
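The rule behind "quorum reads and writes prevent stale reads" can be sketched in a few lines: with N replicas, a write acknowledged by W nodes, and a read from R nodes, every read must overlap at least one up-to-date replica whenever R + W > N. This is a standard property of quorum systems, shown here as plain arithmetic rather than any particular database's API.

```python
def read_sees_latest_write(n_replicas: int, w: int, r: int) -> bool:
    """True if any R-node read must intersect the W nodes holding the write."""
    return r + w > n_replicas

# Consistency level ONE for both reads and writes on 3 replicas:
# the read set and write set may miss each other, so stale reads are possible.
print(read_sees_latest_write(3, 1, 1))  # False

# QUORUM reads and writes (2 of 3 each) are guaranteed to overlap.
print(read_sees_latest_write(3, 2, 2))  # True
```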

Whichever side of CAP a NoSQL database falls on, you still need to think about read performance and write performance. We will discuss that in the next note.

  1. Replication and sharding

Replication is a good way to survive data disasters and to improve read performance. Because the data is replicated across the nodes of the NoSQL cluster, you are safe if one or two nodes fail. Also, since reads can be served from replica nodes, concurrent reads are dispatched across the nodes, which improves read performance.

Sharding is a good strategy for improving write performance. Sharding means partitioning the data across different nodes in the cluster. For example, one node may contain customers with last names from A to M, and another node those from N to Z. Concurrent writes for an "A" customer and a "Z" customer are then handled independently by two different nodes.
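The A–M / N–Z routing above boils down to a small shard-key function; the node names here are hypothetical, and real databases use hashed or range shard keys configured in the cluster rather than application code like this.

```python
def shard_for(last_name: str) -> str:
    """Route a customer record to a node by the first letter of the last name."""
    first = last_name[0].upper()
    return "node-AM" if "A" <= first <= "M" else "node-NZ"

# Writes for these two customers land on different nodes,
# so they can proceed in parallel.
print(shard_for("Adams"))  # node-AM
print(shard_for("Zhang"))  # node-NZ
```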

Most of the time, we use both replication and sharding in a NoSQL database. Because we may read different (stale) copies of a record from different nodes, each record needs a timestamp so the latest data can overwrite the stale data; the simplest versioning technique is a counter number. If one node's copy of a record has a higher number than the other nodes' copies, it is the latest.
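The counter-based resolution described above amounts to "last write wins": compare the version numbers of the replicas and keep the highest. This is a deliberately simplified sketch; real systems often use richer schemes such as vector clocks to detect true conflicts.

```python
def latest(replicas):
    """Pick the copy with the highest version counter (last-write-wins)."""
    return max(replicas, key=lambda rec: rec["version"])

# Three replicas of the same record after a partition heals;
# node n2 missed the most recent update.
copies = [
    {"node": "n1", "version": 3, "value": "shipped"},
    {"node": "n2", "version": 2, "value": "paid"},  # stale copy
    {"node": "n3", "version": 3, "value": "shipped"},
]
print(latest(copies)["value"])  # shipped
```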

  1. Four types of NoSQL databases

The book classifies NoSQL databases into four types, according to how the data is stored: key-value, document, column-family, and graph. Key-value and document are easy to understand from their names. The column-family type can be treated as a two-level aggregate structure; it suits rarely-written but heavily-read cases, such as Google search. The graph type is used for complex relationship data, such as friendships on Facebook. It uses graph algorithms to traverse the nodes and answer complex queries, such as which of Fowler's friends bought a book named "NoSQL Distilled" that is liked by President Obama.

Real-world data is diverse, and so is data in the computing world, so we can use different persistence/storage strategies within one application. For example, we can use a key-value database to store session data, a document database to store order data, and a graph database to store customer/product relationship data, so that an algorithm can recommend products to our customers.

Most of the time, a NoSQL database stores one type of aggregate. For example, we may store each order as an aggregate that includes all sold products and prices. However, if we want the sales history of one product, we have to go through all the orders to get the data, which is time-consuming.

So the book introduces the Map-Reduce concept for building a materialized view, for example a product sales-history view, similar to the view concept in an RDB.

  1. Map-Reduce

Map means extracting key-value pairs from an aggregate, such as generating a (product, quantity) pair for every item sold in one order. Reduce means summarizing all data with the same key into one record, such as grouping those product-quantity pairs into a sum per product, which yields the sales view for each product.

We can exploit parallelism in the map procedure, since the key-value pairs are produced independently on each node. We can also add an intermediate step, such as a combining reduce on each node, to accelerate the reduce procedure.
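The two steps above can be sketched in plain Python, without any framework: map each order aggregate to (product, quantity) pairs, then reduce by summing per product to build the sales view. The order data is made up for illustration.

```python
from collections import defaultdict

# Two order aggregates, as a document database might store them.
orders = [
    {"items": [{"sku": "BOOK-1", "qty": 1}, {"sku": "PEN-2", "qty": 3}]},
    {"items": [{"sku": "BOOK-1", "qty": 2}]},
]

# Map: emit (key, value) pairs from each aggregate independently --
# this is the step that can run in parallel on each node.
pairs = [(item["sku"], item["qty"])
         for order in orders
         for item in order["items"]]

# Reduce: sum the values for each key to build the materialized sales view.
sales_view = defaultdict(int)
for sku, qty in pairs:
    sales_view[sku] += qty

print(dict(sales_view))  # {'BOOK-1': 3, 'PEN-2': 3}
```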

With a map-reduce framework, it is easy to build an incremental materialized view that reflects newly arriving data.

Notes on the Course Labs

If you only read the book without doing the exercises/labs, you will forget it and miss the details. Here is a lab that asks you to create a CP MongoDB sharding/replica cluster and an AP Cassandra cluster, and to test the difference in data availability and consistency when a partition happens.

  1. MongoDB cluster on Amazon AWS

Thanks to the free-tier policy in Amazon AWS, it is not difficult to build the cluster through the web pages below.

Here is the MongoDB architecture diagram (from the internet, credited to ??).

Below are screenshots from creating the MongoDB sharding/replica clusters.

In this cluster, 3 replicas are kept for each set of data.

To run the partition test, you disconnect the master. For a short period, the two slaves will elect a new master. During that period you cannot write any data to the cluster, because the master is down; however, you can still read data from the slaves. So the cluster is always consistent at the price of availability. That means MongoDB is CP.

  1. Cassandra cluster on Amazon AWS

Here is the architecture of the cluster.

We set the replication factor to 3 across 4 nodes, and the consistency level to one node for both reads and writes. This makes Cassandra always available: as long as you can reach one node, you can read and write even when 2 other nodes are down. However, it sacrifices consistency, since two nodes may hold different copies of the data, one stale and one latest. In this configuration, Cassandra is an AP NoSQL database.
