Qudosoft goes Big Data Part 4

"Qudosoft goes Big Data" is an ongoing series of blog posts in which we share our experience, the first steps we took, and the best practices we found for our use cases.
In previous posts we discussed Spark, Hadoop and its ecosystem, and Cluster and Software. We continue with a short blog post about Cassandra.


Currently, we are building a data hub for our parent company. Since it is hard to anticipate how the volume of your incoming data will grow, one of the most important requirements for a data hub is scalability. 'Volume', one of the 4 V's of Big Data, addresses exactly this issue of unpredictable growth in incoming data. One possible way to persist incoming data at scale is Cassandra.

Cassandra is a NoSQL database built at Facebook in 2008 for handling large volumes of data. It offers linear scalability and is now an Apache project. In terms of the CAP theorem, Cassandra is optimized for writes and availability rather than for strict consistency. It expects that nodes in your cluster will fail. You can set the consistency level yourself and choose different levels for different data as you need them.
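To make the idea of a configurable consistency level a bit more concrete, here is a minimal sketch of the usual overlap rule: a read is guaranteed to see the latest acknowledged write when the number of replicas that acknowledge the write plus the number of replicas consulted for the read exceeds the replication factor. The function names `quorum` and `reads_see_latest_write` are made up for this illustration and are not part of any Cassandra API.

```python
def quorum(replication_factor: int) -> int:
    """Smallest number of replicas that forms a majority."""
    return replication_factor // 2 + 1


def reads_see_latest_write(replication_factor: int, write_acks: int, read_acks: int) -> bool:
    """True if every set of read replicas must overlap every set of write replicas."""
    return write_acks + read_acks > replication_factor


if __name__ == "__main__":
    rf = 3  # illustrative replication factor
    q = quorum(rf)
    # Quorum writes plus quorum reads overlap in at least one replica.
    print(q, reads_see_latest_write(rf, q, q))
    # Writing to one replica and reading from one replica may miss the write.
    print(reads_see_latest_write(rf, 1, 1))
```

With a replication factor of 3, writing and reading at quorum trades some latency for consistency, while writing and reading at a single replica favors speed and availability.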

Cassandra itself is built as a ring cluster. There is no master-slave architecture, as is fairly common in the Hadoop ecosystem, so you avoid having a single point of failure. It is easy to add new Cassandra instances to an existing cluster, and a new instance is integrated via ring communication. Sharding of the data is handled automatically when an instance is added or fails.

At first glance Cassandra may look like a relational database, but it is not really comparable to one since it is column-oriented. This means that you can add and delete columns without problems, and each column value is associated with your primary key (the so-called row key). Think of it as one very big table, following the ideas of Google's Bigtable and Amazon's Dynamo. By a big table I mean a very large key-value store in which a key can have many values, each associated with a certain column. Every written value is stored together with its write timestamp, which Cassandra uses to decide which value is the most recent one for a column.

Cassandra allows you to define multiple datacenters. A datacenter is a cluster or ring that has its own configuration and can, for example, be used only for insertions. You could then add a second datacenter that does not need the same fault tolerance as the first and has a different configuration, with a different replication factor, optimized for reading.
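The ring and its automatic sharding can be pictured with a small consistent-hashing sketch. This is only an illustration under simplifying assumptions: each node gets exactly one position on the ring, MD5 is used as the hash, and the names `Ring`, `token`, `node-a` and so on are invented for the example. Cassandra's real partitioner and token assignment work differently in detail.

```python
import bisect
import hashlib


def token(value: str) -> int:
    """Map a string to a position on the ring (MD5 is an illustrative choice)."""
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16)


class Ring:
    """Toy hash ring: every node owns the keys between its predecessor and itself."""

    def __init__(self, nodes=()):
        self._tokens = []  # sorted ring positions
        self._owners = {}  # ring position -> node name
        for node in nodes:
            self.add_node(node)

    def add_node(self, node: str) -> None:
        t = token(node)
        bisect.insort(self._tokens, t)
        self._owners[t] = node

    def owner(self, row_key: str) -> str:
        """Return the first node at or after the key's position, wrapping around."""
        i = bisect.bisect_left(self._tokens, token(row_key))
        return self._owners[self._tokens[i % len(self._tokens)]]


if __name__ == "__main__":
    ring = Ring(["node-a", "node-b", "node-c"])
    keys = [f"user-{i}" for i in range(1000)]
    before = {k: ring.owner(k) for k in keys}
    ring.add_node("node-d")
    moved = [k for k in keys if ring.owner(k) != before[k]]
    # Only keys that now belong to the new node change their owner.
    print(len(moved), all(ring.owner(k) == "node-d" for k in moved))
```

The point of this design is that adding an instance only moves the keys that fall into the new instance's range. All other keys stay where they are, which is what makes growing the cluster cheap.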

Cassandra has its own query language called CQL, which looks very similar to SQL, but be aware that the schema imposes some very different restrictions. For example, you cannot filter on fields that are neither part of your primary key nor covered by a secondary index. A primary key can consist of partition keys and clustering keys. Partition keys determine how your data is sharded, and clustering keys determine how it is sorted. A query has to restrict all fields that make up the partition key. Secondary indexes can only be queried for equality, not for ranges such as greater than or less than. For a better understanding of Cassandra's data model, I recommend having a look at the self-paced course by DataStax.
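The roles of the two key types can be illustrated with a toy in-memory table. This is a sketch of the idea only, not how Cassandra is implemented: the class `ToyTable` and its methods are hypothetical, it supports a single partition key and a single clustering key, and it ignores secondary indexes entirely.

```python
import bisect
from collections import defaultdict


class ToyTable:
    """In-memory stand-in for a table with one partition key and one clustering key."""

    def __init__(self):
        # partition key -> list of (clustering key, value), kept sorted by clustering key
        self._partitions = defaultdict(list)

    def insert(self, partition_key, clustering_key, value):
        rows = self._partitions[partition_key]
        keys = [ck for ck, _ in rows]
        i = bisect.bisect_left(keys, clustering_key)
        if i < len(rows) and rows[i][0] == clustering_key:
            rows[i] = (clustering_key, value)  # same primary key: overwrite
        else:
            rows.insert(i, (clustering_key, value))

    def select(self, partition_key=None, ck_from=None, ck_to=None):
        """Equality on the partition key is required; a range on the clustering key is optional."""
        if partition_key is None:
            raise ValueError("partition key must be restricted by equality")
        rows = self._partitions.get(partition_key, [])
        return [(ck, v) for ck, v in rows
                if (ck_from is None or ck >= ck_from) and (ck_to is None or ck < ck_to)]


if __name__ == "__main__":
    readings = ToyTable()  # hypothetical table: sensor id -> readings ordered by time
    readings.insert("sensor-1", 30, "c")
    readings.insert("sensor-1", 10, "a")
    readings.insert("sensor-1", 20, "b")
    print(readings.select("sensor-1"))                        # rows come back sorted by clustering key
    print(readings.select("sensor-1", ck_from=15, ck_to=30))  # range within one partition
```

The partition key selects where the rows live, so it has to be given exactly. The clustering key only orders rows inside that partition, which is why range queries on it are cheap.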

As you may realize, you have to design your schema very carefully to support the queries you wish to execute on your data. You will very likely have to create multiple tables to allow all the queries you want to run on the same data. A best practice seems to be to know your data and the queries you will run against it before you model your schema.
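As a rough sketch of what "multiple tables for the same data" means in practice, the following example writes each record into two query-specific structures, one per access pattern. The order domain and all names (`orders_by_customer`, `orders_by_product`, `record_order`) are assumptions made up for illustration, and plain dictionaries stand in for Cassandra tables.

```python
from collections import defaultdict

# One "table" per query: each is keyed by the field that query filters on.
orders_by_customer = defaultdict(list)  # partition key: customer_id
orders_by_product = defaultdict(list)   # partition key: product_id


def record_order(order_id: str, customer_id: str, product_id: str) -> None:
    """Write the same order into both query-specific tables."""
    row = {"order_id": order_id, "customer_id": customer_id, "product_id": product_id}
    orders_by_customer[customer_id].append(row)
    orders_by_product[product_id].append(row)


if __name__ == "__main__":
    record_order("o1", "c1", "p1")
    record_order("o2", "c1", "p2")
    print([r["order_id"] for r in orders_by_customer["c1"]])  # all orders of one customer
    print([r["order_id"] for r in orders_by_product["p2"]])   # all orders of one product
```

The data is duplicated on purpose: writes become slightly more expensive, but every query can be answered from a single partition.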

For huge volumes of incoming data Cassandra seems to be the right choice, but for analytics you should consider using Spark. Some sources describe a multi-datacenter solution in which one datacenter handles all the writes from multiple clients and the data is replicated to a second datacenter that is only read from, for analytic purposes. This second datacenter could then be used for analytics with Spark.
