Welcome to the first part of Distributed Thinking. Here I discuss the first part - When and Where. Please feel free to provide feedback in the comments! In case you missed anything, click here to go to Part 0.
1. Distributed Thinking
When I first started processing large amounts of data for a project I worked on which involved data from TV viewers (A lot of logs that set-top boxes throw up), I didn’t quite understand what “large” amounts of data meant. I had only worked on datasets involving a few thousand records, stretching up to a million at max.
When a colleague of mine wrote a solution to the project using Storm, I didn’t quite understand all the moving parts. I was still learning how to code better than the average fresh-out-of-college-computer-science-student, and everything looked complex and alien to me. I have since discovered that if you focus on something long enough, and tear it down into smaller parts, everything is easy! I hope to bring some of my understanding into this blog to make it easier for people new to this kind of data.
1.1 When to use it?
Suppose you have a data set that you need to perform some actions upon. Nearly all problems involving data can be broken into an
ETL process -
(Image drawn using Draw.io - Check it out, it’s awesome!)
Your data comes from a Data Store, or a Data Warehouse, gets transformed in some way using either Java, Scala or Python job running on one or more machines, and the output gets written into a Data Sink.
When the amount of data coming in from the Data Store is huge, and there’s a concentrated focus on making the
Transform part of the ETL run as smoothly as possible for the “big data”, tools such as Storm and Spark help you manage your computation in a distributed environment.
Do you have data that’s in the order of several GBs or TBs (or PBs) that has to be processed as soon as possible, maybe even instantly? That’s your cue to make your system fit into the ETL model, and write a distributed
Often systems like Spark and Storm take time to start up, along with the overhead of actually setting up a distributed cluster. If the advantages that distributed processing brings do not outweigh the disadvantages of the infrastructure and the small processing overhead that exists, you should probably reconsider - I have found that sometimes just a multithreaded implementation in Java was more than enough.
1.2 Where to start?
Now that you know when it makes sense to use a distributed system, the question is - Where do you begin?
Let’s take a step back and look at our
ETL system. It is divided into 5 parts - the
Data Store or
Data Warehouse (Let’s call it
Extract method (
Transform process (
Load method (
LD) and the
Data Sink (
DS ---(XT)---> TR ---(LD)---> SNK
Focus on each of these components now:
1.2.1 Data Store DS
The DS is the source. There can be several ways in which our system ingests the data. Let’s look at the most common ones:
- Relational Databases
- NoSQL Databases
- Data Streams
- Logs and Flatfiles
These data stores can be broadly split into two types of processing strategies:
- Batch - Databases and flat files
- Real-Time - Streams like Twitter Firehose, Kafka, Message queues, etc.
Traditionally, Spark is more suited for Batch processing, and Storm is suited for real-time processing, but both can do either type of processing, and what you actually end up using is a matter of preference.
1.2.2 Extract XT
Depending on the type of raw data your system has to process, there are a few strategies you can use to
Extract the data:
Use data as-is:
To do this, your
XTmethod would have a connector which connects to the DS and provides a cursor for the data, filters to clean up the data as it arrives from the raw source, and data parsers to convert the data into the structure expected by the
Transformmethods downstream. You would mix and match several parsers and filters here, trying to achieve the fastest, cleanest way to get the data to transform.
Some Data Stores might already provide clean data that can be used as-is, like Twitter or Kafka. If you need to process this in real-time, a Storm solution might make more sense.
Transform data into a smaller, or better local data store:
Your data might not arrive at convenient intervals, or it might make sense to only process your data periodically, and not as it arrives. If this is the case, it makes sense to read the raw data source when the data is made available, and store it locally to be processed when it’s time.
Your secondary store could be a smaller, and more curated (using methods mentioned in (1) above) as you might only want specific parts of the data. You can also use this process to store the data from a raw format such as Flat files and logs on FTP, into a better format, like a Database, or HDFS. A Spark solution might make more sense to process this data.
Now that the data has been extracted from the raw Data Store
DS, you can use it in the
Transform step using either
1.2.3 Transform TR
Now if you need a real-time solution,
Storm may be the way forward. If you need a batched processing solution,
Spark may be the way forward.
Pick whichever one you think would be faster for you to develop, and whichever is easier to maintain.
1.2.4 Load LD
The choice of your Data Sink
SNK would affect the way you write your
LD, but more on that later.
1.2.5 Data Sink SNK
Your final output gets written to the final form from which it’s possible to generate any kind of results you want:
- Report Generation
- Page Generation
- Data Insights
Your output might be to a database, or to files, or to a web service. We’ll discuss more about this later.
Click here to go to the next part - Part 2: How to look at data
Click on any of the links below to go directly to any part:
Part 0: Introduction
Part 1: When and Where
Part 2: How to look at data
Part 3: How to look at data (continued)
Part 4: Processing Data Streams
Part 5: Processing Large Data Chunks