Big Data: What is it?

Big data has been a familiar buzzword for the past couple of years. “Data is the new oil” is a constant refrain. So what exactly is it? What is the big deal? And why is it relevant?

Big data

Big Data is exactly what it sounds like: data that is big, or rather, of a very large and continuously growing volume. A key aspect of Big Data is that it is simply impossible to analyse it with traditional hardware and software and extract meaningful insights within a reasonable time.

The shopping patterns of customers on a particular eCommerce website are an example of big data. The browsing patterns of individuals in a particular demographic are another. Social media sites and the stock markets are huge sources of big data.

We’re living in a world where there is no scarcity of data. Almost every individual on the planet has a device with a whole array of sensors, allowing them to connect to the internet, communicate with their friends, and purchase clothes for their anniversary celebrations. Each and every application on your smartphone collects a ton of data from its users, revealing tiny bits of information about them.

And it is not just our smartphones that collect information. With the advent of IoT, sensors generate data from just about any process, from logistics to agriculture, and from traffic patterns to self-driving cars. A single self-driving car currently generates around 25 GB of data every hour, and this is expected to increase. That may not seem like a big deal, but driven around the clock it works out to over 200 TB of data a year (25 GB × 24 hours × 365 days ≈ 219 TB).

So, in some ways, capturing the data is not much of a problem compared to storing and analysing it.

The 4 Vs of Big Data

Big data is characterized by 4 Vs, namely Volume, Velocity, Variety, and Veracity. 

Volume 

Obviously, what makes big data big data is its sheer volume. The amount of data being generated is measured in exabytes (1 exabyte = 1 million TB), and this volume is what makes it both a problem and an opportunity.

Velocity

Velocity is the rate at which data is generated. A large volume of data is produced continuously, whether it’s user activity on social media or the roughly 6,000 tweets posted every second.

Variety 

Variety is a major challenge when handling Big Data: the data collected is not all of one type, and mixed types are not easy to analyse together. Broadly, Big Data can be classified into structured and unstructured data. Structured data is relatively easy to analyse. It follows a specific pattern or format, and computing systems have become pretty good at handling it. For example, a table of employees with their salaries, ages, and so on is easy to understand, visualise, and draw meaningful conclusions from.

Unstructured data, on the other hand, is difficult to understand and analyse. It is not easy to make sense of the actions of a population on social media, particularly when pictures, videos, and free-form text are all mixed up together.
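
To make the contrast concrete, here is a minimal Python sketch using pandas with made-up employee records (all names and figures are hypothetical), showing how directly structured data can be queried, next to a snippet of unstructured text that would first need parsing or language processing.

```python
import pandas as pd

# Structured data: rows and columns with a fixed schema (hypothetical records)
employees = pd.DataFrame({
    "name":   ["Asha", "Ben", "Chitra"],
    "age":    [29, 41, 35],
    "salary": [52000, 67000, 61000],
})

# Questions like "average salary" or "employees over 30" are one-liners
print(employees["salary"].mean())          # 60000.0
print(employees[employees["age"] > 30])    # filtered rows

# Unstructured data: free-form text mixed with other media has no schema,
# so even a simple question needs parsing, NLP, or image analysis first
post = "Loved the new phone!! battery = meh :( pics here -> [photo1.jpg]"
```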

Veracity

While data collection has become somewhat easier, the collected data may contain inherent errors. Sensors are not always perfect, and the data may not be representative of the complete situation. Veracity refers to how much the data can be trusted.

Benefits of Big Data

If the benefits of big data had to be summed up in one word, it would be optimisation. With more data, you can predict more, which leads to better decisions and increased efficiency.

Large-scale data gives accurate insights into the behaviour of a system, and this can be used to improve efficiency. Collecting data about the number of passengers on every aircraft flying each day to different destinations, combined with weather patterns and diversions, can help airlines plan routes more efficiently. Similarly, data from assembly lines can help identify where the bottlenecks are and give insights into fixing them.

Decisions based on large data sets tend to give better results, simply because they are based on a better understanding. This has been demonstrated many times in drug testing: trials conducted across a large and diverse group give a better idea of the efficacy and side effects of a drug. Large amounts of data for a particular application make it possible to examine just about every possibility and scenario before coming to a conclusion. This is also why Tesla has an advantage over other self-driving car companies: its cars are collecting huge amounts of data every second of every day, encountering even the edge cases.

Another benefit is cost savings. Consider maintenance in a manufacturing unit. One approach is to carry out routine maintenance at regular intervals, but this interrupts production and causes unnecessary delays, since maintenance is carried out whether it is required or not. Another approach is to carry out maintenance only when equipment breaks down, which can mean expensive repairs and costly shutdowns. If you can instead predict when maintenance is actually needed, you get the best of both, and this prediction can be made using big data.
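
As a rough illustration of the idea (not any particular vendor’s system), the sketch below trains a simple scikit-learn classifier on hypothetical sensor readings to flag machines likely to fail soon. The file name and column names are assumptions.

```python
# A minimal predictive-maintenance sketch; all file, column names and data are hypothetical.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Historical sensor logs, each row labelled with whether the machine
# failed within the following week (failed_soon = 1) or not (0)
logs = pd.read_csv("sensor_history.csv")
features = logs[["vibration", "temperature", "run_hours"]]
labels = logs["failed_soon"]

X_train, X_test, y_train, y_test = train_test_split(features, labels, test_size=0.2)

model = RandomForestClassifier(n_estimators=100)
model.fit(X_train, y_train)
print("held-out accuracy:", model.score(X_test, y_test))

# Score fresh readings (held-out rows used here as stand-ins)
# and schedule maintenance only where the predicted risk is high
new_readings = X_test.head(5)
print(model.predict(new_readings))   # 1 = likely to fail soon
```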

Another example of cost saving is the use of warehouses in eCommerce. For quick delivery, goods need to be stored close to potential customers, but stocking the entire catalogue in large quantities at multiple locations would be expensive. Instead, with Big Data analysis, you can predict which products, and in what quantities, should be kept in each warehouse.
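
A very simple version of that prediction can be nothing more than aggregating past orders. The pandas sketch below (file and column names are hypothetical) estimates average weekly demand per product per region, which could then seed stocking levels.

```python
import pandas as pd

# Hypothetical order history: one row per order line
orders = pd.read_csv("orders.csv", parse_dates=["order_date"])

# Average units sold per product, per region, per week
weekly = (
    orders
    .groupby(["region", "product_id", pd.Grouper(key="order_date", freq="W")])["quantity"]
    .sum()
    .groupby(["region", "product_id"])
    .mean()
    .rename("avg_weekly_demand")
)

# Stock each regional warehouse with, say, two weeks of expected demand
stock_plan = (weekly * 2).round().astype(int)
print(stock_plan.head())
```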

Technologies used to handle big data

For storage: Hadoop and MongoDB

MongoDB is a distributed NoSQL database. It is document-oriented, stores data in a JSON-like format, and is often used as the database behind big data applications.
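
For instance, with the official pymongo driver, storing and querying JSON-like documents looks roughly like this (the connection address, database, collection, and field names below are assumptions):

```python
from pymongo import MongoClient

# Connect to a local MongoDB instance (hypothetical address and names)
client = MongoClient("mongodb://localhost:27017/")
db = client["shop"]
events = db["click_events"]

# Documents are flexible JSON-like objects; no fixed schema is required
events.insert_one({
    "user_id": 42,
    "page": "/checkout",
    "device": "mobile",
    "duration_sec": 37,
})

# Query documents with a JSON-style filter
for doc in events.find({"device": "mobile"}):
    print(doc["user_id"], doc["page"])
```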

Hadoop is a collection of software for storing and processing large amounts of data across a network of computers. Its main components are the Hadoop Distributed File System (HDFS), which is the storage layer, and Hadoop YARN, which is responsible for allocating computing resources.
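
Processing on top of HDFS is often expressed as MapReduce jobs. One common route is Hadoop Streaming, which lets the map and reduce steps be plain scripts that read from stdin and write to stdout; as a sketch, the classic word count might look like the Python mapper below (how it is wired to HDFS input and output depends on your installation).

```python
# mapper.py -- emits "word<TAB>1" for every word read from stdin
import sys

for line in sys.stdin:
    for word in line.strip().split():
        print(f"{word}\t1")
```

A matching reducer then sums the counts for each word, relying on Hadoop’s shuffle to deliver the keys in sorted order:

```python
# reducer.py -- sums counts per word (input arrives grouped and sorted by key)
import sys

current_word, count = None, 0
for line in sys.stdin:
    word, value = line.rstrip("\n").split("\t")
    if word == current_word:
        count += int(value)
    else:
        if current_word is not None:
            print(f"{current_word}\t{count}")
        current_word, count = word, int(value)

if current_word is not None:
    print(f"{current_word}\t{count}")
```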

Data mining: Presto and RapidMiner 

Presto is a distributed SQL query engine for Big Data. It is open source, high performance, and works with data sources such as Hadoop and MongoDB. Originally designed at Facebook, it is now community-driven open-source software.
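
As a sketch of what querying through Presto from Python can look like, the snippet below uses the presto-python-client package; the host, catalog, schema, and table name are placeholders for an actual deployment.

```python
import prestodb  # from the presto-python-client package

# Connection details below are assumptions about your Presto coordinator
conn = prestodb.dbapi.connect(
    host="presto.example.com",
    port=8080,
    user="analyst",
    catalog="hive",
    schema="web",
)

cur = conn.cursor()
# Standard SQL, pushed down to the underlying storage (e.g. tables on HDFS)
cur.execute("SELECT page, COUNT(*) AS visits FROM page_views GROUP BY page LIMIT 10")
for page, visits in cur.fetchall():
    print(page, visits)
```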

RapidMiner was developed around 2001 as YALE (Yet Another Learning Environment) and is used for machine learning, deep learning, text mining, and predictive analytics.

Visualisation and analysis: Tableau and Plotly

Visualisation tools like Tableau make it easier even for laypersons to understand the data and draw meaningful conclusions from it, which can then feed into decision making. Plotly is another tool commonly used for Big Data visualisation and analytics.
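
For example, with Plotly’s plotly.express module, an interactive chart from a (made-up) aggregated data set takes only a few lines:

```python
import pandas as pd
import plotly.express as px

# Hypothetical aggregated output of a big data pipeline: daily order counts
daily = pd.DataFrame({
    "date":   pd.date_range("2023-01-01", periods=7, freq="D"),
    "orders": [120, 135, 128, 160, 190, 240, 210],
})

# An interactive line chart that opens in the browser
fig = px.line(daily, x="date", y="orders", title="Orders per day")
fig.show()
```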
