What is Big Data?
As IT and personal computing took over a mostly analog world, the amount of data being generated grew to a size no one could have predicted. But when does data become so big that it counts as Big Data? And what exactly is Big Data? This article is a complete introduction to Big Data and related topics such as Data Science.
What is Big Data?
Big Data is a buzzword, just like Cloud and Data Science, so I always start explaining a buzzword by saying what it is not. Being a buzzword, Big Data won't automatically solve, without effort, problems you thought were too complex to tackle.
So Big Data is just like normal data, but big. Then when does data become big? Traditionally, three characteristics (though there are more) define Big Data:
- volume: the amount of data is big, often huge (petabyte or even exabyte scale)
- variety: the data contains information stored in different ways (structured, unstructured, semi-structured)
- velocity: the speed at which the data is generated is high (e.g. real-time sensor logs)
Why is Big Data important?
In the information age every piece of information matters, and the more the better. Big Data technologies shine in this field because they can handle huge amounts of data while supporting queries that would not be possible with traditional databases.
Thanks to Big Data it is possible to learn and understand new patterns hidden within data. These patterns can mean anything or nothing, but more often than not even seemingly worthless data hides precious information. As an example, think of market sales and social media: one could envision algorithms that predict future sales by looking at sales history, or that spot fashion trends based on social media posts.
Although Big Data may not be useful without skilled professionals and investment, the revenue generated by analyzing this data usually pays off in the long run. Big Data reached its popularity peak around 2015-2016, and while the hype has since faded, most companies have embraced data collection and analysis.
Structured vs Unstructured Data (and semi-structured)
An important classification of data is the presence or the absence of a structure:
- structured data has a well-defined structure; most of the data from past decades falls into this category, with data stored in a Relational Database as a prominent example.
- unstructured data doesn't have a structure; documents and emails are prominent examples.
- semi-structured data includes parts that have a structure and parts that don't; JSON and XML files are typical examples.
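The distinction above can be sketched with a few made-up records (all names and values here are illustrative, not from any real dataset):

```python
import json

# Structured: fixed schema, every record has the same fields (like table rows).
structured = [
    {"id": 1, "name": "Alice", "age": 34},
    {"id": 2, "name": "Bob", "age": 29},
]

# Semi-structured: self-describing (e.g. JSON), but fields vary between records.
semi_structured = json.loads(
    '[{"id": 1, "name": "Alice", "tags": ["vip"]},'
    ' {"id": 2, "email": "bob@example.com"}]'
)

# Unstructured: no schema at all (free text, images, audio, ...).
unstructured = "Hi team, attached is the Q3 report we discussed yesterday."

# Every structured record shares the same set of keys...
assert all(set(record) == set(structured[0]) for record in structured)
# ...while semi-structured records may not.
assert set(semi_structured[0]) != set(semi_structured[1])
```

Note how the semi-structured records still carry their own field names, which is what makes them partially machine-readable, while the email body carries no schema at all.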
Structured data has traditionally been handled by Relational Databases, while dealing with unstructured data requires different techniques and usually more computational power. Most of the data generated and stored in past decades was structured, but most of the data generated nowadays (think of IoT, smartphones, sensors) is unstructured. Big Data mostly deals with unstructured data.
Big Data and Relational Databases
Relational Databases were the de facto storage solution for structured business data. With the advent of Big Data, RDBMS (Relational Database Management Systems) couldn't keep up with the growing amount of data, nor with unstructured data. The first form of non-relational data storage was the so-called NoSQL database, which allows data to be stored in structures other than tables.
The real leap only happened when Hadoop started gaining momentum. Hadoop is a framework composed of HDFS, a distributed file system capable of handling petabyte-scale data, and MapReduce, a programming model capable of executing complex computations across huge datasets.
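The MapReduce model can be illustrated with the classic word-count example. In Hadoop the map and reduce functions run distributed across a cluster; the sketch below just simulates the three phases (map, shuffle, reduce) locally on two tiny documents, with all names and data being illustrative:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Emit a (key, value) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group all emitted values by key, as the framework does between
    # the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values emitted for one key into a single result.
    return key, sum(values)

documents = ["big data is big", "data is everywhere"]
pairs = chain.from_iterable(map_phase(doc) for doc in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The appeal of the model is that `map_phase` and `reduce_phase` are pure functions over independent chunks of data, so the framework can parallelize them across as many machines as the dataset requires.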
Nowadays Big Data solutions often work alongside Relational Databases and Data Warehouses. The closest link between the two worlds is the so-called Wide Column Store, sometimes also called a NoSQL wide-column database. These NoSQL databases resemble traditional Relational Databases, but column names and their formats may vary from row to row.
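A toy sketch of the wide-column idea: every row is addressed by a row key, but each row can hold a different set of columns. Real wide-column stores such as Cassandra or HBase add column families, timestamps and distribution on top; the row keys and column names below are made up:

```python
# Rows keyed by a row key; each row carries only the columns it needs.
table = {
    "user:1": {"name": "Alice", "email": "alice@example.com"},
    "user:2": {"name": "Bob", "last_login": "2021-06-01"},  # no email column
}

def get_cell(row_key, column, default=None):
    # A missing column is simply absent, not a NULL value as in an RDBMS.
    return table.get(row_key, {}).get(column, default)

print(get_cell("user:1", "email"))  # alice@example.com
print(get_cell("user:2", "email"))  # None: the column does not exist here
```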
Big Data and Data Science
Big Data has often been compared to oil, as it holds intrinsic value. This value, however, is often inaccessible, and that's where Data Science comes in. Alongside traditional Data Mining techniques, Machine Learning and Artificial Intelligence algorithms help companies locate and leverage the value hidden within Big Data.
While huge datasets are not strictly required to achieve good results, Machine Learning algorithms benefit from larger ones, and managing datasets that are both vast and unstructured is where Big Data technologies step into the process. ML models are currently gaining momentum, which can lead to the belief that the model matters more than the data, but it is hard to envision a Machine Learning algorithm working without data.
Big Data processes and related terms
The first term you may encounter is dataset. Traditionally datasets are associated with tabular data, but that is no longer the case, and the term can also refer to collections of unstructured data.
Data Mining refers to techniques used to recognize patterns within data. Traditionally Regression Analysis has been used to perform this task, but modern techniques such as Machine Learning have revolutionized the field.
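As a minimal example of the traditional workhorse mentioned above, here is a simple linear regression fitted with ordinary least squares; the data points are made up and roughly follow y = 2x:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares: slope = covariance(x, y) / variance(x).
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))  # 1.97 0.11
```

The fitted line recovers a pattern (a near-doubling trend) from the raw points, which is exactly what pattern recognition over much larger datasets generalizes.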
Data Analysis refers to a process that aims to extract, clean, transform, model and sometimes visualize data to derive insights. While Data Mining analyzes data to discover patterns, Data Analysis extracts useful information from existing data.
Data Cleaning is the process of clearing the data of potential blockers such as missing values or duplicate entries.
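A small cleaning sketch covering the two issues just mentioned, on made-up records: drop exact duplicates and replace missing values with an explicit placeholder.

```python
records = [
    {"id": 1, "city": "Rome"},
    {"id": 1, "city": "Rome"},  # exact duplicate
    {"id": 2, "city": None},    # missing value
    {"id": 3, "city": "Milan"},
]

seen, cleaned = set(), []
for record in records:
    fingerprint = (record["id"], record["city"])
    if fingerprint in seen:
        continue  # duplicate entry: keep only the first occurrence
    seen.add(fingerprint)
    # Replace a missing city with an explicit placeholder.
    cleaned.append({**record, "city": record["city"] or "unknown"})

print(cleaned)
# [{'id': 1, 'city': 'Rome'}, {'id': 2, 'city': 'unknown'}, {'id': 3, 'city': 'Milan'}]
```

Real pipelines make subtler choices here (imputing values from other columns rather than using a placeholder, fuzzy rather than exact duplicate matching), but the shape of the step is the same.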
Data Modeling refers to the process used to create a model that represents or can store data. It is often associated with RDBMS.
ETL stands for Extract, Transform, Load. It is a process that copies data from one or more sources (Extract) to a target system (Load) that represents the data differently from the sources (Transform). Another variant is E-LT or ELT, where the transformation happens after loading. ETL is traditionally employed with Data Warehouses, while ELT is more common in Big Data scenarios dealing with unstructured data.
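The three phases can be sketched as a toy pipeline: extract rows from a JSON "source", transform them (trim and normalize names, parse prices into numbers), then load them into an in-memory SQLite table standing in for the target system. The source data, table name and fields are all illustrative:

```python
import json
import sqlite3

def extract():
    # Stand-in for reading from a real source system or file.
    raw = '[{"name": " Alice ", "price": "10.5"}, {"name": "bob", "price": "3"}]'
    return json.loads(raw)

def transform(rows):
    # Reshape the data for the target: clean strings, type the numbers.
    return [{"name": r["name"].strip().title(), "price": float(r["price"])}
            for r in rows]

def load(rows):
    # Stand-in for the target system (a Data Warehouse table, say).
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE products (name TEXT, price REAL)")
    db.executemany("INSERT INTO products VALUES (:name, :price)", rows)
    return db

db = load(transform(extract()))
print(db.execute("SELECT name, price FROM products").fetchall())
# [('Alice', 10.5), ('Bob', 3.0)]
```

In the ELT variant the `load` step would run before `transform`: the raw JSON would land in the target first and be reshaped there, which is why ELT suits Big Data systems that can cheaply store unstructured input.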