What is Big Data?
As IT and personal computing took over a mostly analog world, the amount of data being generated grew to a size no one could have predicted. But when does data become so big that it counts as Big Data? And what exactly is Big Data? This article is a complete introduction to Big Data and related topics such as Data Science.
What is Big Data?
Big Data is a buzzword, just like Cloud and Data Science, so I always start explaining a buzzword by saying what it is not. Being a buzzword, Big Data won't automatically solve, without effort, problems you thought were too complex to tackle.
So Big Data is just like normal data, but big. Then when does data become big? Traditionally, three characteristics (though there are more) define Big Data:
- volume: the amount of data is big, often huge (petabyte or even exabyte scale)
- variety: the data contains information stored in different ways (structured, unstructured, semi-structured)
- velocity: the speed at which the data is generated is high (e.g. real-time sensor logs)
Why is Big Data important?
In the information age every piece of information matters, and the more the better. Big Data technologies shine in this field because they can handle huge amounts of data while supporting queries that would not be possible with traditional databases.
Thanks to Big Data it is possible to learn and understand new patterns hidden within data. These patterns can mean anything or nothing, but more often than not even seemingly worthless data hides precious information. As an example, think of market sales and social media: one could envision algorithms that predict future sales by looking at sales history, or that spot fashion trends based on social media posts.
Although Big Data may not be useful without skilled professionals and investment, the revenue generated by analyzing this data usually pays off in the long run. Big Data reached its popularity peak around 2015-2016, and while the hype has since faded, most companies have embraced data collection and analysis.
Structured vs Unstructured Data (and semi-structured)
An important classification of data is the presence or the absence of a structure:
- structured data has a well-defined structure; most of the data from past decades falls into this category, with data stored in a Relational Database as a prominent example.
- unstructured data doesn't have a structure; documents and emails are prominent examples.
- semi-structured data includes parts that have a structure and parts that don't; JSON and XML files are typical examples.
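The distinction above can be sketched with a few made-up records (all names and values here are illustrative, not from any real dataset):

```python
import json

# Structured: fixed schema, every record has the same fields (like table rows).
structured = [
    {"id": 1, "name": "Alice", "age": 34},
    {"id": 2, "name": "Bob", "age": 29},
]

# Semi-structured: self-describing (e.g. JSON), but fields vary between records.
semi_structured = json.loads(
    '[{"id": 1, "name": "Alice", "tags": ["vip"]},'
    ' {"id": 2, "email": "bob@example.com"}]'
)

# Unstructured: no schema at all (free text, images, audio, ...).
unstructured = "Hi team, attached is the Q3 report we discussed yesterday."

# Every structured record shares the same set of keys...
assert all(set(record) == set(structured[0]) for record in structured)
# ...while semi-structured records may not.
assert set(semi_structured[0]) != set(semi_structured[1])
```

Note how the semi-structured records still carry their own field names, which is what makes them partially machine-readable, while the email body carries no schema at all.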
Structured data has traditionally been handled by Relational Databases, while dealing with unstructured data requires different techniques and usually more computational power. Most of the data generated and stored in past decades was structured, but most of the data generated nowadays (think of IoT, smartphones, sensors) is unstructured. Big Data mostly deals with unstructured data.
Big Data and Relational Databases
Relational Databases were the de facto storage solution for structured business data. With the advent of Big Data, RDBMS (Relational Database Management Systems) couldn't keep up with the growing amount of data, nor with unstructured data. The first form of non-relational data storage was the so-called NoSQL database, which allows data to be stored in structures other than tables.
The real leap only happened when Hadoop started gaining momentum. Hadoop is a framework composed of HDFS, a distributed file system capable of handling petabyte-scale data, and MapReduce, a programming model capable of executing complex computations across huge datasets.
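The MapReduce model can be illustrated with the classic word-count example. In Hadoop the map and reduce functions run distributed across a cluster; the sketch below just simulates the three phases (map, shuffle, reduce) locally on two tiny documents, with all names and data being illustrative:

```python
from collections import defaultdict
from itertools import chain

def map_phase(document):
    # Emit a (key, value) pair for every word in the document.
    return [(word, 1) for word in document.split()]

def shuffle(pairs):
    # Group all emitted values by key, as the framework does between
    # the map and reduce phases.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_phase(key, values):
    # Combine all values emitted for one key into a single result.
    return key, sum(values)

documents = ["big data is big", "data is everywhere"]
pairs = chain.from_iterable(map_phase(doc) for doc in documents)
counts = dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())
print(counts)  # {'big': 2, 'data': 2, 'is': 2, 'everywhere': 1}
```

The appeal of the model is that `map_phase` and `reduce_phase` are pure functions over independent chunks of data, so the framework can parallelize them across as many machines as the dataset requires.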
Nowadays Big Data solutions often work alongside Relational Databases and Data Warehouses. The closest link between the two worlds is the so-called Wide Column Store, sometimes also called a NoSQL wide-column database. These NoSQL databases resemble traditional Relational Databases, but column names and their formats may vary from row to row.
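A toy sketch of the wide-column idea: every row is addressed by a row key, but each row can hold a different set of columns. Real wide-column stores such as Cassandra or HBase add column families, timestamps and distribution on top; the row keys and column names below are made up:

```python
# Rows keyed by a row key; each row carries only the columns it needs.
table = {
    "user:1": {"name": "Alice", "email": "alice@example.com"},
    "user:2": {"name": "Bob", "last_login": "2021-06-01"},  # no email column
}

def get_cell(row_key, column, default=None):
    # A missing column is simply absent, not a NULL value as in an RDBMS.
    return table.get(row_key, {}).get(column, default)

print(get_cell("user:1", "email"))  # alice@example.com
print(get_cell("user:2", "email"))  # None: the column does not exist here
```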
Big Data and Data Science
Big Data has often been compared to oil, as it holds intrinsic value. This value, however, is often inaccessible, and that's where Data Science comes in. Alongside traditional Data Mining techniques, Machine Learning and Artificial Intelligence algorithms help companies locate and leverage the value hidden within Big Data.
While huge datasets are not strictly required to achieve good results, Machine Learning algorithms benefit from larger ones, and managing datasets that are both vast and unstructured is where Big Data technologies step into the process. ML models are currently gaining momentum, which can lead to the belief that the model matters more than the data, but it is hard to envision a Machine Learning algorithm working without data.
Big Data processes and related terms
The first term you may encounter is dataset. Traditionally datasets are associated with tabular data, but that is no longer the case, and the term can also refer to collections of unstructured data.
Data Mining refers to techniques used to recognize patterns within data. Traditionally Regression Analysis has been used to perform this task, but modern techniques such as Machine Learning have revolutionized the field.
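As a minimal example of the traditional workhorse mentioned above, here is a simple linear regression fitted with ordinary least squares; the data points are made up and roughly follow y = 2x:

```python
xs = [1.0, 2.0, 3.0, 4.0, 5.0]
ys = [2.1, 3.9, 6.2, 8.0, 9.9]

n = len(xs)
mean_x = sum(xs) / n
mean_y = sum(ys) / n

# Ordinary least squares: slope = covariance(x, y) / variance(x).
slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / \
        sum((x - mean_x) ** 2 for x in xs)
intercept = mean_y - slope * mean_x

print(round(slope, 2), round(intercept, 2))  # 1.97 0.11
```

The fitted line recovers a pattern (a near-doubling trend) from the raw points, which is exactly what pattern recognition over much larger datasets generalizes.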
Data Analysis refers to a process that aims to extract, clean, transform, model and sometimes visualize data to derive insights. While Data Mining analyzes data to discover patterns, Data Analysis extracts useful information from existing data.
Data Cleaning is the process of clearing the data of potential blockers such as missing values or duplicate entries.
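A small cleaning sketch covering the two issues just mentioned, on made-up records: drop exact duplicates and replace missing values with an explicit placeholder.

```python
records = [
    {"id": 1, "city": "Rome"},
    {"id": 1, "city": "Rome"},  # exact duplicate
    {"id": 2, "city": None},    # missing value
    {"id": 3, "city": "Milan"},
]

seen, cleaned = set(), []
for record in records:
    fingerprint = (record["id"], record["city"])
    if fingerprint in seen:
        continue  # duplicate entry: keep only the first occurrence
    seen.add(fingerprint)
    # Replace a missing city with an explicit placeholder.
    cleaned.append({**record, "city": record["city"] or "unknown"})

print(cleaned)
# [{'id': 1, 'city': 'Rome'}, {'id': 2, 'city': 'unknown'}, {'id': 3, 'city': 'Milan'}]
```

Real pipelines make subtler choices here (imputing values from other columns rather than using a placeholder, fuzzy rather than exact duplicate matching), but the shape of the step is the same.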
Data Modeling refers to the process used to create a model that represents or can store data. It is often associated with RDBMS.
ETL stands for Extract, Transform, Load. It is a process that copies data from one or more sources (Extract) to a target system (Load) that represents the data differently from the sources (Transform). Another variant is E-LT or ELT, where the transformation happens after loading. ETL is traditionally employed with Data Warehouses, while ELT is more common in Big Data scenarios dealing with unstructured data.
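The three phases can be sketched as a toy pipeline: extract rows from a JSON "source", transform them (trim and normalize names, parse prices into numbers), then load them into an in-memory SQLite table standing in for the target system. The source data, table name and fields are all illustrative:

```python
import json
import sqlite3

def extract():
    # Stand-in for reading from a real source system or file.
    raw = '[{"name": " Alice ", "price": "10.5"}, {"name": "bob", "price": "3"}]'
    return json.loads(raw)

def transform(rows):
    # Reshape the data for the target: clean strings, type the numbers.
    return [{"name": r["name"].strip().title(), "price": float(r["price"])}
            for r in rows]

def load(rows):
    # Stand-in for the target system (a Data Warehouse table, say).
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE products (name TEXT, price REAL)")
    db.executemany("INSERT INTO products VALUES (:name, :price)", rows)
    return db

db = load(transform(extract()))
print(db.execute("SELECT name, price FROM products").fetchall())
# [('Alice', 10.5), ('Bob', 3.0)]
```

In the ELT variant the `load` step would run before `transform`: the raw JSON would land in the target first and be reshaped there, which is why ELT suits Big Data systems that can cheaply store unstructured input.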