What is Data Science?

by mark · Published 8 January 2020 · Updated 8 January 2020

Data Science is probably going to be one of the hottest IT trends of the current decade, but what exactly is Data Science? How does it relate to Machine Learning? How does Big Data fit into it? Do you need to be a programmer to become a Data Scientist? And what is a Data Scientist in the first place? Answering these and other questions is the main focus of this article.

What is Data Science?

Data Science is a buzzword, just like Cloud and Big Data. I always start explaining a buzzword by telling others what it is not: being a buzzword, Data Science won’t automatically and effortlessly solve problems you thought were too complex to solve.

Data Science is a concept created to represent a field that employs statistics, data analysis and machine learning to solve problems. Contrary to popular belief, Data Science is not (yet?) a form of science, nor a branch of Computer Science or Artificial Intelligence. In practice, Data Science uses statistics and machine learning to find and validate correlations (and infer causality) in data, with varying degrees of confidence.
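To make "finding a correlation" a little more concrete, here is a minimal sketch in Python that computes the Pearson correlation coefficient between two variables. The function name and the sample figures (property sizes and sale prices) are made up for illustration; a value close to 1 or -1 suggests a strong linear relationship, which on its own says nothing about causality.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # How much the two variables move together around their means.
    covariation = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    # How much each variable moves around its own mean.
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariation / (spread_x * spread_y)


# Made-up data: property size (square metres) and sale price (thousands).
sizes = [50, 70, 90, 110, 130]
prices = [150, 200, 260, 310, 370]
r = pearson_correlation(sizes, prices)
print(f"correlation between size and price: {r:.3f}")
```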

So what kind of problems does Data Science try to solve? An example makes it clear. Imagine you own a real estate business that spans multiple cities. A problem you will surely face is: “At what price should I sell property X to maximize my profit?”. Since the answer depends on many factors, such as market conditions and the location of the property, the “old” solution was to ask an expert. Experts appraise the value of the property, usually drawing on similar previous sales; they factor in as many variables as they can (markets, locations), weigh them against their overall experience and produce a price (the answer to the problem). The trouble with this approach is that it is highly subjective: many experts will give many different answers. Data Science can tackle this problem using Machine Learning and a large enough set of data, becoming a “digital expert”.

Data Science and Machine Learning (AI, ML, DL)

The principal tool of Data Science is Machine Learning, with Deep Learning still being explored as a means to solve Data Science problems. Machine Learning uses statistical methods and computers to run algorithms that learn from data, a concept that can be compared to experience. An ML algorithm can combine many previously known models, and the result is a new model. As the model is trained against data it acquires experience, and once its training is over the model can make predictions. There is quite a bit more to this process, but let’s leave it for another time.
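The train-then-predict cycle can be sketched with a deliberately tiny model. The class below is a toy and not a real ML algorithm: it "learns" the average sale price per city and predicts that average afterwards. The class name, city names and prices are all hypothetical, chosen only to show the two phases.

```python
from collections import defaultdict


class AveragePriceModel:
    """Toy model: learns the average sale price per city, then predicts it."""

    def __init__(self):
        self.average_by_city = {}
        self.overall_average = None

    def fit(self, sales):
        """Training phase: the model 'acquires experience' from (city, price) pairs."""
        if not sales:
            raise ValueError("cannot train on an empty dataset")
        prices_by_city = defaultdict(list)
        for city, price in sales:
            prices_by_city[city].append(price)
        self.average_by_city = {
            city: sum(prices) / len(prices)
            for city, prices in prices_by_city.items()
        }
        all_prices = [price for _, price in sales]
        self.overall_average = sum(all_prices) / len(all_prices)
        return self

    def predict(self, city):
        """Prediction phase: fall back to the overall average for unseen cities."""
        if self.overall_average is None:
            raise RuntimeError("the model must be trained before predicting")
        return self.average_by_city.get(city, self.overall_average)


# Made-up past sales: (city, price in thousands).
past_sales = [("Milan", 300), ("Milan", 340), ("Turin", 180), ("Turin", 220)]
model = AveragePriceModel().fit(past_sales)
print(model.predict("Milan"))
print(model.predict("Rome"))  # a city the model has never seen
```

Real models generalise far better than a per-city average, but the shape of the workflow (fit on past data, then predict on new input) stays the same.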

Returning to the previous real estate example, you could hire a Data Scientist to develop a model that makes predictions based on the company’s data. The more data your company has acquired, the more the model will benefit. The Data Scientist will use the data produced by your company to create and train an ML model that needs to be as accurate as possible in order to optimize the results and, ultimately, your profit.

Machine Learning deals with many kinds of problems; the two most common are Classification Problems and Regression Problems.

In Classification Problems you have a number of classes (depending on the problem) and many elements belonging to those classes. Then a new element with an unknown class comes in and you wonder: “What class does this element belong to?”. That’s a Classification Problem.
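As a rough illustration of a Classification Problem, the sketch below uses a k-nearest-neighbours vote, one of the simplest classification techniques: the new element gets the class that is most common among the known elements closest to it. The property features (size and number of rooms) and the two classes are invented for the example.

```python
import math
from collections import Counter


def classify(new_point, labelled_points, k=3):
    """Return the majority class among the k labelled points closest to new_point."""
    if k < 1 or k > len(labelled_points):
        raise ValueError("k must be between 1 and the number of labelled points")
    # Sort the known elements by Euclidean distance from the new element.
    by_distance = sorted(
        labelled_points, key=lambda item: math.dist(item[0], new_point)
    )
    nearest_labels = [label for _, label in by_distance[:k]]
    return Counter(nearest_labels).most_common(1)[0][0]


# Made-up elements: (size in square metres, rooms) with a known class.
known_properties = [
    ((45, 2), "apartment"),
    ((55, 2), "apartment"),
    ((60, 3), "apartment"),
    ((150, 5), "house"),
    ((180, 6), "house"),
    ((200, 7), "house"),
]

print(classify((65, 3), known_properties))
print(classify((170, 6), known_properties))
```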

In Regression Problems you don’t have classes; instead you have a dependent variable and many other variables that may or may not influence it. In the previous example the price was the dependent variable, and the other variables might include the location and the age of the property. Predicting the price based on such variables is a Regression Problem.
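A minimal sketch of a Regression Problem, assuming a single explanatory variable (the age of the property) and a straight-line relationship fitted with ordinary least squares. The figures are invented and lie exactly on a line to keep the arithmetic easy to follow; real data is noisier and usually involves many variables.

```python
def fit_line(xs, ys):
    """Ordinary least squares with one variable: return (slope, intercept)."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of the same length (at least 2 values)")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    variation_x = sum((x - mean_x) ** 2 for x in xs)
    if variation_x == 0:
        raise ValueError("cannot fit a line when all x values are identical")
    slope = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys)) / variation_x
    intercept = mean_y - slope * mean_x
    return slope, intercept


# Made-up data: age of the property (years) and sale price (thousands).
ages = [0, 10, 20, 40]
prices = [400, 370, 340, 280]

slope, intercept = fit_line(ages, prices)
predicted_price = slope * 30 + intercept  # estimate for a 30-year-old property
print(f"price = {slope:.1f} * age + {intercept:.1f}")
print(f"predicted price at 30 years: {predicted_price:.1f}")
```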

Data Science and Big Data

If Machine Learning is the principal tool of Data Science, data is the raw material. Algorithms are bound by the resources of the physical machines they run on, and ML models are no exception. On top of that, most of the time Data Science deals with unstructured data rather than structured data. This leads us to Big Data.

Big Data is the field that deals with data that is too big, too diverse and growing too fast to be accessed in a timely manner. While the real estate data owned by a company may not be that big, imagine thousands of sensors distributed across a car factory and the volume of data they produce in a single process.
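One basic idea behind handling data that does not fit in memory is to process it incrementally rather than loading it all at once. The sketch below is a single-machine illustration only (real Big Data tools also distribute the work across many machines): it computes the mean of a stream of sensor readings while holding just one batch at a time. The `sensor_batches` generator is a hypothetical stand-in for a real sensor feed.

```python
def running_mean(batches):
    """Compute the mean of all readings while holding only one batch in memory."""
    count = 0
    total = 0.0
    for batch in batches:
        count += len(batch)
        total += sum(batch)
    if count == 0:
        raise ValueError("no readings received")
    return total / count


def sensor_batches():
    """Hypothetical stand-in for a sensor feed: yields readings one batch at a time."""
    yield [20.0, 21.0, 19.0]
    yield [22.0, 18.0]
    yield [20.0]


mean_temperature = running_mean(sensor_batches())
print(mean_temperature)
```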

While larger datasets are not strictly required to achieve good results, ML models generally benefit from them. To manage datasets that are vast and unstructured, Big Data technologies step into the process. Nowadays ML models are gaining momentum, which can lead to the belief that the model is more important than the data, but it is hard to envision a model working on no data. In fact, it is now widely accepted that data has intrinsic value, and the processes performed in Data Analysis (e.g. Data Mining) aim to discover that hidden value.

A Data Scientist and its skill set

“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.” (Josh Wills)

I think the quote above neatly explains what a Data Scientist is, and should be. A good Data Scientist needs to understand mathematics and have an in-depth knowledge of statistics, especially Bayesian statistics and Statistical Inference.

On top of being skilled in statistics and mathematics, a good Data Scientist must also be highly proficient in computer programming. While programming experience is not strictly required, developing an ML model without “traditional” programming experience is very unlikely to happen.

To help Data Scientists on their journey there are many programming languages and frameworks, and it is essential to be able to work proficiently with at least one of each. Nowadays the three most popular languages for Data Science are Python, R and Scala. Popular frameworks are TensorFlow, scikit-learn and Spark MLlib.

Another nice-to-have skill is the ability to manage data at scale, although this is more of a “Data Engineer” task. Understanding concepts such as distributed computing, distributed file systems, horizontal scaling and pipelines will surely help Data Scientists be more productive and efficient in their work.

Image courtesy of Viktor Hanacek

Author

mark

The IT guy with a slight look of boredom in his eyes. Freelancer. Current interests: Kubernetes, Tensorflow, shiny new things.

