What is Data Science?
Data Science is likely to be one of the hottest IT trends of the current decade, but what exactly is Data Science? How does it relate to Machine Learning? And how does Big Data fit into it? Do you need to be a programmer to become a Data Scientist? And what is a Data Scientist in the first place? Answering these and other questions is the main focus of this article.
What is Data Science?
Data Science is a buzzword, just like Cloud and Big Data. I always start explaining a buzzword by telling others what it is not: being a buzzword, Data Science won’t automatically and effortlessly solve problems you thought were too complex to solve.
Data Science is a term coined to describe a field that employs statistics, data analysis and machine learning to solve problems. Contrary to popular belief, Data Science is not (yet?) a form of science, nor a branch of Computer Science or Artificial Intelligence. In practice, Data Science uses statistics and machine learning to find and measure correlations in data (and, where possible, infer causality), with varying degrees of confidence.
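To make the idea of measuring a correlation concrete, here is a minimal sketch that computes the Pearson correlation coefficient using only the Python standard library. The function name and the figures (floor areas and sale prices) are made up for illustration. A coefficient close to 1 or -1 suggests a strong linear relationship, but on its own it neither proves causality nor tells you how confident you can be in the result.

```python
import math


def pearson_correlation(xs, ys):
    """Return the Pearson correlation coefficient of two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    # Sum of the products of the deviations from the means (unnormalized covariance).
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined when one variable is constant")
    return cov / math.sqrt(var_x * var_y)


# Hypothetical data: floor area in square meters and sale price in thousands.
areas = [50, 65, 80, 100, 120]
prices = [150, 190, 240, 310, 355]

r = pearson_correlation(areas, prices)
print(f"Correlation between area and price: {r:.3f}")
```

With real data a Data Scientist would also test whether such a value could have arisen by chance, which is where the "varying degrees of confidence" come in.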
So what kind of problems does Data Science try to solve? An example makes it clear. Imagine you own a real estate business that spans multiple cities. A problem you will surely face is: “For how much should I sell property X to maximize my profit?”. Since the answer depends on many factors, such as market conditions, the location of the property and many others, the “old” solution was to ask an expert. Experts appraise the property, usually starting from similar previous sales; they factor in as many variables as they can (market, location), weigh them against their overall experience, and produce a price (the answer to the problem). The problem with this approach is that it is highly subjective: different experts will give different answers. Data Science can tackle this problem using Machine Learning and a large enough set of data, becoming a “digital expert”.
Data Science and Machine Learning (AI, ML, DL)
The principal tool of Data Science is Machine Learning, while Deep Learning is still being explored as a means to solve Data Science problems. Machine Learning uses statistical methods and computing power to build algorithms that learn from data, a process that can be compared to gaining experience. An ML algorithm can combine several previously known models, and the result is a new model. As the model is trained on data it acquires experience, and once training is over the model can make predictions. There is quite a bit more to the process than this, but let’s leave that for another time.
Returning to the real estate example, you could hire a Data Scientist to develop a model that makes predictions based on the company’s data. The more data your company has acquired, the more the model will benefit. The Data Scientist will use that data to create and train an ML model that needs to be as accurate as possible in order to optimize the results and, ultimately, your profit.
Machine Learning deals with many kinds of problems; the two most common are Classification Problems and Regression Problems.
In Classification Problems you have a number of classes (depending on the problem) and many elements whose class is already known. Then a new element with an unknown class comes in and you wonder: “What class does this element belong to?”. That’s a Classification Problem.
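As a rough illustration of a Classification Problem, the sketch below uses k-nearest neighbors, one of the simplest classification techniques: a new element gets the class that is most common among the known elements closest to it. The technique is my choice for the example, and the function name, the features (floor area and number of rooms) and the data are all made up.

```python
import math
from collections import Counter


def knn_classify(training_data, new_point, k=3):
    """Classify new_point by majority vote among its k nearest labeled neighbors.

    training_data is a list of (features, label) pairs, where features is a
    tuple of numbers with the same length as new_point.
    """
    # Sort the known elements by Euclidean distance to the new element.
    neighbors = sorted(
        training_data,
        key=lambda item: math.dist(item[0], new_point),
    )[:k]
    # Count the labels of the k closest elements and return the most common one.
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


# Hypothetical properties: (floor area in square meters, number of rooms).
known_properties = [
    ((45, 2), "apartment"),
    ((60, 3), "apartment"),
    ((55, 2), "apartment"),
    ((150, 6), "house"),
    ((180, 7), "house"),
    ((130, 5), "house"),
]

print(knn_classify(known_properties, (70, 3)))
print(knn_classify(known_properties, (140, 6)))
```

In a real project the features would usually be rescaled first, since here the floor area dominates the distance simply because its numbers are larger.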
In Regression Problems you don’t have classes; instead you have a dependent variable and many other variables that may or may not influence it. In the previous example the price was the dependent variable, and the other variables might be the location and the age of the property. Predicting the price based on such variables is a Regression Problem.
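A Regression Problem can be sketched in a similar way. The example below fits a straight line to past sales with ordinary least squares, using a single variable (floor area) to predict the price. It is a deliberately simplified sketch: the function names and the data are made up, and a real model would use many more variables.

```python
def fit_line(xs, ys):
    """Fit y = intercept + slope * x by ordinary least squares."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length with at least two points")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are identical")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Predict the dependent variable for a new value of x."""
    return intercept + slope * x


# Hypothetical past sales: floor area in square meters, price in thousands.
sold_areas = [50, 65, 80, 100, 120]
sold_prices = [150, 190, 240, 310, 355]

# "Training": learn the line from past sales.
slope, intercept = fit_line(sold_areas, sold_prices)

# "Prediction": estimate the price of a property that has not been sold yet.
estimate = predict(slope, intercept, 90)
print(f"Estimated price for 90 square meters: {estimate:.1f} thousand")
```

The two steps mirror the training and prediction phases of any ML model: the line is learned from data the company already owns, then applied to a property it has never seen.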
Data Science and Big Data
If Machine Learning is the principal tool of Data Science, data is the raw material. Algorithms are bound by the resources of the physical machines they run on, and ML models are no exception. On top of that, most of the time Data Science deals with unstructured data rather than structured data. This leads us to Big Data.
Big Data is the field that deals with data that is too large, too diverse and too fast-growing to be stored and accessed in a timely manner with traditional tools. While the real estate data owned by a company may not be that big, imagine thousands of sensors distributed across a car factory and the volume of data they produce in a single process.
While larger datasets are not strictly required to achieve good results, ML models will benefit from them. To manage datasets that are vast and unstructured, Big Data technologies step into the process. Nowadays ML models are gaining momentum, which can lead to the belief that the model is more important than the data, but it is hard to envision a model working on no data. In fact, it is now widely accepted that data has intrinsic value, and the processes performed in Data Analysis (e.g. Data Mining) aim to discover that hidden value.
The Data Scientist and their skill set
“A data scientist is someone who is better at statistics than any software engineer and better at software engineering than any statistician.” (Josh Wills)
I think the quote above is a pretty self-explanatory description of what a Data Scientist is, and should be. A good Data Scientist needs to understand mathematics and have in-depth knowledge of statistics, especially Bayesian statistics and Statistical Inference.
On top of being skilled in statistics and mathematics, a good Data Scientist must also be a skilled programmer. While programming is not strictly required, developing an ML model without “traditional” programming experience is very unlikely to happen.
There are many programming languages and frameworks to help the Data Scientist along the way, and it is essential to be able to work proficiently with at least one of each. Nowadays the three most popular languages for Data Science are Python, R and Scala. Popular frameworks are TensorFlow, scikit-learn and Spark MLlib.
Another nice-to-have skill is the ability to manage data at scale, although this is more of a “Data Engineer” task. Understanding concepts such as distributed computing and distributed file systems, horizontal scaling and pipelines will surely help the Data Scientist be more productive and efficient in their work.