Data Science 101

Hope y’all are doing well. The format of this post is similar to an earlier one I did on a different topic, Internet Radio Station 101, where I went through the workings of an Internet radio station from the top down.

I recently completed IBM’s Introduction to Data Science course on edX, which gives you an overview of what data science is and what data scientists do in their day-to-day lives. It targets people who don’t have previous data science experience. As a person who started off with Android development and slowly transitioned to hybrid app development, I’ve mainly focussed on the user experience side of things, and not much on the science part of things.

I mean if you really want to create something that matters as a software engineer, you will need to create an app right??

Well, that might not always be the case.

As you will see in this post, there are many other things that you can do using, say, data science, that will not need you to develop any end-user application.

That being said, things like Machine Learning, Big Data, Deep Learning, Data Visualization, Data Analytics, Classification, and Regression all sound like interesting stuff, but they don’t make much sense if you don’t actually know what they are and what they are used for. Well, that was my case.

Therefore in this post, I will be giving an overview of what data science is, different paths to data science, skills that are needed to work in data science, and different fields within data science, etc… So without further ado let’s begin!

What is data science?

Data Science is a process, not an event. It is the process of using data to understand different things, to understand the world.

Data science is the art of uncovering the insights and trends that are hiding behind data. It’s when you translate data into a story. So use storytelling to generate insight. And with these insights, you can make strategic choices for a company or an institution.

IBM Introduction to Data Science (eDX)

To put it in my own words, data science is where you work with data, analyze it, put it in a graph, transform it, send it through different processes, and finally get insights and conclusions based on the resulting data.

This is cool in a way since you are actually working with real raw stats that come from different kinds of systems, which you need to use to come up with a theory that solves real-world problems. In a nutshell, it is the Sherlock Holmes of the data world.

The question then remains of what fields come under data science. Is it data science if you don’t work on terabytes’ worth of data? Do things like speech recognition and image recognition come under data science as well? The answer to both these questions is yes. It is such a broad field that it encompasses areas that we might not expect to come under data science.

Big Data

In this digital world, everyone leaves a trace. From our travel habits to our workouts and entertainment, the increasing number of internet-connected devices that we interact with on a daily basis record vast amounts of data about us.

So does the Big in Big Data refer to just the size of the data being stored here? Well, that is just one of five aspects. In Big Data we have what are known as the 5 Vs of Big Data. They are:

  1. Velocity is the speed at which data accumulates.
  2. Volume is the scale of the data or the increase in the amount of data stored.
  3. Variety is the diversity of the data. This might refer to structured data with tables, columns, and rows, and unstructured data like Tweets, blog posts, pictures, music, and videos. Variety also reflects that data comes from different sources (i.e. machines, people, and processes).
  4. Veracity is the quality and origin of data and its conformity to facts and accuracy. Attributes include consistency, completeness, integrity, and ambiguity.
  5. Value is our ability and need to turn data into value. The main reason that people invest time to understand Big Data is to derive value from it.

A Popular Tool used for Big Data:

The first tool that comes to mind is Hadoop, and for good reason: it is popular for doing one thing well, which is implementing the MapReduce programming model to distribute the processing of large datasets across many machines. Hadoop is less about AI and more about distributed computing infrastructure. It is therefore something that enables analysis rather than being the analysis itself.
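To make the MapReduce idea a bit more concrete, here is a minimal sketch of the classic word-count example in plain Python. It is not Hadoop code and does not use any Hadoop API; the function names (`map_phase`, `shuffle_phase`, `reduce_phase`, `word_count`) and the sample documents are made up for illustration. Everything here runs on one machine, whereas a real cluster would run the map and reduce steps on different nodes.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    return [(word.lower(), 1) for word in document.split()]


def shuffle_phase(pairs):
    """Shuffle: group all emitted values by their key (the word)."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(grouped):
    """Reduce: combine the values of each key into a single count."""
    return {key: sum(values) for key, values in grouped.items()}


def word_count(documents):
    # Hypothetical driver: on a cluster each document (or block of data)
    # would be mapped on a different machine; here we simply loop.
    pairs = []
    for document in documents:
        pairs.extend(map_phase(document))
    return reduce_phase(shuffle_phase(pairs))


documents = ["big data is big", "data science uses data"]
counts = word_count(documents)
print(counts["big"], counts["data"])  # prints: 2 3
```

The point of splitting the work this way is that the map step never needs to see more than one chunk of data at a time, which is what makes it possible to spread the job across many machines.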

The term Big Data refers to data sets that are so massive, so quickly built, and so varied that they defy traditional analysis methods such as you might perform with a relational database.

Machine Learning and Deep Learning

These are also popular buzzwords in the IT world and are closely related to data science. Both terms belong more to AI than to analytics, which is the basis for Big Data. So what is the difference between the two?

Machine learning is a subset of AI that uses computer algorithms to analyze data and make intelligent decisions based on what it has learned, without being explicitly programmed.

On the other hand:

Deep learning is a specialized subset of Machine Learning that uses layered neural networks to simulate human decision-making.

So basically, deep learning is just machine learning with layered neural networks. What is a neural network, you ask? It is a set of algorithms loosely modelled on the way the human brain operates. In practice, it is a system that takes incoming data and learns to make decisions over time.
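To give a feel for what "learns to make decisions over time" means, here is a small sketch of a single artificial neuron (a perceptron), which is the basic building block that layered networks stack many times over. This is a toy written in plain Python, not a deep learning model and not taken from the course; the function names, the AND-gate data, and the choice of 20 training rounds are all assumptions made for illustration.

```python
def predict(weights, bias, inputs):
    """Fire (return 1) when the weighted sum of the inputs is positive."""
    total = bias + sum(w * x for w, x in zip(weights, inputs))
    return 1 if total > 0 else 0


def train(samples, epochs=20, learning_rate=1):
    """Perceptron learning rule: nudge the weights after every mistake."""
    weights = [0, 0]
    bias = 0
    for _ in range(epochs):
        for inputs, target in samples:
            error = target - predict(weights, bias, inputs)
            # error is 0 when the guess was right, so nothing changes then
            weights = [w + learning_rate * error * x
                       for w, x in zip(weights, inputs)]
            bias += learning_rate * error
    return weights, bias


# Toy training data: the logical AND of two inputs
and_gate = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]

weights, bias = train(and_gate)
predictions = [predict(weights, bias, inputs) for inputs, _ in and_gate]
print(predictions)  # prints: [0, 0, 0, 1]
```

Nobody wrote an "if both inputs are 1" rule here. The neuron starts out knowing nothing and adjusts its weights every time it gets an answer wrong, which is the "without being explicitly programmed" part in miniature.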

The thing to note here is that AI is not the same as data science, and one is not the subset of the other either. Instead, they are separate fields that have an intersection where one uses the tools from the other to perform its own goals.

There is some intersection between AI and data science, but one is not a subset of the other. Rather, data science is a broad term that encompasses the entire data processing methodology. While AI includes everything that allows computers to learn how to solve problems and make intelligent decisions.

A Popular Tool used for Machine Learning in Data Science:

A tool that comes to mind is TensorFlow. It lets you build a machine learning model that takes source data as input, processes it, and produces an output. This includes tasks like clustering or classifying images, videos, and text into different groups.

Data Science in Business

Alright, so you have these different technologies that help you do research and get insights into real-world happenings. But what value do businesses get from it, and is any of that value monetary?

Data Science and Big Data are making an undeniable impact on businesses, changing day-to-day operations, financial analytics, and especially interactions with customers.

To give you an example, online retailers like Amazon, Newegg, Flipkart, Snapdeal, and eBay all have a recommendation engine that works using Big Data and Machine Learning combined. This provides monetary gain to them, since they learn about each customer’s interests from their past activity, and this helps them show the customer the kind of products that they are likely to purchase.
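As a rough idea of how past activity can turn into a recommendation, here is a tiny sketch that suggests items bought by customers with overlapping purchases. It is a deliberately naive illustration in plain Python and not how any of the retailers above actually do it; the `recommend` function, the customer names, and the purchase data are all invented for this example.

```python
from collections import Counter


def recommend(purchase_history, customer, top_n=2):
    """Suggest items bought by customers whose purchases overlap with ours."""
    own_items = purchase_history[customer]
    scores = Counter()
    for other, items in purchase_history.items():
        if other == customer:
            continue
        overlap = len(own_items & items)  # how similar is this customer?
        if overlap == 0:
            continue
        for item in items - own_items:    # only items we don't own yet
            scores[item] += overlap
    # Highest score first; ties broken alphabetically for a stable order
    ranked = sorted(scores.items(), key=lambda pair: (-pair[1], pair[0]))
    return [item for item, _ in ranked[:top_n]]


# Made-up purchase data, stored as one set of items per customer
purchase_history = {
    "alice": {"laptop", "mouse", "keyboard"},
    "bob": {"laptop", "mouse", "monitor"},
    "carol": {"mouse", "headset"},
    "dave": {"novel"},
}

print(recommend(purchase_history, "alice"))  # prints: ['monitor', 'headset']
```

Bob shares two purchases with Alice, so his monitor ranks above Carol's headset. Real engines work with far more data and far smarter models, but the underlying idea of "people like you also bought this" is the same.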

Another example of a business use case is Netflix, which makes use of its users’ watch activity to get an insight into which movies and TV shows work and which don’t. This helps them make informed decisions and investments, such as purchasing exclusive licenses and sometimes even entire studios.

Data Science as a Career

To be honest, I’m not the best person to talk about career development in Data Science, since I’m first of all quite young, and secondly have not worked in the Data Science field. That being said, I have in fact interacted with people who have worked in Data Science, and from what I’ve seen it looks like it is rapidly rising and in high demand.

The main reason for this is the exponential increase in the amount of data being created and stored in the world every single day. In roughly twenty-five years we’ve gone from 128 KB memory cards (PlayStation 1) to 1 TB SSD expansion cards (Xbox Series X), and the difference in data on the internet is even more dramatic due to the presence of the cloud.

Because of all this, we don’t see data science as a field dying anytime soon, at least not until data ceases to exist.

That’s all for today. Meet y’all again in another 101 post.