An Introduction to Data Science

Exploratory Data Analysis

The first step in any data project is exploring the data. Classical statistics focused largely on inference: drawing conclusions about a large population from a small sample. Data analysis as a broader discipline was the brainchild of John Tukey, who is widely credited with establishing exploratory data analysis as a field in its own right.

In Tukey's framing, statistical inference was just one part of the analysis. The tenets of his work form the basis of data science as we know it today. As computers have become faster and programming languages more accessible, both syntactically and semantically, data science has flourished, with data analysis at its roots.

Elements of Unstructured Data

Data comes from many sources. Our focus here is on identifying patterns in it, so understanding what the data consists of is essential. Most of it arrives in one of these formats:

  1. Images: An image is a matrix of pixels, often arriving as a stream of many images that all need to be analyzed before any reliable inference can be drawn.

  2. Text: Textual data makes up the vast majority of what is available and is the oldest format on the internet, so being able to draw inferences from it is crucial.

  3. Clickstreams: One of the most overlooked kinds of data, a clickstream is the sequence of clicks a user makes on a webpage or in a piece of software.

One of the challenges of data analysis is turning this raw torrent of data into a structured form. By structured we mean something like a relational database: tables made up of rows and columns.
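As a rough illustration of that conversion, the sketch below turns a few raw clickstream lines into rows with named columns. The log format, the sample values, and the names `RAW_CLICKS`, `Click`, and `to_rows` are all assumptions made up for this example; real clickstream logs will look different.

```python
from collections import namedtuple

# Hypothetical raw clickstream: one "timestamp user element" string per click.
RAW_CLICKS = [
    "2024-01-01T10:00:00 alice home_button",
    "2024-01-01T10:00:05 alice search_box",
    "2024-01-01T10:00:07 bob home_button",
]

# Each structured row has the same three named columns.
Click = namedtuple("Click", ["timestamp", "user", "element"])


def to_rows(lines):
    """Split each whitespace-separated log line into a structured row."""
    return [Click(*line.split()) for line in lines]


rows = to_rows(RAW_CLICKS)
for row in rows:
    print(row.timestamp, row.user, row.element)
```

Once every record has the same columns, the data can be loaded into a table and queried like any other relational data.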

Types of Data

There are two basic types of data: continuous and categorical. Continuous data can take any value within a range, such as the price of a house or a good. Categorical data is discrete: it can take only a fixed set of values, such as a yes/no answer, a class label in a multi-class problem, or the on/off state of an LED sensor.
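A minimal sketch of telling the two types apart might look like the following. The function `classify_column` and the sample columns are hypothetical, and the rule it uses (all-numeric means continuous) is only a rough heuristic: integer codes that stand for categories would be mislabeled.

```python
def classify_column(values):
    """Label a column 'continuous' if every value is a number, else 'categorical'.

    Booleans are treated as categorical (yes/no), even though Python
    considers bool a subclass of int.
    """
    is_numeric = all(
        isinstance(v, (int, float)) and not isinstance(v, bool) for v in values
    )
    return "continuous" if is_numeric else "categorical"


# Hypothetical sample columns for illustration.
house_prices = [250000.0, 310500.5, 199999.0]
has_garage = ["yes", "no", "yes"]

print(classify_column(house_prices))
print(classify_column(has_garage))
```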

Why do we need types?

We bother with the taxonomy of data because the type determines how the data is displayed visually, how it is analyzed, and which predictive models apply. Popular and powerful tools such as Python and R also need to know which type of data they are working with in order to handle it correctly.
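To make this concrete, here is a small sketch in which the declared type decides which summary is computed: a range and mean for continuous values, frequency counts for categorical ones. The function `summarize` and its `kind` parameter are assumptions for this example, not part of any library.

```python
from collections import Counter
from statistics import mean


def summarize(values, kind):
    """Return a summary appropriate to the declared data type."""
    if kind == "continuous":
        # Numeric summaries only make sense for continuous data.
        return {"min": min(values), "mean": mean(values), "max": max(values)}
    if kind == "categorical":
        # For categories, count how often each value occurs.
        return dict(Counter(values))
    raise ValueError(f"unknown data type: {kind!r}")


# Hypothetical sample data for illustration.
print(summarize([100, 200, 300], "continuous"))
print(summarize(["yes", "no", "yes"], "categorical"))
```

Asking for the mean of the yes/no column would be meaningless, which is why the type has to be known before the analysis is chosen.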
