Data Science Techniques and Their Usage in Real-Life Applications

Summary:

Data Science is an umbrella term for anything that deals with the science of analyzing or exploring data. Even though Data Science as a field comprises multiple branches, in recent years it has become synonymous with its most famous branches, namely Artificial Intelligence and Machine Learning. Data Science techniques range from simple exploratory analysis of data to the application of Generative AI models and beyond.

In this paper we cover some of the most commonly used Data Science techniques and how they can be used to solve real-life problems. We also simplify some of the most commonly used techniques in ML and AI and identify the kinds of problems that can be solved using them.

We deliberately avoid delving deep into the math behind these algorithms and focus on the practical applications of the techniques, as the math would distract from the core objective of this paper. We have structured the paper to focus primarily on solving problems in the marine and shipping industry as our industry of choice.

Demystifying Data Science:

With all the hype around Artificial Intelligence and Machine Learning, we thought it would be appropriate to start by demystifying what is actually behind the hype and clarifying what happens behind the scenes. Data Science at its core is a clever application of math (mostly algebra) and statistics to identify hidden insights and patterns within the data that are not obvious to the user at first glance. The math applied can range from the simple theorem behind a Naïve Bayes classifier (Bayes' theorem on conditional probability) to the more complex math of Deep Learning (the chain rule of calculus). Almost all Machine Learning algorithms follow the flow below.

  1. Formulate the problem statement as some kind of equation
  2. Find the parameters of the equation by running a constrained optimization over the problem-statement equation with the help of training data. In simple terms, solve for the parameter values that best fit the training data.
  3. Apply the parameters learnt in step 2 to the test data
  4. Quantify the error on the test data, adjust the parameters accordingly, and repeat from step 2

The above steps apply to a whole host of Machine Learning algorithms, with a slight twist in the case of Deep Learning: there, the form of the equation itself is learnt while training the model. This might seem confusing at first, but it will become clearer as we go through the paper.
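To make the four steps concrete, below is a minimal sketch in Python, with synthetic data and an assumed straight-line model chosen purely for illustration, that fits the equation y = w*x + b using gradient descent as the optimization:

    import numpy as np

    # Synthetic training data around the line y = 3x + 2 (illustrative only)
    rng = np.random.default_rng(0)
    x_train = rng.uniform(0, 10, 100)
    y_train = 3.0 * x_train + 2.0 + rng.normal(0, 1, 100)

    # Step 1: formulate the problem as an equation y = w*x + b
    w, b = 0.0, 0.0

    # Steps 2-4: optimize the parameters on the training data, quantify
    # the error, adjust the parameters, and repeat
    learning_rate = 0.01
    for step in range(5000):
        y_pred = w * x_train + b
        error = y_pred - y_train                       # quantify the error
        w -= learning_rate * (error * x_train).mean()  # adjust parameters
        b -= learning_rate * error.mean()

    print(f"learnt w={w:.2f}, b={b:.2f}")  # close to the true 3.0 and 2.0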

Machine Learning algorithms are commonly classified into the following major subgroups:

  1. Unsupervised Learning
  2. Supervised Learning
  3. Reinforcement Learning

Unsupervised Learning:

Unsupervised Learning refers to a class of Machine Learning algorithms that learn patterns from data that has not been labelled or classified by the user. In other words, Unsupervised Learning algorithms identify patterns in the data on their own, without human intervention.

Fast Fourier Transforms (FFT):

When it comes to Unsupervised Learning, most people would start with techniques such as clustering algorithms or feature-engineering algorithms such as PCA, but we made a conscious decision to start with FFTs. Not many analysts have used an FFT in their work; however, it is one of the most important algorithms in signal processing and data analysis.

At a high level, the FFT is used to transform a signal or time-series data from the time domain to the frequency domain: the Fourier transform takes in data and gives out the frequencies that the data contains. This technique is especially useful for analyzing the time-series data from the sensors fitted across different equipment onboard vessels.
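As a concrete illustration, here is a minimal sketch using NumPy's FFT on a synthetic "sensor" signal; the 5 Hz and 12 Hz components are illustrative assumptions, not real equipment data:

    import numpy as np

    sample_rate = 100.0                     # samples per second
    t = np.arange(0, 10, 1 / sample_rate)   # 10 seconds of readings

    # Synthetic vibration signal: 5 Hz and 12 Hz components plus noise
    signal = (np.sin(2 * np.pi * 5 * t)
              + 0.5 * np.sin(2 * np.pi * 12 * t)
              + 0.2 * np.random.randn(t.size))

    spectrum = np.fft.rfft(signal)          # time domain -> frequency domain
    freqs = np.fft.rfftfreq(t.size, d=1 / sample_rate)

    # The two strongest peaks sit near the 5 Hz and 12 Hz components
    peaks = freqs[np.argsort(np.abs(spectrum))[-2:]]
    print(sorted(peaks.round(1)))           # e.g. [5.0, 12.0]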

Clustering:

Clustering is one of the simplest and most useful techniques in Machine Learning. It deals with grouping the members of a population into clusters such that the members within a cluster have properties that are very similar to each other. There are two broad types of clustering: 1) hard clustering, where each data point belongs exclusively to one cluster, and 2) soft clustering, where a data point can belong to multiple clusters at the same time.

One of the most important clustering algorithms is K-Means, where the data points are divided into K different clusters. The clusters are initialized randomly and refined iteratively to obtain the final set of clusters. Even though clustering as a standalone algorithm has limited usage, when combined with other machine learning algorithms in an ensemble model it has a wide variety of applications, such as personalized recommendation engines, anomaly detection and outlier removal.
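The sketch below shows K-Means in action with scikit-learn on synthetic two-dimensional points; the three blob centres and the choice of K = 3 are illustrative assumptions:

    import numpy as np
    from sklearn.cluster import KMeans

    # Three synthetic blobs of points around different centres
    rng = np.random.default_rng(42)
    data = np.vstack([
        rng.normal(loc=(0, 0), scale=0.5, size=(50, 2)),
        rng.normal(loc=(5, 5), scale=0.5, size=(50, 2)),
        rng.normal(loc=(0, 5), scale=0.5, size=(50, 2)),
    ])

    # Clusters are initialised and then refined iteratively
    kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(data)
    print(kmeans.cluster_centers_.round(1))  # near (0,0), (5,5) and (0,5)
    print(kmeans.labels_[:5])                # cluster assignment per point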

Dimensionality Reduction Algorithms:

Dimensionality reduction algorithms refer to the class of algorithms used to reduce the number of features, or dimensions, in the input data while the reduced dataset still holds the majority of the information in the original dataset. This set of algorithms is especially important now that we are flooded with information. These algorithms help us address the curse of dimensionality, which states that as dimensionality (the number of input variables) increases, the volume of the space grows exponentially, resulting in sparse data. Some of the best-known dimensionality reduction techniques are as follows:

  1. Principal Component Analysis – an algorithm that uses eigenvalues and eigenvectors to reduce the dimensions of the data
  2. Linear Discriminant Analysis – which projects the data onto linear combinations of the original variables, thereby bringing down the number of features

Dimensionality reduction techniques have a wide range of applications. Essentially, they apply whenever we have a large dataset with hundreds or thousands of columns. Some practical applications of these algorithms, with a short PCA sketch after the list, include

  1. Data Compression and Decompression
  2. Image Analysis
  3. Weather Data Analysis etc
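Here is a minimal PCA sketch with scikit-learn: a synthetic 10-dimensional dataset whose variance mostly lives in two directions is compressed down to 2 components while retaining most of the information (the data and sizes are illustrative):

    import numpy as np
    from sklearn.decomposition import PCA

    # Synthetic data: 200 samples, 10 features, variance mostly in 2 directions
    rng = np.random.default_rng(7)
    latent = rng.normal(size=(200, 2))
    mixing = rng.normal(size=(2, 10))
    data = latent @ mixing + 0.05 * rng.normal(size=(200, 10))

    pca = PCA(n_components=2).fit(data)
    reduced = pca.transform(data)               # 200 x 2 instead of 200 x 10

    print(reduced.shape)                        # (200, 2)
    print(pca.explained_variance_ratio_.sum())  # close to 1.0: little info lost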

Supervised Learning:

Supervised Learning refers to the class of Machine Learning algorithms that deal with labelled data. Here the data is divided into train and test sets: the training dataset is used to learn the modelling function that best approximates the behaviour in the data, and the test dataset is used to reasonably determine the effectiveness of the model.

Regression:

Regression models are used to determine the relationship between a set of variables by fitting a linear or non-linear curve to the data. Regression allows you to quantify the amount of change in one variable with respect to a unit change in another. The simplest type of regression is simple linear regression, which fits a straight line between two quantitative variables.

Even though regression analysis might seem simple at first, the topic is huge, with multiple variations ranging from different regularization techniques to different curve-fitting techniques. Covering all these aspects of regression is beyond the scope of this paper.

Regression models are used in such a wide variety of scenarios that it is difficult to list them all. Any application that involves finding a relationship between a set of numerical variables is a good fit for regression. Some real-life examples of regression are listed below, with a sketch of the first one after the list:

  1. In shipping, ship operators frequently use non-linear regression to understand the relationship between ship speed and fuel consumption
  2. In healthcare, linear regression is often used to determine the relationship between drug dosage and blood pressure
  3. In finance, it underpins the CAPM (Capital Asset Pricing Model), which is used to determine the relationship between the market risk premium and expected returns
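As a sketch of the first example, the code below fits a cubic curve between speed and fuel consumption using NumPy's least-squares polynomial fit; the data is synthetic and the cubic coefficient is an illustrative assumption, not real operator data:

    import numpy as np

    # Synthetic noon-report-style data: speed in knots, fuel in tonnes/day
    speed = np.linspace(8, 20, 25)
    fuel = 0.008 * speed**3 + np.random.normal(0, 0.5, speed.size)

    # Fit fuel = a*speed^3 + b*speed^2 + c*speed + d by least squares
    coeffs = np.polyfit(speed, fuel, deg=3)
    model = np.poly1d(coeffs)

    print(model(14.0))  # predicted consumption at 14 knots, roughly 22 tonnes/day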

Classification:

Classification models refer to a whole host of machine learning models used to determine a class or label for a given input data point. It is a technique where we assign each entity to a specific category, and it can be applied to both structured and unstructured data. Classification can be broadly divided into two types:

  1. Multi-class classification, where we assign each incoming data point to exactly one class
  2. Multi-label classification, where an incoming data point can belong to multiple classes at the same time

There is a whole host of classification algorithms out there that apply different mathematical techniques to assign incoming data points to their corresponding classes. Each has its own advantages and disadvantages, and depending on the scenario one might work better than another; it is up to the user to find the algorithm that best fits their scenario. We have identified some of the best-known ones below, with a short sketch after the list:

  1. Decision Trees – an algorithm that uses a tree-based decision flowchart to arrive at the final class
  2. Naïve Bayes Classifier – an algorithm that uses Bayes' theorem on conditional probability to classify the incoming data
  3. Support Vector Machines – a margin-based classifier that finds the boundary with the maximum margin between classes
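As a sketch of a standalone classifier, here is Gaussian Naïve Bayes from scikit-learn applied to the library's built-in iris dataset:

    from sklearn.datasets import load_iris
    from sklearn.model_selection import train_test_split
    from sklearn.naive_bayes import GaussianNB

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Classify flowers by applying Bayes' theorem to the input features
    clf = GaussianNB().fit(X_train, y_train)
    print(clf.score(X_test, y_test))  # accuracy on held-out test data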

The above algorithms are standalone algorithms used to classify data. However, because they act alone, they have their own limitations in terms of accuracy and applicability. This disadvantage can be overcome by a class of classification algorithms called ensemble algorithms. Imagine one student solving a math problem versus an entire classroom: the students can collaboratively solve the problem by checking each other's answers and deciding unanimously on a single answer, whereas the individual has nobody to validate his or her answer if it is wrong. The classroom with several students is thus similar to an ensemble learning algorithm, with several smaller algorithms working together to formulate a final response. Examples of ensemble algorithms, with a comparison sketch after the list, are

  1. Random Forests – an ensemble version of decision trees where multiple trees produce different outcomes and the final outcome is determined by a majority vote
  2. XGBoost – an ensemble learning algorithm that combines the results of many sequentially trained base learners, each correcting the errors of the previous ones, to arrive at the final decision
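The comparison sketch below runs a single decision tree and a random forest on the same iris task; the ensemble of voting trees typically matches or beats the standalone tree:

    from sklearn.datasets import load_iris
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_iris(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # One standalone tree versus 100 trees voting together
    tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)
    forest = RandomForestClassifier(n_estimators=100,
                                    random_state=0).fit(X_train, y_train)

    print("single tree:", tree.score(X_test, y_test))
    print("forest vote:", forest.score(X_test, y_test))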

Like regression, classification has applications in such a wide variety of scenarios that it is difficult to list them all. Essentially, any problem statement that requires us to assign a label to a data point uses classification. Some real-world use cases for classification are

  1. Spam Classification in mail boxes
  2. Movie Genre Identification
  3. Product Category assignment in E-Commerce sites

Deep Learning:

Deep Learning is a subcategory of Supervised Machine Learning that uses deep neural networks to either classify the input data or quantify the relationships within it. It differs from other machine learning algorithms in that it requires little manual design with respect to the problem statement: it learns the form of the equation along with the parameters of the equation during the training phase.

Deep Learning algorithms stack layers of neurons on top of one another. When provided with a set of examples, say images of people, such a network can go from finding simple features, such as edges and contours, to more complicated features, such as eyes, noses, ears, faces and bodies. Each layer of a neural network is composed of neurons, each of which forms an individual computational unit. The algorithm works by adjusting the weights of each neuron so that the network can detect common patterns in the input data.

There are different classes of neural networks available. Some of the best-known ones, with a minimal training sketch after the list, are as follows

  1. Fully Connected Neural Networks – basic neural networks that use fully connected layers
  2. Convolutional Neural Networks – networks that use convolutional layers to learn spatial patterns; used predominantly in image analysis
  3. Recurrent Neural Networks – a class of neural networks used to learn temporal patterns in the data; used predominantly in text analysis and NLP
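The minimal sketch below trains a small fully connected network using scikit-learn's MLPClassifier on the library's built-in 8x8 digit images; the two hidden-layer sizes are illustrative choices:

    from sklearn.datasets import load_digits
    from sklearn.model_selection import train_test_split
    from sklearn.neural_network import MLPClassifier

    X, y = load_digits(return_X_y=True)  # 8x8 images of handwritten digits
    X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

    # Two hidden layers of neurons; the weights of each neuron are adjusted
    # during training to detect common patterns in the images
    net = MLPClassifier(hidden_layer_sizes=(64, 32), max_iter=500,
                        random_state=0).fit(X_train, y_train)
    print(net.score(X_test, y_test))     # accuracy on held-out digits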

Deep Learning has become pervasive in our day-to-day lives. In fact, it is the prime factor driving the Data Science and Artificial Intelligence revolution in the 21st century. Some of the applications of Deep Learning are as follows.

  1. Content Recommendation Engines
  2. Self-Driving Cars
  3. Chatbots
  4. Facial Recognition Software
  5. Language Translation Engines etc

About the author

Sashi is a Sr. Data Scientist at Alpha Ori Technologies.
Contact: sasidharan@alphaori.sg
