Sunday, August 18, 2019

Multiple Linear Regression


Multiple Linear Regression is a process that uses multiple explanatory variables to predict the outcome of a response variable. Its purpose is to model the linear relationship between several explanatory variables and the response variable.
It is an extension of Simple Linear Regression that involves more than one independent variable.

The Formula for Multiple Linear Regression is : Yi  = β0 + β1x1 +  β2x2 + … + βnxn + ϵ

Multiple linear regression model with n independent variables and one dependent variable Y.

Here,
Yi <- the dependent ( response ) variable
X1, X2, ….., Xn <- the independent / explanatory variables
β1, β2, ….., βn <- the model parameters ( coefficients of the explanatory variables )
β0 <- the constant, or intercept
ϵ <- the error term ( residuals )
As explained earlier, the multiple regression model extends to several explanatory variables. Simple linear regression makes predictions about one variable based on the information that is known about one other variable, whereas Multiple Linear Regression predicts using the information that is known about multiple variables.


The basic purpose of least-squares regression is to fit a hyperplane in ( n+1 ) dimensions that minimizes the SSE.
SSE <- Sum of Squares of residual Errors : ∑( Yi − Yihat )^2, where Yihat is the predicted value
SSR <- Sum of Squares due to Regression : ∑( Yihat − Ymean )^2, where Ymean is the mean of the observed Y values
SST <- Total Sum of Squares : ∑( Yi − Ymean )^2, the total squared deviation of Yi from its mean

SST = SSR + SSE

R-Squared : The coefficient of determination ( R-Squared ) is the square of the correlation coefficient between the dependent variable Y and the fitted values Yhat, i.e. the correlation between the response variable and the model's predictions.

Generally,    R-Squared = SSR / SST . Substituting SSR = SST - SSE from the identity above,

the R-Squared formula becomes  1 - SSE / SST
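
As a quick sanity check, these quantities can be computed directly in R. This is only a sketch, assuming an already fitted lm model called fit and its response vector y ( hypothetical names, used purely for illustration ):

# A minimal sketch: verifying R-Squared = 1 - SSE / SST for a fitted lm model 'fit'
# with response vector 'y' (hypothetical names, not from the code below).
y_hat <- fitted(fit)                  # predicted values Yihat
sse   <- sum((y - y_hat)^2)           # Sum of Squares of residual Errors
ssr   <- sum((y_hat - mean(y))^2)     # Sum of Squares due to Regression
sst   <- sum((y - mean(y))^2)         # Total Sum of Squares; equals ssr + sse
1 - sse / sst                         # should match summary(fit)$r.squared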

Taking the 50-startups dataset as an example, we will predict the profit of companies located in different states.
We will also see which variables are most strongly correlated with the response variable.

Steps to predict using Multiple Linear regression
  1. Import the data file
  2. Validate the data and look at the categorical variables
  3. Apply OneHotEncoding to the categorical variables. I will explain how this works in a separate blog.
  4. Split the dataset into Train and Test sets
  5. Fit the model using Multiple Linear Regression

Here, using all the variables to fit the model might not give good accuracy, because some variables may not be significant or correlated with the response variable and hence may not add much value to the prediction.
Hence, there are several approaches to find the variables that contribute most to building an appropriate model.
  1. Use All   -  Use all the variables to predict the response variable, then eliminate variables based on their significance ( P-Value )
  2. Stepwise Regression -
    1. Backward Elimination : the following steps build a model using the Backward Elimination process ( a sketch of this loop appears after the list below )
      1. Select a significance level ( P-Value threshold ) - ( 0.5 - 5 ) %
      2. Fit the model with all variables
      3. Remove the least significant variable, i.e. the one whose P-Value is highest and above the threshold
      4. Re-fit the model
      5. Repeat until all remaining variables are significant with respect to the Response variable

    2. Forward Selection :
      1. Select a significance level ( P-Value threshold ) - ( 0.5 - 5 ) %
      2. Fit the model with the first variable
      3. Add the most significant remaining variable, i.e. the one whose P-Value is lowest and below the threshold
      4. Re-fit the model
      5. Repeat until none of the remaining variables is significant with respect to the Response variable

  3. Score Comparison - Compare the P-Values and eliminate the variables one by one.
  4. Select Goodness of Fit ( Akaike Information Criterion ) : use the stepAIC method to find the set of variables that are most significant for building a model.
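
Below is a minimal sketch of how the backward-elimination loop could be automated in R. It is only an illustration, not the only way to do it: it assumes a data frame with the response Profit and only numeric predictor columns ( one coefficient per predictor ), such as a subset of the trainingset created in the code below.

# Illustrative backward elimination by p-value (a sketch, numeric predictors only).
backwardElimination <- function(data, response = "Profit", sl = 0.05) {
  predictors <- setdiff(names(data), response)
  repeat {
    f <- as.formula(paste(response, "~", paste(predictors, collapse = " + ")))
    model <- lm(f, data = data)
    pvals <- summary(model)$coefficients[-1, "Pr(>|t|)"]   # drop the intercept row
    names(pvals) <- predictors
    if (max(pvals) <= sl || length(predictors) == 1) return(model)
    predictors <- setdiff(predictors, names(which.max(pvals)))   # remove least significant
  }
}
# Example (hypothetical column selection):
# summary(backwardElimination(trainingset[c("R.D.Spend", "Administration",
#                                           "Marketing.Spend", "Profit")]))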

You can find the code at GitHub

Looking at the code in R :- 


# Multiple Linear Regression
# Created by Deepak Pradeep Kumar

#Setting working directory

setwd("<<Set Your Folder here>>")

# Importing the dataset
dataset = read.csv("50_Startups.csv")

#Encoding the Categorical variable
dataset$State = factor(dataset$State,
                       levels = c('New York','California','Florida'),
                       labels = c(1,2,3))

# Splitting the dataset into trainingset and testset
library(caTools)
set.seed(101)
split = sample.split(dataset$Profit, SplitRatio = 0.8)
trainingset = subset(dataset, split == TRUE)
testset = subset(dataset, split == FALSE)

# Feature Scaling is not needed here; the lm() fitting function handles the variables on their original scale.
# Fitting Multiple Linear Regression to training set

regressor = lm(formula = Profit ~ R.D.Spend + Administration + Marketing.Spend + State,
               data = trainingset)

summary(regressor)

#Removing Administration and Fitting the model
regressor1 = lm(formula = Profit ~ R.D.Spend  + Marketing.Spend + State,
               data = trainingset)

summary(regressor1)

#Removing State and Fitting the model 
#Ideally, we would add Administration back, remove only State, and re-fit the model. In this case it is fine to remove both variables
regressor2 = lm(formula = Profit ~ R.D.Spend + Marketing.Spend,
               data = trainingset)

summary(regressor2)

#Removing Marketing.Spend as well and fitting the model
regressor3 = lm(formula = Profit ~ R.D.Spend,
               data = trainingset)

summary(regressor3)

# Backward Elimination using StepAIC method
# Use this method immediately after Fitting the model
#install.packages('MASS')
library(MASS)
stepAIC(regressor)


# Predicting the TEST SET Results
y_predictor = predict(regressor, newdata = testset)
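
As an optional extra check ( not part of the original script ), we can compare the predictions with the actual profits and compute an R-squared on the hold-out test set, using the y_predictor and testset objects above:

# Optional: compare predictions with actuals and compute test-set R-squared
data.frame(actual = testset$Profit, predicted = y_predictor)
sse_test <- sum((testset$Profit - y_predictor)^2)
sst_test <- sum((testset$Profit - mean(testset$Profit))^2)
1 - sse_test / sst_test    # R-squared on the test set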

Wednesday, August 14, 2019

Assumptions of Linear Regression & Hypothesis Testing

In the previous blog we learnt how to use Linear Regression to predict a response variable with only one predictor / independent variable. As there was only one response variable and one predictor, it was termed Simple Linear Regression.
To keep the mathematical and computational techniques used for prediction simple, we apply certain generic assumptions to the dataset for this regression :-

  1. Linearity -  The independent and dependent variables should have a linear relationship with each other. The relationship between variables can be understood or tested by plotting them on a graph.

  2. Homoscedasticity - The variance of the residuals is finite and roughly constant across the range of the data. A scatter plot can show whether the data are homoscedastic or not.

  3. Lack of Multicollinearity - Multicollinearity occurs when two or more independent variables are highly correlated with each other. This does not occur here, as there is only one independent variable in the dataset.

  4. Multivariate Normality -  The residuals cluster around a mean value, i.e. they are approximately normally distributed. This assumption can be tested by plotting a histogram, which shows the normality ( or left / right skewness ) of the data.

  5. Independence of Errors  - Linear regression analysis requires that the residuals are independent of each other. Autocorrelation occurs when the residuals are not independent of each other; this can be tested with a hypothesis test. A few quick visual checks for these assumptions are sketched below.
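
The sketch below shows a few of these checks visually in R, assuming the simple linear regression model LinearRegressor and the trainingset from the previous post's Salary example:

# Quick visual checks of the assumptions (a sketch, assuming 'LinearRegressor'
# and 'trainingset' from the previous post's Salary example).
plot(trainingset$YearsOfExperience, trainingset$Salary)    # linearity: roughly a straight-line trend?
plot(fitted(LinearRegressor), residuals(LinearRegressor))  # homoscedasticity: spread should not fan out
hist(residuals(LinearRegressor))                           # normality: residuals roughly bell-shaped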

Hypothesis Testing :

A hypothesis is a statement or an assumption about the relationship between variables.
The criteria for constructing a hypothesis are:
  1. It should be precise and not contradictory
  2. It should be testable, i.e. it can be shown to be right or wrong
  3. It should specify the variables between which the relationship is established.

Types of Hypothesis :

  1. Null Hypothesis ( H0 )  - the Null Hypothesis claims that there is no difference in the population of data.
  2. Alternative Hypothesis ( Ha or H1 ) - the Alternative Hypothesis claims that H0 is false.

Hypothesis testing is also called Significance testing.

The major purpose of hypothesis testing is to choose between two competing hypotheses about the value of a population parameter. For example:

One hypothesis says the salaries of men and women at every level are equal, while the alternative claims that this is false.
The Null Hypothesis is usually referred to with the symbol H0, and the other hypothesis, which is assumed to be true when the null hypothesis is proved false, is referred to as the Alternative Hypothesis.

Generally, for the sake of convenience we state the Null Hypothesis as an equality for the population parameter, so that, when it is proved false, the alternative will be either less than or more than the value associated with the null hypothesis.

Ideally, true value of the population parameter should be included in the set specified by H0 or in the set specified by H1.

There's a very famous illustrative example of "body weight", where men in the 20 - 29 year age group in the US had a mean body weight of 170 pounds, with standard deviation σ = 40 pounds.
So, here we test the
  1. Null hypothesis H0 : µ = 170 pounds
  2. Alternative hypothesis H1 : µ ≠ 170 pounds ( i.e. µ > 170 pounds or µ < 170 pounds ).

Let's say we take multiple random samples of data to test the Null hypothesis.
Each random sample has size n = 64.
Based on the Sampling Distribution of the Mean ( SDM ) :  Xbar ~ N( µ, SE )   where SE =  σ / √n

Applying this formula to calculate the Zstat :-   Zstat =  ( Xbar - µ0 ) / SE  ( where µ0 = the population mean under H0 )

Test Statistic Zstat = ( Xbar - µ0 ) / ( σ / √n )

Now, we will take a few samples from the population, with sample means Xbar of 172, 183, and 164.
Substituting the values of  µ0 = 170,  σ = 40, and SE = 40 / √64 = 5  for each sample,
we get the Zstat scores as follows
  1. Test Statistic Zstat1  =  ( 172 - 170 ) / 5  = 0.4
  2. Test Statistic Zstat2  = ( 183 - 170 ) / 5   = 2.6
  3. Test Statistic Zstat3  = ( 164 - 170 ) / 5 = -1.2

Now we calculate the P-Value - the AUC ( Area Under Curve ) of the standard normal distribution beyond the test statistic found above.

P-Value :- the probability of obtaining a test statistic at least as extreme as the one observed, assuming that H0 ( the Null hypothesis ) is true.

P-Value is a calculation that we make during the hypothesis testing to determine if we reject the null hypothesis or fail to reject it.

How to find the P-Value :   the P-Value is calculated by first determining the Zstat score, then finding the corresponding probability in a standard normal distribution table, and finally interpreting the result by comparing the p-value to the level of significance.
A normal distribution table is printed in the back of most statistics textbooks and is also easily available on the internet.
The Normal Distribution chart can be found here :

Calculating the p-Value :-  the standard normal table gives the cumulative probability Φ( Zstat )

Φ( Zstat1 = 0.4 )  =  0.6554
Φ( Zstat2 = 2.6 )  =  0.9953
Φ( Zstat3 = -1.2 ) =  0.1151

Since H1 here is two-sided ( µ ≠ 170 ), the P-Value is the AUC ( Area Under Curve ) in the tails of the standard Normal Distribution beyond |Zstat|, i.e.  P = 2 * ( 1 - Φ( |Zstat| ) )

P-Value for Zstat1 ≈ 0.69
P-Value for Zstat2 ≈ 0.009
P-Value for Zstat3 ≈ 0.23

As mentioned above, the P-Value is the probability of a test statistic at least as extreme as the one observed when the Null Hypothesis H0 is true.
Hence, the smaller the P-Value, the stronger the evidence against H0.

In this scenario, suppose we accept a 5% probability of erroneously rejecting H0; in other words, we set the threshold Ɛ = 5% = 0.05 for erroneously rejecting H0.
  1. Reject H0 if P <= Ɛ
  2. Retain H0 if P > Ɛ

Example :-  With Ɛ = 0.05, the first sample ( P ≈ 0.69 ) and third sample ( P ≈ 0.23 ) retain H0, while the second sample ( P ≈ 0.009 ) rejects H0.


SUMMARIZING THE EVENTS FOR HYPOTHESIS TESTING :-

  1. State H0 and H1
  2. Specify the level of significance ( Ɛ )
  3. Decide the appropriate sampling distribution
  4. Calculate Zstat ( the test statistic ) using the formula mentioned above
  5. Look up the P-Value of the Zstat ( test statistic )
  6. If the P-Value is smaller than Ɛ, reject H0; otherwise retain H0 ( see the R sketch below )
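
The whole flow for the body-weight example can be written as a short R sketch ( the same arithmetic as above, not part of the original post ):

# Body-weight example: z statistics and two-sided p-values in R
mu0   <- 170                            # H0: population mean body weight
sigma <- 40                             # population standard deviation
n     <- 64
se    <- sigma / sqrt(n)                # standard error = 5
xbar  <- c(172, 183, 164)               # the three sample means used above
zstat <- (xbar - mu0) / se              # 0.4, 2.6, -1.2
pval  <- 2 * (1 - pnorm(abs(zstat)))    # two-sided p-values: ~0.69, ~0.009, ~0.23
ifelse(pval <= 0.05, "Reject H0", "Retain H0")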

Thursday, August 8, 2019

Simple Linear Regression with R & Python


Simple Linear Regression is a regression process for finding the relationship between a dependent and an independent continuous quantitative variable. It is called Simple because only one independent / explanatory ( predictor ) variable is used to predict the dependent variable.

The Formula for Simple Linear Regression is  : Y = β0 + β1* X + Ɛ


Here Y is the dependent variable - Ex : how the Salary changes with the No.Of.YearsOfExperience, where Y is the Salary

and X is the independent variable ( Yrs.of.experience ). X is the variable that causes the dependent variable to change: each unit change in X produces a corresponding change in Y.
There is still a possibility that there is no direct causal relationship between the two, but there may still be an implied association between the dependent and independent variable.

β1 <- the coefficient of X; it shows how much Y changes for a unit change in X, which implies that there is a proportional change between X & Y rather than a one-to-one change in units.

β0 <- a constant, the intercept: the value of Y when X = 0, i.e. the point where the regression line crosses the Y axis.

Let us look at this in the graph for plotting Salary vs Yrs.of.experience.

For any prediction using Classic Machine Learning / Neural Network techniques, there are 5 major steps to be followed [ Fundamentals of Machine Learning ]
  1. Finding the data based on problem statement.
    1. Finding appropriate data for correct analysis
  2. Exploratory data analysis
    1. Understand the data spread and outliers
    2. Visualize the data
    3. Preparing your data
    4. Use Feature engineering on the data as necessary for applying suitable algorithms to train the model.
  3. Fit / Train the model - using appropriate algorithm
    1. Use of appropriate algorithm to train/Fit the model
    2. Developing a model that does better than a baseline
    3. Regularizing the model and tuning hyperparameters.
  4. Apply prediction using the trained model.
    1. Use validation & test data for testing the model
  5. Testing & Data Visualization
    1. Validate results / accuracy using test data sets
    2. Use data visualization techniques to visualize the results.

We will follow these steps as applicable, based on the sanity of the data.

We will start with Predicting Salary from the given data with no. of years of experience.

You can find the data file here

You can also find the entire code in this Github repo

Predicting Salary with Simple Linear Regression using R :-


#Import the dataset
dataset = read.csv("Salary.csv")

#install.packages('caTools')
library(caTools)
set.seed(123)

#Splitting the Dataset into Training and Testing dataset.
fragment = sample.split(dataset$Salary, SplitRatio = 2/3, group = NULL)
trainingset = subset(dataset, fragment == TRUE)
testset = subset(dataset, fragment == FALSE)

#Fitting simple Linear model to training set
LinearRegressor = lm(formula = Salary ~ YearsOfExperience,
                     data = trainingset )

summary(LinearRegressor)

<<<<Output of Summary >>>>
#The above command will show summary of the fitted model.
Residuals:
    Min      1Q  Median      3Q     Max
-9279.5 -4339.5  -648.5  3620.7  9748.9
Coefficients:        Estimate Std. Error t value Pr(>|t|)   
(Intercept)               31792.0     1935.7   16.42   <2e-16 ***    <= the intercept β0, i.e. when X = 0, Y = β0
YearsOfExperience   7739.4      161.9   47.79    <2e-16 ***    <=  the *s denote the significance of the variable on the response variable
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 5663 on 31 degrees of freedom
Multiple R-squared:  0.9866,        Adjusted R-squared:  0.9862
F-statistic:  2284 on 1 and 31 DF,  p-value: < 2.2e-16

#Predicting the Test Set Results
Predictor = predict(LinearRegressor, newdata = testset)

# Plotting the graph on test set results using ggplot2
#install.packages('ggplot2')
library(ggplot2)

ggplot() +
  geom_point(aes(x = testset$YearsOfExperience, y = testset$Salary),
             colour = 'blue') +
  geom_line(aes(x = testset$YearsOfExperience, y= Predictor),
            colour = 'red') +
  ggtitle('Salary vs Experience') +
  xlab('YearsOfExperience') +
  ylab('Salary')



With the graph plotted, we can see that there is a difference between the actual Salary and the predicted Salary.
Based on the Multiple R-squared ( 0.9866 ) and Adjusted R-squared ( 0.9862 ) values, we can evaluate the accuracy of the prediction.
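
The Multiple R-squared reported above is the square of the correlation between the response and the fitted values; a quick check of this ( a sketch, assuming the LinearRegressor and trainingset objects above ):

# R-squared equals the squared correlation between response and fitted values
cor(trainingset$Salary, fitted(LinearRegressor))^2    # ~0.9866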


Predicting Salary with Simple Linear Regression using Python :-


For executing the below code in Python, you can use any interface such as Jupyter notebook, PyCharm, Spyder, or plain Notepad++.

I have used Notepad++ and executed the code through the IPython console. The same steps are applied in both R and Python to predict the response variable.

#Importing libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

#Importing the data
dataset = pd.read_csv('Salary.csv')

# Validate the dataset
dataset.describe()

# Split into Dependent and Independent variables
X = dataset.iloc[:,:-1].values
Y = dataset.iloc[:, 1].values

#X = dataset.YearsOfExperience
#Y = dataset.Salary

# No Manipulation or Imputation of data is needed as this is a Simple and clean dataset.
# Splitting the dataset to Train and Test data.
# Note: in recent scikit-learn versions train_test_split lives in sklearn.model_selection
from sklearn.model_selection import train_test_split
Xtrain, Xtest, Ytrain, Ytest = train_test_split(X,Y, test_size = 0.3, random_state = 1)

#Fitting the Simple Linear Regression Model
from sklearn.linear_model import LinearRegression
LinearRegressor = LinearRegression()
LinearRegressor.fit(Xtrain, Ytrain)

# Predicting the Test Set Results
YPredictor = LinearRegressor.predict(Xtest)

# Evaluating the Intercept
LinearRegressor.intercept_

#Plotting the Graph for Predicted Test Set Results
plt.scatter(Xtest, Ytest, color = 'blue')
plt.plot(Xtest, YPredictor, color = 'red')
plt.title('Experience vs Salary')
plt.xlabel('Number of Years of Experience')
plt.ylabel('Salary')
plt.show()

#Predicting for any random value
YPred = LinearRegressor.predict([[16.5]])
print(np.float32(YPred))

#Calculating the R Squared value for accuracy
from sklearn.metrics import r2_score
print("r2_score",np.float32(r2_score(Ytest,YPredictor)*100))



In the next blog, we will talk about the Assumptions of Linear Regression and Hypothesis testing.

Regression Analysis with Simple Linear Regression



Regression analysis is defined as applying a set of statistical processes to estimate the relationship between variables. Many techniques are involved in modeling and analyzing these variables.
When dealing with multiple independent variables, regression analysis helps us understand how the value of the dependent variable changes when there is a variation in any of the independent variables.

We can even understand the variation trend of dependent variables by slightly varying any one of the independent variables and keeping others fixed.

Let’s start with Simple Linear Regression and proceed further with multiple linear regression.

Simple Linear Regression :

As mentioned before, Simple Linear Regression is a regression process for finding the relationship between a dependent and an independent continuous quantitative variable. It is called Simple because only one independent / explanatory ( predictor ) variable is used to predict the dependent variable.

In "Multiple" Linear regression, two or more independent variables are used to predict the dependency of a Response variable.

Simple Linear Regression is represented by the formula Y = β0 + β1* X + Ɛ

Here, X is the independent ( explanatory ) variable and Y is the dependent variable; β1 is the coefficient of X and β0 is the intercept. That means a slight variation in the independent variable X will result in a corresponding variation ( of β1 per unit ) in the response variable Y.

Identifying / estimating the correlation and covariance between these two variables is how regression characterises the relationship between them.

Covariance shows the direction of the relationship between these variables

Correlation shows the strength of the relationship between these variables.

Covariance, as the name suggests, shows how the dependent variable varies when the independent variable is varied.

Let us consider an equation y = mx + c

y = mx + c is an equation of a straight line and the direction of variation on y can be measured by the variation that happens on x.

Let's consider a sample dataset of body temperature in Fahrenheit versus heart rate in beats per minute. ( For the sake of simplicity, I have considered only the male records, so the gender column from the dataset can be ignored. )


temperature ( degrees F )     heart rate ( beats per minute )
98.4                          84
98.4                          82
98                            78
97.9                          72
98.5                          68
98                            67
97.4                          78
98.8                          78
99.5                          75
97.1                          75
98                            71
98.9                          80
99                            75
98.6                          77
96.7                          71


Now, here Y = beats per minute & X = temperature.

To understand the covariance, let's take the mean of X & Y

Sample mean of Y : Ymean = 75.4 beats per minute.  With the formula provided here
Sample mean of X : Xmean = 98.21 degrees.
Next, we look at the deviation of each observation Yi & Xi from its mean and take the product.


Product = ( Yi - Ymean ) * ( Xi - Xmean )

Quadrant    Yi - Ymean    Xi - Xmean    ( Yi - Ymean ) ( Xi - Xmean )    Relationship
1           +             +             +                                +
2           +             -             -                                -
3           -             +             -                                -
4           -             -             +                                +

Plotting the graph with X & Y coordinates and understanding the linear relationship between X & Y as follows :

If the plotted points are in 1st and 3rd Quadrant  -> relationship is positive, i.e.,  if X increases, Y also increases
If the plotted points are in 2nd and 4th Quadrant -> relationship is negative, i.e., if X increases, Y decreases.

Hence, the covariance of the variables can be defined as :-

COV( Y, X ) = ∑ ( Yi - Ymean ) ( Xi - Xmean ) / ( n - 1 )

Understanding the covariance between Y & X will provide the direction of how the variables are related to each other.

Substituting the values of the above table in the Covariance formula :

COV(Y, X) =  ( (98.4 - 98.21)(84 - 75.4) + (98.4 - 98.21)(82 - 75.4) + (98 - 98.21)(78 - 75.4) + …… ) / ( 15 - 1 )
COV(Y, X) =  13.42 / 14
COV(Y, X) ≈  0.96

If Cov(Y,X) > 0 then we can understand that Y & X are positively related; if Cov(Y,X) < 0 then Y & X have a negative relationship.

With this, we can understand that the direction of the relationship between Y & X is positive.

Understanding the Covariance, we can derive the Correlation of the variables :-

Correlation can be defined as :

COR( Y, X ) = COV( Y, X ) / ( Sy * Sx )

Where Sy & Sx are the standard deviations of Y & X respectively.

Substituting the Covariance and the Standard Deviations ( Sx ≈ 0.75 for temperature, Sy ≈ 4.91 for beats per minute ) :-

COR(Y, X) =  0.9585 / ( 0.75 * 4.91 )
COR(Y, X) ≈  0.26

This indicates the strength of the linear relationship between the variables : a correlation of about 0.26 means a fairly weak positive association between temperature and heart rate.
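
The same numbers can be reproduced with R's built-in cov() and cor() functions; a minimal sketch using the data from the table above:

# Temperature (X) and heart rate (Y) from the table above
temp <- c(98.4, 98.4, 98, 97.9, 98.5, 98, 97.4, 98.8, 99.5, 97.1, 98, 98.9, 99, 98.6, 96.7)
bpm  <- c(84, 82, 78, 72, 68, 67, 78, 78, 75, 75, 71, 80, 75, 77, 71)
cov(bpm, temp)    # ~0.96 : positive direction
cor(bpm, temp)    # ~0.26 : weak positive strength, always between -1 and +1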

Standardizing the Data

Standardizing the data means that if the values of Y & X are on very different scales ( for example, ranging over thousands of units ) and cannot easily be compared or plotted on one graph, the numbers are rescaled based on two rules :
  1. Subtract the mean value from each observation.
  2. Divide each observation by the standard deviation  ( check here for standard deviation )

Standardizing Y & X in this way, we arrive at the above mentioned formula for Correlation.

Correlation can be defined as Covariance between the standardized variables or the ratio of the Covariance to the standard deviation of the two variables.
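
This equivalence is easy to verify in R: standardizing each variable with scale() and taking the covariance of the standardized values gives the same number as cor() ( a sketch, reusing the temperature / heart-rate data ):

# Correlation = covariance of the standardized variables
temp_z <- scale(c(98.4, 98.4, 98, 97.9, 98.5, 98, 97.4, 98.8, 99.5, 97.1, 98, 98.9, 99, 98.6, 96.7))
bpm_z  <- scale(c(84, 82, 78, 72, 68, 67, 78, 78, 75, 75, 71, 80, 75, 77, 71))
cov(bpm_z, temp_z)    # same value as cor(bpm, temp), ~0.26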

When we calculate the correlation, the result always lies between -1 & +1 :

-1 <= COR ( Y,X ) <=  +1

Hence, these properties make the correlation of Y & X ( the response and independent variables ) a useful quantity for measuring the direction and strength of the relationship between the variables.

The closer Cor(Y,X) is to +1 ( or -1 ), the stronger the relationship between Y & X, and vice versa.

