Multiple Linear Regression

Aug 9, 2021

Multiple linear regression is a statistical technique that predicts the value of one dependent variable from the values of two or more independent variables.

Assumptions of Multiple Linear Regression

Linearity
Homoscedasticity
Multivariate normality
Independence of errors
Lack of multicollinearity

y = b0 + b1*x1 + b2*x2 + … + bn*xn + d1
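To make the equation concrete, here is a minimal sketch that estimates b0, b1 and b2 by least squares and then evaluates the equation for a new observation. The toy data and the `predict` helper are made up for illustration and are not part of the startup dataset used later.

```python
import numpy as np

# Hypothetical toy data: 5 observations, 2 predictors (x1, x2).
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0],
              [5.0, 6.0]])
# The target is generated exactly from y = 1 + 2*x1 + 3*x2 (no noise).
y = 1.0 + 2.0 * X[:, 0] + 3.0 * X[:, 1]

# Prepend a column of ones so the intercept b0 is estimated as well.
X_design = np.column_stack([np.ones(len(X)), X])

# Least-squares estimate of [b0, b1, b2].
b, *_ = np.linalg.lstsq(X_design, y, rcond=None)

def predict(x_new, coeffs):
    """Evaluate y = b0 + b1*x1 + ... + bn*xn for one observation."""
    return coeffs[0] + np.dot(coeffs[1:], x_new)

print(np.round(b, 2))
print(round(float(predict(np.array([2.0, 2.0]), b)), 2))
```

Because the toy target contains no noise, the estimated coefficients recover the values used to generate it.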

With categorical data we must be aware of the dummy variable trap: when we create dummy variables for a categorical column, we should leave one of them out. For example, if a column such as state name contains 3 distinct values, we should create only two dummy variables for it. When both are 0, we can tell that the observation belongs to the omitted (last) category. If we do not leave one out, the dummy variables always add up to 1 and duplicate the intercept, and we fall into the dummy variable trap.
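The trap can be seen numerically. The sketch below, on a made-up state column, builds the design matrix with all three dummies and with one dummy dropped, then compares the number of columns with the matrix rank. The state names and variable names are assumptions chosen for illustration.

```python
import numpy as np

# Hypothetical categorical column with three state names.
states = np.array(["New York", "California", "Florida",
                   "California", "New York", "Florida"])
categories = np.unique(states)  # sorted unique names

# One 0/1 dummy column per category.
dummies = (states[:, None] == categories[None, :]).astype(float)
intercept = np.ones((len(states), 1))

# Intercept plus all three dummies: the dummies sum to the intercept
# column, so one column is redundant.
X_trap = np.hstack([intercept, dummies])

# Intercept plus two dummies (last category omitted): no redundancy.
X_ok = np.hstack([intercept, dummies[:, :-1]])

print(X_trap.shape[1], np.linalg.matrix_rank(X_trap))
print(X_ok.shape[1], np.linalg.matrix_rank(X_ok))
```

When the rank is smaller than the number of columns, the coefficients are not uniquely determined. In scikit-learn, `OneHotEncoder(drop='first')` is one way to leave a category out during encoding.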

Building a Model

Forward Selection

Step1: Select a significance level (SL) to enter the model, e.g. SL = 0.05

Step2: Fit all simple regression models y ~ xn and select the one with the lowest P-value

Step3: Keep this variable and fit all possible models with one extra predictor added to the one you already have

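Step4: Consider the predictor with the lowest P-value. If P < SL (the significance level chosen in Step1), go back to Step3; otherwise stop and keep the previous model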
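The forward selection steps above can be sketched with NumPy alone. This is only an illustration under stated assumptions: the `p_values` helper is hypothetical and uses a normal approximation to the t distribution (reasonable only for fairly large samples), and the data is synthetic, with only columns 0 and 2 actually driving y.

```python
import math
import numpy as np

def p_values(X, y):
    """Two-sided p-values of OLS coefficients (index 0 is the intercept).
    Assumption: normal approximation instead of the exact t distribution."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    beta = XtX_inv @ Xd.T @ y
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - Xd.shape[1])
    se = np.sqrt(sigma2 * np.diag(XtX_inv))
    return np.array([math.erfc(abs(t) / math.sqrt(2)) for t in beta / se])

def forward_selection(X, y, sl=0.05):
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining:
        # Fit one model per candidate; the candidate is the last column,
        # so its p-value is the last entry.
        best_p, best_j = min(
            (p_values(X[:, selected + [j]], y)[-1], j) for j in remaining
        )
        if best_p >= sl:
            break  # no candidate is significant: keep the previous model
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Synthetic data: y depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)

selected = forward_selection(X, y, sl=0.05)
print(selected)
```

For real work, a statistics library that reports exact p-values would be a better choice than this approximation.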

Bidirectional Elimination

Step1: Select significance levels to enter and to stay in the model, e.g. SLENTER = 0.05, SLSTAY = 0.05

Step2: Perform the next step of forward selection (new variables must have P < SLENTER to enter)

Step3: Perform all steps of backward elimination (old variables must have P < SLSTAY to stay)

Step4: Repeat Step2 and Step3 until no new variables can enter and no old variables can exit; then our model is ready
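A rough sketch of the bidirectional procedure follows. It alternates one forward step with a full round of backward elimination until nothing changes. The `p_values` helper (normal approximation), the `max_rounds` safety limit and the synthetic data are all assumptions made for this illustration.

```python
import math
import numpy as np

def p_values(X, y):
    """Two-sided p-values of OLS coefficients (index 0 is the intercept).
    Assumption: normal approximation instead of the exact t distribution."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    XtX_inv = np.linalg.inv(Xd.T @ Xd)
    beta = XtX_inv @ Xd.T @ y
    resid = y - Xd @ beta
    sigma2 = resid @ resid / (n - Xd.shape[1])
    se = np.sqrt(sigma2 * np.diag(XtX_inv))
    return np.array([math.erfc(abs(t) / math.sqrt(2)) for t in beta / se])

def bidirectional_elimination(X, y, sl_enter=0.05, sl_stay=0.05, max_rounds=100):
    selected = []
    for _ in range(max_rounds):  # hypothetical guard against cycling
        changed = False
        # Forward step: add the best candidate if P < sl_enter.
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        if remaining:
            best_p, best_j = min(
                (p_values(X[:, selected + [j]], y)[-1], j) for j in remaining
            )
            if best_p < sl_enter:
                selected.append(best_j)
                changed = True
        # Backward steps: drop the worst variable while its P >= sl_stay.
        while selected:
            pvals = p_values(X[:, selected], y)[1:]  # skip the intercept
            worst = int(np.argmax(pvals))
            if pvals[worst] < sl_stay:
                break
            del selected[worst]
            changed = True
        if not changed:
            break  # nothing entered and nothing left: model is ready
    return selected

# Synthetic data: y depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)

selected = bidirectional_elimination(X, y)
print(selected)
```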

All Possible Models

Step1: Select a criterion of goodness of fit (e.g. the Akaike information criterion, AIC = -2(log-likelihood) + 2K, where K is the number of estimated parameters)

Step2: Construct all possible regression models: 2^n - 1 total combinations for n predictors

Step3: Select the model with the best criterion value. That's all, but with many columns this approach is not suitable, because the number of models grows exponentially and consumes a lot of resources.
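As a sketch of this approach, the code below fits every non-empty subset of predictors and keeps the one with the lowest AIC. It assumes Gaussian errors for the log-likelihood and counts only the intercept and slopes in K; other conventions also count the error variance. The helper names and synthetic data are made up for illustration.

```python
import itertools
import math
import numpy as np

def aic(X, y):
    """AIC = -2*log-likelihood + 2K for an OLS fit with Gaussian errors.
    Assumption: K counts the intercept and slopes only."""
    n = len(y)
    Xd = np.column_stack([np.ones(n), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    rss = float(np.sum((y - Xd @ beta) ** 2))
    log_lik = -0.5 * n * (math.log(2 * math.pi) + math.log(rss / n) + 1)
    return -2 * log_lik + 2 * Xd.shape[1]

def best_subset(X, y):
    n_features = X.shape[1]
    # Every non-empty combination of columns: 2^n - 1 models in total.
    subsets = [cols
               for r in range(1, n_features + 1)
               for cols in itertools.combinations(range(n_features), r)]
    scores = {cols: aic(X[:, list(cols)], y) for cols in subsets}
    return min(scores, key=scores.get), len(subsets)

# Synthetic data: y depends on columns 0 and 2 only.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3.0 * X[:, 0] - 2.0 * X[:, 2] + rng.normal(size=200)

best, n_models = best_subset(X, y)
print(best, n_models)
```

With 4 predictors this means 15 models; with 20 predictors it would already be more than a million, which is why the method does not scale.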

5 Methods of Building Models

All-in
Backward Elimination
Forward Selection
Bidirectional Elimination
Score comparison

Implementation of Multiple Linear Regression

# Importing the libraries

import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

# Importing the dataset

dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
print(X)

# Encoding categorical data

from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)

# Splitting the dataset into the Training set and Test set

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)

# Training the Multiple Linear Regression model on the Training set

from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)

# Predicting the Test set results

y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))

Resources: from the internet.