Multiple linear regression is a statistical technique that predicts the value of one dependent variable from the values of two or more independent variables.
Assumptions of Multiple Linear Regression
- Linearity
- Homoscedasticity
- Multivariate normality
- Independence of errors
- Lack of multicollinearity
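One common way to check the last assumption, lack of multicollinearity, is the variance inflation factor (VIF). The sketch below is a minimal illustration using NumPy only; the function name `vif` and the synthetic data are assumptions made for this example and are not part of the original material.

```python
import numpy as np

def vif(X):
    """Return the variance inflation factor of each column of X.

    VIF_j = 1 / (1 - R_j^2), where R_j^2 comes from regressing
    column j on all the other columns plus an intercept.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residuals = target - others @ coef
        ss_res = residuals @ residuals
        ss_tot = ((target - target.mean()) ** 2).sum()
        r2 = 1.0 - ss_res / ss_tot
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

# Synthetic data (hypothetical): x3 is nearly a copy of x1
rng = np.random.default_rng(0)
x1 = rng.normal(size=200)
x2 = rng.normal(size=200)
x3 = x1 + 0.05 * rng.normal(size=200)

print(vif(np.column_stack([x1, x2])))      # unrelated columns: values near 1
print(vif(np.column_stack([x1, x2, x3])))  # x1 and x3 get large values
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity, although the exact cutoff is a matter of judgment.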
y = b0 + b1*x1 + b2*x2 + ... + bn*xn + d1
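As a small worked illustration of the equation, the sketch below evaluates y for a single observation with three predictors. The coefficient and input values are made up for the example.

```python
import numpy as np

b0 = 2.0                        # intercept (hypothetical value)
b = np.array([0.5, -1.0, 3.0])  # b1, b2, b3 (hypothetical values)
x = np.array([4.0, 2.0, 1.0])   # x1, x2, x3 for one observation

# y = b0 + b1*x1 + b2*x2 + b3*x3, written as a dot product
y = b0 + b @ x
print(y)  # 2.0 + 0.5*4 - 1.0*2 + 3.0*1 = 5.0
```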
When the data contains a categorical column, we must be aware of the dummy variable trap: if we create dummy variables for a column, we should leave one of them out. For example, if a column such as state name contains 3 distinct values, we should create only two dummy variables; a row where both are 0 then belongs to the omitted category. If we do not leave one out, the dummy variables become perfectly collinear with the constant term, and we fall into the dummy variable trap.
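A minimal sketch of avoiding the trap with pandas is shown here. The column name `State` and its three values are hypothetical example data. Note that `pd.get_dummies(..., drop_first=True)` drops the first category in sorted order; which category is omitted does not matter, as long as one is.

```python
import pandas as pd

# Hypothetical categorical column with 3 distinct values
df = pd.DataFrame({"State": ["New York", "California", "Florida", "California"]})

# drop_first=True keeps only 2 dummy columns for 3 categories;
# the dropped category ("California") is encoded as all zeros
dummies = pd.get_dummies(df["State"], drop_first=True)
print(dummies)
```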
Building a Model
Forward Selection
Step 1: Select the significance level (SL) to enter the model, e.g. SL = 0.05
Step 2: Fit all simple regression models y ~ xn and select the one with the lowest P-value
Step 3: Keep this variable and fit all possible models with one extra predictor added to the one(s) you already have
Step 4: Consider the predictor with the lowest P-value. If P is below the significance level, go back to Step 3; otherwise stop and keep the previous model
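The forward selection steps can be sketched roughly as follows. This is an illustration under assumptions, not a reference implementation: the helper `ols_pvalues`, the function `forward_selection`, the parameter `sl_enter`, and the synthetic data are all hypothetical names and values introduced here, and the stopping rule (stop once the best candidate's P-value is no longer below the significance level) follows the usual textbook form of the method.

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Two-sided t-test p-values for OLS coefficients (intercept first)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dof = n - A.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(A.T @ A)
    t = coef / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t), dof)

def forward_selection(X, y, sl_enter=0.05):
    """Return column indices chosen by forward selection."""
    selected = []
    remaining = list(range(X.shape[1]))
    while remaining:
        # p-value of each candidate when added to the current model
        candidates = []
        for j in remaining:
            p = ols_pvalues(X[:, selected + [j]], y)[-1]
            candidates.append((p, j))
        best_p, best_j = min(candidates)
        if best_p >= sl_enter:
            break  # no candidate is significant: keep the previous model
        selected.append(best_j)
        remaining.remove(best_j)
    return selected

# Synthetic data (hypothetical): only columns 0 and 2 influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=200)

selected = forward_selection(X, y)
print(selected)
```

On this synthetic data the two informative columns should be picked up; an uninformative column can occasionally slip in by chance, which is expected at a 5% significance level.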
Bidirectional Elimination
Step 1: Select the significance levels to enter and to stay in the model, e.g. SLENTER = 0.05, SLSTAY = 0.05
Step 2: Perform the next step of forward selection (new variables must have P < SLENTER to enter)
Step 3: Perform all steps of backward elimination (old variables must have P < SLSTAY to stay)
Step 4: Repeat Steps 2 and 3 until no new variables can enter and no old variables can exit; the model is then ready
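The bidirectional procedure can be sketched along the same lines. Again this is only an illustration under assumptions: `ols_pvalues`, `stepwise_selection`, `sl_enter`, `sl_stay`, `max_iter`, and the synthetic data are hypothetical names and values introduced here. The `max_iter` guard is an addition of this sketch to guarantee termination.

```python
import numpy as np
from scipy import stats

def ols_pvalues(X, y):
    """Two-sided t-test p-values for OLS coefficients (intercept first)."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ coef
    dof = n - A.shape[1]
    sigma2 = resid @ resid / dof
    cov = sigma2 * np.linalg.inv(A.T @ A)
    t = coef / np.sqrt(np.diag(cov))
    return 2 * stats.t.sf(np.abs(t), dof)

def stepwise_selection(X, y, sl_enter=0.05, sl_stay=0.05, max_iter=100):
    """Return column indices chosen by bidirectional elimination."""
    selected = []
    for _ in range(max_iter):
        changed = False
        # Forward step: add the best candidate if its p-value < sl_enter
        remaining = [j for j in range(X.shape[1]) if j not in selected]
        if remaining:
            p_new = {j: ols_pvalues(X[:, selected + [j]], y)[-1]
                     for j in remaining}
            best = min(p_new, key=p_new.get)
            if p_new[best] < sl_enter:
                selected.append(best)
                changed = True
        # Backward step: drop variables whose p-value is >= sl_stay
        while selected:
            pvals = ols_pvalues(X[:, selected], y)[1:]  # skip the intercept
            worst = int(np.argmax(pvals))
            if pvals[worst] < sl_stay:
                break
            selected.pop(worst)
            changed = True
        if not changed:
            break  # nothing entered and nothing exited: model is ready
    return selected

# Synthetic data (hypothetical): only columns 0 and 2 influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=200)

selected = stepwise_selection(X, y)
print(selected)
```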
All Possible Models
Step 1: Select a criterion of goodness of fit (e.g. the Akaike information criterion, AIC = -2(log-likelihood) + 2K, where K is the number of estimated parameters)
Step 2: Construct all possible regression models: 2^n - 1 total combinations
Step 3: Select the model with the best criterion value. This approach is not suitable when there are many columns, because the number of models grows exponentially and consumes a lot of resources.
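The all-possible-models approach can be sketched as below, using AIC as the criterion. This is a minimal illustration under assumptions: `aic_ols` and `all_possible_models` are hypothetical helper names, the log-likelihood is that of a Gaussian linear model, K counts the coefficients including the intercept (conventions differ on whether the error variance is also counted, which shifts every model by the same constant), and the data is synthetic.

```python
import itertools
import numpy as np

def aic_ols(X, y):
    """AIC = -2(log-likelihood) + 2K for an OLS fit with Gaussian errors."""
    n = len(y)
    A = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    rss = np.sum((y - A @ coef) ** 2)
    log_lik = -0.5 * n * (np.log(2 * np.pi) + np.log(rss / n) + 1)
    k = A.shape[1]  # coefficients including the intercept
    return -2 * log_lik + 2 * k

def all_possible_models(X, y):
    """Return (AIC, subset) for every non-empty subset of columns."""
    n_features = X.shape[1]
    results = []
    for size in range(1, n_features + 1):
        for subset in itertools.combinations(range(n_features), size):
            results.append((aic_ols(X[:, list(subset)], y), subset))
    return results

# Synthetic data (hypothetical): only columns 0 and 2 influence y
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = 3 * X[:, 0] - 2 * X[:, 2] + rng.normal(size=200)

results = all_possible_models(X, y)
best_aic, best_subset = min(results)  # lower AIC is better
print(len(results))   # 2**4 - 1 = 15 models
print(best_subset)
```

With only 4 columns there are 15 models to fit; with 20 columns there would already be more than a million, which is why this method does not scale.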
5 Methods of Building Models
- All-in
- Backward Elimination
- Forward Selection
- Bidirectional Elimination
- Score comparison
Implementation of Linear Regression
# Importing the libraries
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd
# Importing the dataset
dataset = pd.read_csv('50_Startups.csv')
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
print(X)
# Encoding categorical data
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder
ct = ColumnTransformer(transformers=[('encoder', OneHotEncoder(), [3])], remainder='passthrough')
X = np.array(ct.fit_transform(X))
print(X)
# Splitting the dataset into the Training set and Test set
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 0)
# Training the Multiple Linear Regression model on the Training set
from sklearn.linear_model import LinearRegression
regressor = LinearRegression()
regressor.fit(X_train, y_train)
# Predicting the Test set results
y_pred = regressor.predict(X_test)
np.set_printoptions(precision=2)
print(np.concatenate((y_pred.reshape(len(y_pred),1), y_test.reshape(len(y_test),1)),1))
Resources: from the internet.