Multivariate Linear Regression
Multivariate linear regression is a statistical method used to model the relationship between multiple independent variables and a single dependent variable. It is an extension of simple linear regression, which involves only one independent variable. In multivariate linear regression, the goal is to find the equation that best predicts the value of the dependent variable based on the values of the independent variables. The equation has the form Y = a + b1X1 + b2X2 + … + bnXn, where Y is the dependent variable, X1, X2, …, Xn are the independent variables, a is the constant term, and b1, b2, …, bn are the coefficients that represent the relationship between each independent variable and the dependent variable.
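To make the equation concrete, here is a minimal sketch that evaluates it with NumPy. The coefficient values and the input rows are made up purely for illustration; they are not fitted to the insurance data.

```python
import numpy as np

# Hypothetical coefficients, chosen only for illustration (not fitted to any data).
a = 2.0                          # constant term (intercept)
b = np.array([0.5, -1.0, 3.0])   # b1, b2, b3

# Two observations, each with three independent variables X1, X2, X3.
X = np.array([
    [1.0, 2.0, 3.0],
    [4.0, 0.0, 1.0],
])

# Y = a + b1*X1 + b2*X2 + b3*X3, evaluated for every row at once.
Y = a + X @ b
print(Y)  # 9.5 for the first row, 7.0 for the second
```

Fitting a regression model amounts to choosing a and the b values so that the computed Y is as close as possible to the observed values.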
What are we doing here?
We try to accurately predict the insurance charges.
Columns present in the dataset:
age: age of the primary beneficiary
sex: insurance contractor gender (female, male)
bmi: body mass index, which indicates whether body weight is relatively high or low relative to height; an objective index of body weight (kg / m^2) using the ratio of weight to height squared, ideally 18.5 to 24.9
children: number of children covered by health insurance / number of dependents
smoker: whether the beneficiary smokes (yes, no)
region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)
charges: individual medical costs billed by health insurance
Importing necessary libraries
import pandas as pd
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error
import matplotlib.pyplot as plt
Reading files
The code below uses the read_csv() function from the pandas library to read the medical insurance data from a CSV file and assigns the resulting dataframe to a variable named df.
df = pd.read_csv('/kaggle/input/insurance/insurance.csv')
df.head()
| | age | sex | bmi | children | smoker | region | charges |
---|---|---|---|---|---|---|---|
0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
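Before encoding, it can help to check for missing values and to see which columns are non-numeric. The sketch below re-types the five rows shown above by hand so that it runs without the CSV file; the variable names sample, missing, and categorical_cols are introduced here for illustration only.

```python
import pandas as pd

# The five rows shown above, re-typed by hand so the snippet runs without the CSV file.
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.9, 33.77, 33.0, 22.705, 28.88],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.924, 1725.5523, 4449.462, 21984.47061, 3866.8552],
})

# Count missing values per column; scikit-learn's LinearRegression rejects NaNs.
missing = sample.isna().sum()

# Non-numeric columns need to be encoded before they can be used as features.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()

print(missing)
print(categorical_cols)
```

On the full dataframe the same two calls would be applied to df instead of sample.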
Feature engineering
Next, we apply one-hot encoding to the sex, region, and smoker columns of the dataframe and assign the resulting dataframe to a new variable, df_encoded.
# Apply one-hot encoding to the categorical columns
df_encoded = pd.get_dummies(df, columns=['sex', 'region', 'smoker'])
df_encoded
| | age | bmi | children | charges | sex_female | sex_male | region_northeast | region_northwest | region_southeast | region_southwest | smoker_no | smoker_yes |
---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 19 | 27.900 | 0 | 16884.92400 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
1 | 18 | 33.770 | 1 | 1725.55230 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
2 | 28 | 33.000 | 3 | 4449.46200 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
3 | 33 | 22.705 | 0 | 21984.47061 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
4 | 32 | 28.880 | 0 | 3866.85520 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
… | … | … | … | … | … | … | … | … | … | … | … | … |
1333 | 50 | 30.970 | 3 | 10600.54830 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
1334 | 18 | 31.920 | 0 | 2205.98080 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
1335 | 18 | 36.850 | 0 | 1629.83350 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
1336 | 21 | 25.800 | 0 | 2007.94500 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
1337 | 61 | 29.070 | 0 | 29141.36030 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |
1338 rows × 12 columns
df_encoded.columns
Index(['age', 'bmi', 'children', 'charges', 'sex_female', 'sex_male',
'region_northeast', 'region_northwest', 'region_southeast',
'region_southwest', 'smoker_no', 'smoker_yes'],
dtype="object")
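With full one-hot encoding, columns such as sex_female and sex_male always sum to 1, so they are perfectly collinear with the intercept. LinearRegression still fits in that situation, but the individual coefficients become harder to interpret. A common alternative, sketched here on a small made-up dataframe (toy is a hypothetical name), is to drop one level per category with drop_first=True.

```python
import pandas as pd

# Small made-up dataframe, used only to show the effect of drop_first.
toy = pd.DataFrame({
    "sex": ["female", "male", "male"],
    "smoker": ["yes", "no", "no"],
    "bmi": [27.9, 33.77, 33.0],
})

# drop_first=True removes the first level of each encoded column (levels are sorted),
# and dtype=int yields 0/1 integers instead of booleans.
encoded = pd.get_dummies(toy, columns=["sex", "smoker"], drop_first=True, dtype=int)

print(encoded.columns.tolist())
print(encoded)
```

Here sex_female and smoker_no are dropped, so each remaining dummy coefficient would be read relative to that baseline level.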
Feature selection
Next, the code selects the relevant columns of the encoded dataframe to use as the independent variables (X) and the dependent variable (y) for the linear regression model.
X = df_encoded[['age', 'bmi', 'children', 'sex_female', 'sex_male',
'region_northeast', 'region_northwest', 'region_southeast',
'region_southwest', 'smoker_no', 'smoker_yes']]
y = df_encoded['charges']
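Since every column except charges is used as a feature, the same selection can also be written by dropping the target instead of listing each feature by name. This is only an equivalent shorthand, shown on a tiny made-up dataframe; toy_encoded, X_alt, and y_alt are hypothetical names.

```python
import pandas as pd

# Tiny stand-in for the encoded dataframe (values copied from the first two rows).
toy_encoded = pd.DataFrame({
    "age": [19, 18],
    "bmi": [27.9, 33.77],
    "charges": [16884.924, 1725.5523],
    "smoker_yes": [1, 0],
})

# Everything except the target becomes a feature.
X_alt = toy_encoded.drop(columns=["charges"])
y_alt = toy_encoded["charges"]

print(X_alt.columns.tolist())
```

The explicit list used above is easier to audit, while the drop form keeps working if more encoded columns are added later.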
Preparing the model
The code below splits the data into training and testing sets using the train_test_split function, fits the linear regression model on the training data, and computes the MSE of the model.
# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
# create a linear regression model
model = LinearRegression()
# record the training and test loss (MSE) after each fit
train_loss = []
test_loss = []
# train the model; LinearRegression solves least squares directly,
# so every pass through this loop gives the same fit and the same loss
for i in range(100):
    model.fit(X_train, y_train)
    train_loss.append(mean_squared_error(y_train, model.predict(X_train)))
    test_loss.append(mean_squared_error(y_test, model.predict(X_test)))
# R^2 scores on the training and test sets
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
# print the MSE of the model
print("Train MSE:", train_loss[-1])
print("Test MSE:", test_loss[-1])
# predict the values for the training and test sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
# Plot predicted values against actual values
plt.scatter(y_train, y_train_pred, label='train')
plt.scatter(y_test, y_test_pred, label='test')
plt.legend()
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.title("Prediction line")
plt.show()
# Plot the residuals (predicted minus actual) against the predicted values
plt.scatter(y_train_pred, y_train_pred - y_train, label='train')
plt.scatter(y_test_pred, y_test_pred - y_test, label='test')
plt.legend()
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()
# Plot the loss recorded at each iteration
plt.plot(train_loss, label='train')
plt.plot(test_loss, label='test')
plt.legend()
plt.xlabel("Iterations")
plt.ylabel("Loss")
plt.title("Loss Plot")
plt.show()
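One thing worth noting about the loss plot: LinearRegression solves the least-squares problem directly rather than iterating, so refitting on the same data gives the same loss every time and the curves come out flat. The sketch below illustrates this on synthetic stand-in data (an assumption, not the insurance dataset) and also reports RMSE and R², which are often easier to read than raw MSE. All names ending in _demo, _tr, or _te are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption for this sketch, not the insurance dataset).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# Refit the same model several times and record the test MSE each time.
demo_model = LinearRegression()
losses = []
for _ in range(5):
    demo_model.fit(X_tr, y_tr)
    losses.append(mean_squared_error(y_te, demo_model.predict(X_te)))

# Every refit reaches the same solution, so the recorded losses are identical.
print(np.allclose(losses, losses[0]))

rmse = np.sqrt(losses[-1])                      # same units as the target
r2 = r2_score(y_te, demo_model.predict(X_te))   # the quantity model.score() returns
print(rmse, r2)
```

A loss curve that changes over iterations would require an iterative estimator, for example one trained by gradient descent; with ordinary least squares a single fit is enough.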
Overall, this code performs a linear regression analysis on an insurance dataset. It starts by importing the required libraries, then reads the data from a CSV file using pandas, applies one-hot encoding to the categorical columns, selects the relevant columns to use in the model, and finally splits the data into training and testing sets and fits a linear regression model to the training data. The MSE of the model on the training and test sets serves as the measure of performance.