Want to Contribute to us or want to have 15k+ Audience read your Article ? Or Just want to make a strong Backlink?

Predicting Medical Costs using Multivariate Linear Regression in Python



Multivariate Linear Regression

Multivariate linear regression is a statistical method used to model the relationship between several independent variables and a single dependent variable. (With a single dependent variable, this setup is also commonly called multiple linear regression.) It is an extension of simple linear regression, which involves only one independent variable. The goal is to find the equation that best predicts the value of the dependent variable from the values of the independent variables. The equation has the form Y = a + b1X1 + b2X2 + … + bnXn, where Y is the dependent variable, X1, X2, …, Xn are the independent variables, a is the constant term (intercept), and b1, b2, …, bn are the coefficients that represent the relationship between each independent variable and the dependent variable.
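To make the equation concrete, here is a minimal sketch that recovers a, b1 and b2 from data by least squares. The data are synthetic and the "true" values (4, 2, -3) are made up for illustration; they have nothing to do with the insurance dataset.

```python
import numpy as np

# Synthetic data, made up for illustration: Y = 4 + 2*X1 - 3*X2 exactly.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
a_true, b_true = 4.0, np.array([2.0, -3.0])
Y = a_true + X @ b_true

# Prepend a column of ones so the constant term a is estimated
# together with the coefficients b1 and b2.
design = np.column_stack([np.ones(len(X)), X])
coef, *_ = np.linalg.lstsq(design, Y, rcond=None)
a_hat, b_hat = coef[0], coef[1:]

print("a  =", round(a_hat, 3))
print("b1 =", round(b_hat[0], 3), "b2 =", round(b_hat[1], 3))
```

Because the synthetic Y contains no noise, the estimates should match the made-up values up to floating-point error. On real data they would only approximate the underlying relationship.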



What are we doing here?

We try to predict the value of charges as accurately as we can.

Columns present in the dataset:

age: age of the primary beneficiary

sex: gender of the insurance contractor (female, male)

bmi: body mass index, an objective index of body weight (kg / m ^ 2) that relates weight to height and shows whether a weight is relatively high or low for a given height; ideally 18.5 to 24.9

children: number of children covered by health insurance / number of dependents

smoker: whether the beneficiary smokes (yes, no)

region: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)

charges: individual medical costs billed by health insurance
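As a quick illustration of the bmi column, the sketch below computes a body mass index from weight and height and checks it against the 18.5 to 24.9 range mentioned above. The function name `bmi` and the example person (70 kg, 1.75 m) are hypothetical and not part of the dataset.

```python
def bmi(weight_kg: float, height_m: float) -> float:
    """Body mass index: weight in kilograms divided by height in metres squared."""
    if height_m <= 0:
        raise ValueError("height_m must be positive")
    return weight_kg / height_m ** 2


# Hypothetical person: 70 kg and 1.75 m tall.
value = bmi(70.0, 1.75)
in_ideal_range = 18.5 <= value <= 24.9
print(round(value, 2), in_ideal_range)
```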



Importing necessary libraries

import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error


import matplotlib.pyplot as plt



Reading the data file

The code below uses the read_csv() function from the pandas library to read the medical insurance data from a CSV file and assigns the resulting dataframe to a variable named df.

df = pd.read_csv('/kaggle/input/insurance/insurance.csv')
df.head()

   age     sex     bmi  children  smoker     region      charges
0   19  female  27.900         0     yes  southwest  16884.92400
1   18    male  33.770         1      no  southeast   1725.55230
2   28    male  33.000         3      no  southeast   4449.46200
3   33    male  22.705         0      no  northwest  21984.47061
4   32    male  28.880         0      no  northwest   3866.85520
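Before encoding, it can help to check which columns are numeric and whether any values are missing. The sketch below does this on the five rows printed above, typed out as in-memory CSV text so it runs without the Kaggle file. The variable names `csv_text`, `sample` and `categorical_cols` are illustrative, and the header names are assumed to match the column names shown in this post.

```python
import io

import pandas as pd

# The five rows printed above, written out as CSV text.
csv_text = """age,sex,bmi,children,smoker,region,charges
19,female,27.9,0,yes,southwest,16884.924
18,male,33.77,1,no,southeast,1725.5523
28,male,33.0,3,no,southeast,4449.462
33,male,22.705,0,no,northwest,21984.47061
32,male,28.88,0,no,northwest,3866.8552
"""
sample = pd.read_csv(io.StringIO(csv_text))

print(sample.dtypes)         # which columns are numeric, which are text
print(sample.isna().sum())   # count of missing values per column

# Non-numeric columns are the candidates for one-hot encoding.
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```

The same three calls can be run on the full df once it is loaded.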



Feature engineering

Next, we apply one-hot encoding to the sex, region, and smoker columns of the dataframe and assign the resulting dataframe to a new variable, df_encoded.

# Apply one-hot encoding to the categorical columns
df_encoded = pd.get_dummies(df, columns=['sex', 'region', 'smoker'])
df_encoded

      age     bmi  children      charges  sex_female  sex_male  region_northeast  region_northwest  region_southeast  region_southwest  smoker_no  smoker_yes
0      19  27.900         0  16884.92400           1         0                 0                 0                 0                 1          0           1
1      18  33.770         1   1725.55230           0         1                 0                 0                 1                 0          1           0
2      28  33.000         3   4449.46200           0         1                 0                 0                 1                 0          1           0
3      33  22.705         0  21984.47061           0         1                 0                 1                 0                 0          1           0
4      32  28.880         0   3866.85520           0         1                 0                 1                 0                 0          1           0
...
1333   50  30.970         3  10600.54830           0         1                 0                 1                 0                 0          1           0
1334   18  31.920         0   2205.98080           1         0                 1                 0                 0                 0          1           0
1335   18  36.850         0   1629.83350           1         0                 0                 0                 1                 0          1           0
1336   21  25.800         0   2007.94500           1         0                 0                 0                 0                 1          1           0
1337   61  29.070         0  29141.36030           1         0                 0                 1                 0                 0          0           1

1338 rows × 12 columns
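In the encoded output above, each group of indicator columns always sums to 1 (for example sex_female + sex_male), so one column per group carries no extra information. One optional variation, not used in this post, is to pass drop_first=True to get_dummies. The sketch below compares both forms on a tiny frame; the frame `toy` is made up for illustration.

```python
import pandas as pd

# Tiny made-up frame with the same categorical columns as the dataset.
toy = pd.DataFrame({
    "age": [19, 18, 28, 33],
    "sex": ["female", "male", "male", "male"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest"],
})

# Full encoding: one indicator column per category.
full = pd.get_dummies(toy, columns=["sex", "region", "smoker"], dtype=int)

# Reduced encoding: the first category of each column is dropped.
reduced = pd.get_dummies(toy, columns=["sex", "region", "smoker"],
                         drop_first=True, dtype=int)

print(list(full.columns))
print(list(reduced.columns))
```

With ordinary least squares the predictions should be the same either way; the reduced form mainly makes individual coefficients easier to interpret, since each is then measured relative to the dropped category.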

df_encoded.columns

Index(['age', 'bmi', 'children', 'charges', 'sex_female', 'sex_male',
       'region_northeast', 'region_northwest', 'region_southeast',
       'region_southwest', 'smoker_no', 'smoker_yes'],
      dtype="object")



Feature selection

Next, the code selects the relevant columns of the encoded dataframe to use as the independent variables (X) and the dependent variable (y) for the linear regression model.

X = df_encoded[['age', 'bmi', 'children', 'sex_female', 'sex_male',
       'region_northeast', 'region_northwest', 'region_southeast',
       'region_southwest', 'smoker_no', 'smoker_yes']]
y = df_encoded['charges']
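Since every column except charges is used as a feature, an equivalent and shorter way to build X is to drop the target column instead of listing the features by name. This is a sketch on a small stand-in frame; `encoded`, `X_alt` and `y_alt` are illustrative names, and the frame holds only a subset of the columns and rows shown earlier.

```python
import pandas as pd

# Small stand-in for df_encoded: three rows and a subset of its columns.
encoded = pd.DataFrame({
    "age": [19, 18, 28],
    "bmi": [27.9, 33.77, 33.0],
    "charges": [16884.924, 1725.5523, 4449.462],
    "smoker_no": [0, 1, 1],
    "smoker_yes": [1, 0, 0],
})

target = "charges"
X_alt = encoded.drop(columns=[target])   # every column except the target
y_alt = encoded[target]

print(list(X_alt.columns))
```

Listing columns explicitly, as the post does, has the advantage of making the feature set visible at a glance.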



Preparing the model

The code below splits the data into training and test sets using the train_test_split function, fits the linear regression model on the training data, and records the mean squared error (MSE) of the model on both sets.

# split the data into training and test sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
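The split above is random, so the reported scores can change from run to run. If reproducible results are wanted, train_test_split accepts a random_state argument. The sketch below shows the effect on made-up arrays; the value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Made-up data: 10 rows, 2 features.
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# Same random_state -> same split on every call.
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_tr, X_te, y_tr, y_te = split_a
print(X_tr.shape, X_te.shape)
print(all(np.array_equal(a, b) for a, b in zip(split_a, split_b)))
```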

# create a linear regression model
model = LinearRegression()

# train the model on the training data and record the loss
train_loss = []
test_loss = []

# LinearRegression solves ordinary least squares directly, so refitting on
# the same data produces the same model (and the same loss) on every pass
for i in range(100):
    model.fit(X_train, y_train)
    train_loss.append(mean_squared_error(y_train, model.predict(X_train)))
    test_loss.append(mean_squared_error(y_test, model.predict(X_test)))
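Because LinearRegression computes the least-squares solution directly, refitting it in a loop is expected to give the same loss each time. If a loss curve that changes over iterations is wanted, one option is an iterative estimator such as scikit-learn's SGDRegressor, trained one pass at a time with partial_fit. This is a swapped-in technique, not what the post uses, and the data below are synthetic.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error

# Made-up data with standardised features (SGD is sensitive to feature scale).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = 3.0 + X_demo @ np.array([2.0, -1.0, 0.5]) + rng.normal(scale=0.1, size=200)

# 1) Ordinary least squares: refitting on the same data gives the same loss.
ols = LinearRegression()
ols_losses = []
for _ in range(5):
    ols.fit(X_demo, y_demo)
    ols_losses.append(mean_squared_error(y_demo, ols.predict(X_demo)))

# 2) Iterative alternative: one pass of stochastic gradient descent per step.
sgd = SGDRegressor(random_state=0)
sgd_losses = []
for _ in range(50):
    sgd.partial_fit(X_demo, y_demo)
    sgd_losses.append(mean_squared_error(y_demo, sgd.predict(X_demo)))

print("OLS losses:", np.round(ols_losses, 4))
print("SGD first/last loss:", round(sgd_losses[0], 4), round(sgd_losses[-1], 4))
```

On the insurance data, features such as age and bmi would likely need scaling (for example with StandardScaler) before SGD behaves well.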

# R^2 (coefficient of determination) on the training and test sets
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
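For a regressor, score returns the coefficient of determination R², not the MSE. The sketch below computes R² by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares and compares it with score. The data are synthetic and the names `reg`, `ss_res` and `ss_tot` are illustrative.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: linear signal plus noise.
rng = np.random.default_rng(1)
X_demo = rng.normal(size=(100, 2))
y_demo = 1.0 + X_demo @ np.array([2.0, -1.0]) + rng.normal(scale=0.5, size=100)

reg = LinearRegression().fit(X_demo, y_demo)
pred = reg.predict(X_demo)

ss_res = np.sum((y_demo - pred) ** 2)            # residual sum of squares
ss_tot = np.sum((y_demo - y_demo.mean()) ** 2)   # total sum of squares
r2_manual = 1.0 - ss_res / ss_tot

print(r2_manual, reg.score(X_demo, y_demo))
```

A value close to 1 means the model explains most of the variance in the target. On held-out data R² can be much lower, and even negative.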

# predict the values for the training and test sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
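MSE is expressed in squared units of the target, which is hard to read for charges. Two common companions are the root mean squared error (RMSE) and the mean absolute error (MAE), both in the target's own units. The sketch below uses made-up actual and predicted values; on the real model the same calls would take y_test and y_test_pred.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error

# Made-up actual and predicted charges, in the same units as the target.
y_true = np.array([1000.0, 2000.0, 3000.0, 4000.0])
y_pred = np.array([1100.0, 1900.0, 3200.0, 3800.0])

mse = mean_squared_error(y_true, y_pred)    # mean of squared errors
rmse = np.sqrt(mse)                         # back in the units of the target
mae = mean_absolute_error(y_true, y_pred)   # mean of absolute errors

print(mse, rmse, mae)
```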

# Plot predicted values against actual values
plt.scatter(y_train, y_train_pred, label='train')
plt.scatter(y_test, y_test_pred, label='test')
plt.legend()
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.title("Prediction line")
plt.show()

# Plot the residuals (predicted minus actual) against the predicted values
plt.scatter(y_train_pred, y_train_pred - y_train, label='train')
plt.scatter(y_test_pred, y_test_pred - y_test, label='test')
plt.legend()
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()
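A numeric check can complement the residual plot. For ordinary least squares with an intercept, the training residuals average to zero up to floating-point error, so the spread (standard deviation) and any visible pattern are what carry information. The sketch below illustrates this on synthetic data, using the same predicted-minus-actual convention as the plot.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

# Made-up data: linear signal plus unit-variance noise.
rng = np.random.default_rng(2)
X_demo = rng.normal(size=(100, 2))
y_demo = 5.0 + X_demo @ np.array([1.5, -2.0]) + rng.normal(size=100)

reg = LinearRegression().fit(X_demo, y_demo)
residuals = reg.predict(X_demo) - y_demo   # predicted minus actual

print("mean residual:", residuals.mean())
print("residual std: ", residuals.std())
```

Test-set residuals are not guaranteed to average to zero, so a clearly non-zero mean there can hint at a shift between the training and test data.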

Figure: residual plot (residuals against predicted values, train and test)

# Plot the loss
plt.plot(train_loss, label='train')
plt.plot(test_loss, label='test')
plt.legend()
plt.xlabel("Iterations")
plt.ylabel("Loss")
plt.title("Loss Plot")
plt.show()

Figure: loss plot (train and test MSE per iteration)

Overall, this code performs a linear regression analysis on an insurance dataset. It starts by importing the required libraries, then reads the data from a CSV file using pandas, applies one-hot encoding to the categorical columns, selects the columns to use in the model, and finally splits the data into training and test sets and fits a linear regression model to the training data. The MSE on the training and test sets is recorded and plotted as a measure of performance, alongside the R² scores and the prediction and residual plots.
