# Predicting Medical Costs using Multivariate Linear Regression in Python

## Multivariate Linear Regression

Multivariate linear regression is a statistical method used to model the relationship between several independent variables and a single dependent variable. It is an extension of simple linear regression, which involves only one independent variable. In multivariate linear regression, the goal is to find the equation that best predicts the value of the dependent variable from the values of the independent variables. The equation has the form Y = a + b1X1 + b2X2 + … + bnXn, where Y is the dependent variable, X1, X2, …, Xn are the independent variables, a is the constant term (the intercept), and b1, b2, …, bn are the coefficients that represent the relationship between each independent variable and the dependent variable.
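As a minimal sketch of what "finding the equation" means, the example below generates synthetic data from known values of a, b1, and b2 and then recovers them with ordinary least squares in NumPy. The numbers are made up for illustration and have nothing to do with the insurance dataset.

```python
import numpy as np

# Synthetic data (illustrative values only): Y = 4 + 2*X1 - 3*X2 + small noise
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
true_a, true_b = 4.0, np.array([2.0, -3.0])
Y = true_a + X @ true_b + rng.normal(scale=0.1, size=200)

# Prepend a column of ones so the constant term a is estimated with the coefficients
A = np.column_stack([np.ones(len(X)), X])
params, *_ = np.linalg.lstsq(A, Y, rcond=None)
a, b1, b2 = params

print(f"a = {a:.2f}, b1 = {b1:.2f}, b2 = {b2:.2f}")
```

The estimates should land close to the values used to generate the data, which is the same idea a regression library applies to real columns such as `age` or `bmi`.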

## What are we doing here?

We try to predict the medical `charges` as accurately as possible.

Columns present in the dataset:

`age`: age of the primary beneficiary

`sex`: insurance contractor gender (female, male)

`bmi`: body mass index, which indicates whether body weight is relatively high or low for a given height. It is an objective index of body weight (kg / m^2) computed as weight divided by height squared; the ideal range is 18.5 to 24.9.

`children`: number of children covered by health insurance / number of dependents

`smoker`: whether the beneficiary smokes (yes, no)

`region`: the beneficiary's residential area in the US (northeast, southeast, southwest, northwest)

`charges`: individual medical costs billed by health insurance

## Importing Necessary Libraries

```python
import pandas as pd
import numpy as np

from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

import matplotlib.pyplot as plt
```

## Reading the data

The code below uses the `read_csv()` function from the pandas library to read the medical insurance data from a CSV file and assigns the resulting dataframe to a variable named `df`.

```python
df = pd.read_csv('/kaggle/input/insurance/insurance.csv')
```

| | age | sex | bmi | children | smoker | region | charges |
|---|---|---|---|---|---|---|---|
| 0 | 19 | female | 27.900 | 0 | yes | southwest | 16884.92400 |
| 1 | 18 | male | 33.770 | 1 | no | southeast | 1725.55230 |
| 2 | 28 | male | 33.000 | 3 | no | southeast | 4449.46200 |
| 3 | 33 | male | 22.705 | 0 | no | northwest | 21984.47061 |
| 4 | 32 | male | 28.880 | 0 | no | northwest | 3866.85520 |
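Before encoding anything, it can help to check which columns are numeric and which are text. The sketch below rebuilds the five rows shown above by hand so that it runs without the CSV file; on the real data the same calls would be made on `df`.

```python
import pandas as pd

# The five rows shown above, rebuilt by hand so the snippet runs without the CSV file
sample = pd.DataFrame({
    "age": [19, 18, 28, 33, 32],
    "sex": ["female", "male", "male", "male", "male"],
    "bmi": [27.900, 33.770, 33.000, 22.705, 28.880],
    "children": [0, 1, 3, 0, 0],
    "smoker": ["yes", "no", "no", "no", "no"],
    "region": ["southwest", "southeast", "southeast", "northwest", "northwest"],
    "charges": [16884.92400, 1725.55230, 4449.46200, 21984.47061, 3866.85520],
})

print(sample.dtypes)            # which columns are numeric and which are text
missing = sample.isna().sum()   # missing values per column
print(missing)

# Text columns are the ones that need encoding before they can enter a regression
categorical_cols = sample.select_dtypes(exclude="number").columns.tolist()
print(categorical_cols)
```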

## Feature engineering

Next, we apply one-hot encoding to the `sex`, `region`, and `smoker` columns of the dataframe and assign the resulting dataframe to a new variable, `df_encoded`.

```python
# Apply one-hot encoding to the "sex", "region", and "smoker" columns
df_encoded = pd.get_dummies(df, columns=['sex', 'region', 'smoker'])
df_encoded
```

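| | age | bmi | children | charges | sex_female | sex_male | region_northeast | region_northwest | region_southeast | region_southwest | smoker_no | smoker_yes |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 19 | 27.900 | 0 | 16884.92400 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 |
| 1 | 18 | 33.770 | 1 | 1725.55230 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 2 | 28 | 33.000 | 3 | 4449.46200 | 0 | 1 | 0 | 0 | 1 | 0 | 1 | 0 |
| 3 | 33 | 22.705 | 0 | 21984.47061 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 4 | 32 | 28.880 | 0 | 3866.85520 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| … | … | … | … | … | … | … | … | … | … | … | … | … |
| 1333 | 50 | 30.970 | 3 | 10600.54830 | 0 | 1 | 0 | 1 | 0 | 0 | 1 | 0 |
| 1334 | 18 | 31.920 | 0 | 2205.98080 | 1 | 0 | 1 | 0 | 0 | 0 | 1 | 0 |
| 1335 | 18 | 36.850 | 0 | 1629.83350 | 1 | 0 | 0 | 0 | 1 | 0 | 1 | 0 |
| 1336 | 21 | 25.800 | 0 | 2007.94500 | 1 | 0 | 0 | 0 | 0 | 1 | 1 | 0 |
| 1337 | 61 | 29.070 | 0 | 29141.36030 | 1 | 0 | 0 | 1 | 0 | 0 | 0 | 1 |

1338 rows × 12 columns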

```python
df_encoded.columns
```

```
Index(['age', 'bmi', 'children', 'charges', 'sex_female', 'sex_male',
       'region_northeast', 'region_northwest', 'region_southeast',
       'region_southwest', 'smoker_no', 'smoker_yes'],
      dtype="object")
```
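Each encoded group above contains one redundant column: `sex_male` is always `1 - sex_female`, and the same holds within the `smoker` and `region` groups. `LinearRegression` can still be fit on such data, but the individual coefficients are then not uniquely determined, which makes them harder to interpret. One common alternative, sketched here on a few hand-built rows rather than the full dataset, is to pass `drop_first=True` to `pd.get_dummies`.

```python
import pandas as pd

# A few hand-built rows (illustrative only) that cover every category
sample = pd.DataFrame({
    "sex": ["female", "male", "male", "female"],
    "smoker": ["yes", "no", "no", "no"],
    "region": ["southwest", "southeast", "northwest", "northeast"],
})

full = pd.get_dummies(sample, columns=["sex", "region", "smoker"])
reduced = pd.get_dummies(sample, columns=["sex", "region", "smoker"], drop_first=True)

print(list(full.columns))     # 2 + 4 + 2 = 8 indicator columns
print(list(reduced.columns))  # one column dropped per group, 5 remain
```

The dropped category in each group becomes the baseline, and each remaining coefficient is read as a difference from that baseline.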

## Feature selection

Next, the code selects the relevant columns of the encoded dataframe to use as the independent variables (X) and the dependent variable (y) for the linear regression model.

```python
X = df_encoded[['age', 'bmi', 'children', 'sex_female', 'sex_male',
                'region_northeast', 'region_northwest', 'region_southeast',
                'region_southwest', 'smoker_no', 'smoker_yes']]
y = df_encoded['charges']
```

The code below splits the data into training and testing sets using the `train_test_split` function, fits the linear regression model on the training data, and records the MSE of the model on both sets.

```python
# Split the data into training and test sets (80% / 20%)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
```
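Because `train_test_split` shuffles the rows at random, the split above, and every number derived from it, changes from run to run. One way to make it repeatable is to fix `random_state`. In the sketch below, the value 42 and the toy arrays are arbitrary placeholders, not part of the original analysis.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for X and y (placeholders, not the insurance data)
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

# The same random_state produces the same split on every call
split_a = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)
split_b = train_test_split(X_demo, y_demo, test_size=0.2, random_state=42)

X_train_a, X_test_a, y_train_a, y_test_a = split_a
print(X_train_a.shape, X_test_a.shape)       # 8 training rows and 2 test rows
print(np.array_equal(y_test_a, split_b[3]))  # identical test targets in both splits
```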

```python
# Create a linear regression model
model = LinearRegression()
```

```python
# Track the mean squared error on the training and test sets
train_loss = []
test_loss = []

# Train the model. LinearRegression solves least squares in closed form,
# so every pass refits to the same solution and records the same losses.
for i in range(100):
    model.fit(X_train, y_train)
    train_loss.append(mean_squared_error(y_train, model.predict(X_train)))
    test_loss.append(mean_squared_error(y_test, model.predict(X_test)))
```

```python
# R^2 (coefficient of determination) on the training and test sets
train_score = model.score(X_train, y_train)
test_score = model.score(X_test, y_test)
```

```python
# Predict the values for the training and test sets
y_train_pred = model.predict(X_train)
y_test_pred = model.predict(X_test)
```
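The `score` method of `LinearRegression` returns the coefficient of determination R², whereas `mean_squared_error` is expressed in squared units of the target. As a sketch of how these quantities relate, the example below computes MSE, RMSE, and R² by hand on a few made-up values and compares them with scikit-learn's results.

```python
import numpy as np
from sklearn.metrics import mean_squared_error, r2_score

# Made-up actual and predicted values, for illustration only
y_true = np.array([3.0, 5.0, 7.0, 9.0])
y_pred = np.array([2.5, 5.5, 7.0, 8.0])

mse = np.mean((y_true - y_pred) ** 2)            # average squared error
rmse = np.sqrt(mse)                              # back in the units of the target
ss_res = np.sum((y_true - y_pred) ** 2)          # residual sum of squares
ss_tot = np.sum((y_true - y_true.mean()) ** 2)   # total sum of squares
r2 = 1 - ss_res / ss_tot

print(mse, rmse, r2)
print(np.isclose(mse, mean_squared_error(y_true, y_pred)))
print(np.isclose(r2, r2_score(y_true, y_pred)))
```

For the insurance data, RMSE is often the easier number to read, since it is in the same units as `charges`.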

```python
# Plot predicted values against actual values
plt.scatter(y_train, y_train_pred, label='train')
plt.scatter(y_test, y_test_pred, label='test')
plt.legend()
plt.xlabel("Actual values")
plt.ylabel("Predicted values")
plt.title("Prediction line")
plt.show()
```

```python
# Plot the residuals (predicted minus actual) against the predicted values
plt.scatter(y_train_pred, y_train_pred - y_train, label='train')
plt.scatter(y_test_pred, y_test_pred - y_test, label='test')
plt.legend()
plt.xlabel("Predicted values")
plt.ylabel("Residuals")
plt.title("Residual Plot")
plt.show()
```

```python
# Plot the recorded training and test loss (MSE) per iteration
plt.plot(train_loss, label='train')
plt.plot(test_loss, label='test')
plt.legend()
plt.xlabel("Iterations")
plt.ylabel("Loss")
plt.title("Loss Plot")
plt.show()
```
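Because `LinearRegression` computes its solution in a single step, refitting it in a loop yields the same loss on every pass, so a loss plot built this way is a flat line. A curve that changes across iterations needs an iterative optimizer. The sketch below swaps in scikit-learn's `SGDRegressor`, which is a different estimator from the one used in this article, and trains it one epoch at a time with `partial_fit`. The synthetic data and the choice of 50 epochs are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import SGDRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Synthetic data (illustrative only): y = 5 + 3*x1 - 2*x2 + noise
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(250, 2))
y_demo = 5 + 3 * X_demo[:, 0] - 2 * X_demo[:, 1] + rng.normal(scale=0.5, size=250)

X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=0)

# SGD is sensitive to feature scale, so standardize using training statistics only
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

sgd = SGDRegressor(random_state=0)
sgd_train_loss, sgd_test_loss = [], []

for epoch in range(50):
    sgd.partial_fit(X_tr_s, y_tr)  # one pass over the training data
    sgd_train_loss.append(mean_squared_error(y_tr, sgd.predict(X_tr_s)))
    sgd_test_loss.append(mean_squared_error(y_te, sgd.predict(X_te_s)))

print(f"first epoch train MSE: {sgd_train_loss[0]:.3f}")
print(f"last epoch train MSE:  {sgd_train_loss[-1]:.3f}")
```

The two lists can be passed to `plt.plot` in the same way as `train_loss` and `test_loss` to draw a loss curve that falls as training progresses.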

Overall, this code performs a linear regression analysis on an insurance dataset. It starts by importing the required libraries, reads the data from a CSV file using pandas, applies one-hot encoding to the categorical columns, selects the columns to use in the model, splits the data into training and testing sets, and fits a linear regression model to the training data. It then records the MSE on both sets as a measure of performance, computes the R² scores, and plots the predictions, the residuals, and the loss.