# Overview

Multiple Linear Regression is a powerful statistical technique used for predicting the value of a dependent variable based on multiple independent variables. It takes the "information" contained in a dataset's features and learns a weight for each one, indicating how much that feature matters when calculating a prediction.
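As a minimal sketch of the idea (using toy numbers, not the project's data), ordinary least squares finds the weight for each feature, plus an intercept, that minimizes the squared prediction error:

```python
import numpy as np

# Toy design matrix: a column of ones for the intercept, then two
# illustrative features (e.g. mileage in thousands, engine size).
X = np.array([
    [1.0, 30.0, 1.0],
    [1.0, 60.0, 2.0],
    [1.0, 90.0, 1.5],
    [1.0, 20.0, 2.0],
])
y = np.array([15.0, 12.0, 9.0, 18.0])  # prices in £1,000

# Least squares solves for the weights minimizing ||X @ w - y||^2.
weights, *_ = np.linalg.lstsq(X, y, rcond=None)

# A prediction is just the weighted sum of a car's features.
prediction = X @ weights
```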

In this project, I analyzed a dataset (publicly available on Kaggle) containing details (`model`, `year`, `price`, `transmission`, `mileage`, `fuelType`, `tax`, `mpg`, `engineSize`) about used Volkswagen cars and successfully created a model that provides used car price predictions. In order to ensure a fair estimate of the model's accuracy, I followed standard Machine Learning procedure and split the original dataset into three parts, the last being a **test** dataset. This **test** dataset was not used by any of the models except the final model that was chosen.

# Feature Engineering

Feature engineering is a core part of this project. I took raw features as found in the data, and engineered new features based on them, which helped to fit a curve to closely match the actual prices found in the dataset. Some techniques used include:

- Creating polynomial features from numerical data.
- Using one-hot encoding to create a new set of features based on categorical data (`transmission`, `fuelType`, etc.).
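Both techniques can be combined in a single preprocessing step. The sketch below uses scikit-learn's `PolynomialFeatures` and `OneHotEncoder` on a few hypothetical rows mirroring the dataset's columns (the notebook's exact setup may differ):

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder

# Hypothetical rows with the same column names as the Kaggle dataset.
df = pd.DataFrame({
    "mileage": [30000, 62000, 12000],
    "engineSize": [1.0, 2.0, 1.5],
    "transmission": ["Manual", "Automatic", "Manual"],
    "fuelType": ["Petrol", "Diesel", "Petrol"],
})

# Degree-2 polynomial terms for numeric columns (x1, x2, x1^2, x1*x2, x2^2),
# one-hot dummy columns for the categorical ones.
pre = ColumnTransformer([
    ("poly", PolynomialFeatures(degree=2, include_bias=False),
     ["mileage", "engineSize"]),
    ("onehot", OneHotEncoder(), ["transmission", "fuelType"]),
])
features = pre.fit_transform(df)
```

With two numeric columns and two binary categorical columns, this yields 5 polynomial features plus 4 one-hot features per row.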

# Accuracy

Concepts such as bias and variance are discussed, and I explored how to minimize both of these so that predictions based on both the training and cross-validation sets are closely aligned and accurate. To help determine how well the model is performing in terms of accuracy, I quantified the error of the models using **MSE** (**Mean Squared Error**).
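MSE is simply the mean of the squared residuals, which penalizes large misses more heavily than small ones. A tiny illustration (with made-up prices in £1,000, not the project's actual values):

```python
import numpy as np

# Illustrative actual vs. predicted prices in £1,000.
actual = np.array([12.0, 15.5, 9.0, 20.0])
predicted = np.array([11.0, 16.0, 10.5, 18.5])

# Mean Squared Error: average of the squared differences.
mse = np.mean((actual - predicted) ** 2)  # → 1.4375
```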

Towards the end of the notebook, many additional features were added to a new model, introducing high variance: in other words, the model began to over-fit the data. View the notebook to see how, through the use of **regularization**, I addressed this issue and arrived at a final model that does a great job of predicting used car prices.
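To illustrate the idea (the notebook's exact regularization setup may differ), L2 regularization, here via scikit-learn's `Ridge`, adds a penalty on the size of the weights, trading a little bias for a large reduction in variance on unseen data:

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic over-parameterized setup: few samples, many features,
# standing in for a model with many engineered features.
X_train = rng.normal(size=(30, 20))
y_train = X_train[:, 0] * 3.0 + rng.normal(scale=0.1, size=30)
X_cv = rng.normal(size=(30, 20))
y_cv = X_cv[:, 0] * 3.0 + rng.normal(scale=0.1, size=30)

# alpha controls the strength of the L2 penalty on the weights;
# larger alpha shrinks the weights more aggressively.
model = Ridge(alpha=1.0).fit(X_train, y_train)
cv_mse = mean_squared_error(y_cv, model.predict(X_cv))
```

Tuning `alpha` against the cross-validation set is how the bias/variance trade-off is balanced in practice.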

These MSE values represent the average squared difference between the predicted and actual car prices, measured in units of (£1,000)². The initial, simple model had an MSE of **20.4**, meaning its predictions were off by about £4.5K on average (the square root of 20.4). After refining the model and applying regularization to curb overfitting, I reduced the MSE to **2.87**, or about £1.7K off the actual car prices, indicating a significant improvement in the prediction accuracy of the model.
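Since MSE is in squared units, taking the square root (RMSE) recovers a typical error in the same £1,000 units as the prices:

```python
import math

# Converting the reported MSE values (in squared £1,000 units)
# back to a typical prediction error in £1,000.
rmse_before = math.sqrt(20.4)   # initial simple model
rmse_after = math.sqrt(2.87)    # final regularized model
```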

# Conclusion

In conclusion, this project provides a great introduction to building a Multiple Linear Regression model and covers common techniques to address issues that might come up along the way. I hope you enjoy reviewing it as much as I enjoyed building it!