What is Linear Regression?
Linear regression is a linear approach to modeling the relationship between a dependent variable and one or more independent variables. It aims to find the best-fitting line that describes the relationship between the input features (independent variables) and the target output (dependent variable). The primary goal of linear regression is to minimize the difference between the actual output and the predicted output, thereby reducing the prediction error.
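In the simplest case, with a single input feature x, the model takes the form ŷ = β0 + β1 * x, where the intercept β0 and the slope β1 are chosen to minimize the sum of squared errors Σ (yᵢ − ŷᵢ)² over the training data.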
The Role of Linear Regression in Supervised Learning
Supervised learning is a type of machine learning in which the algorithm is trained on a labeled dataset, meaning each data point in the training set has a known output value. Linear regression is an essential supervised learning technique used for purposes such as predicting continuous target values, forecasting trends, and quantifying how individual features influence an outcome.
To demonstrate the power of linear regression, let's walk through a simple example: building a linear regression model to predict the prices of used cars in India and generating a set of insights and recommendations that will help the business.
Context
There is huge demand for used cars in the Indian market today. As sales of new cars have slowed in the recent past, the pre-owned car market has continued to grow and is now larger than the new car market. Cars4U is a budding tech start-up that aims to gain a foothold in this market.
In 2018-19, while new car sales were recorded at 3.6 million units, around 4 million second-hand cars were bought and sold. There is a slowdown in new car sales and that could mean that the demand is shifting towards the pre-owned market. In fact, some car sellers replace their old cars with pre-owned cars instead of buying new ones.
Unlike new cars, where price and supply are fairly deterministic and managed by OEMs (Original Equipment Manufacturers), with dealership-level discounts coming into play only in the last stage of the customer journey, used cars are very different beasts, with huge uncertainty in both pricing and supply. Keeping this in mind, the pricing scheme of these used cars becomes important in order to grow in the market. As a senior data scientist at Cars4U, you have to come up with a pricing model that can effectively predict the price of used cars and help the business devise profitable strategies using differential pricing. For example, if the business knows the market price, it will never sell anything below it.
Objective
To explore and visualize the dataset, build a linear regression model to predict the prices of used cars, and generate a set of insights and recommendations that will help the business.
Data Description
The data contains the different attributes of used cars sold in different locations. The detailed data dictionary is given below.
Data Dictionary
We will follow this methodology: explore and preprocess the data, build the regression model, evaluate its performance, and translate the results into business recommendations.
The dataset used to build this model can be found on my GitHub page (by clicking the link here).
# Libraries to help with reading and manipulating data
import numpy as np
import pandas as pd

# Libraries to help with data visualization
import matplotlib.pyplot as plt
%matplotlib inline
import seaborn as sns
sns.set()

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# Sklearn package's randomized data splitting function
from sklearn.model_selection import train_test_split

# Sklearn libraries
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures, OneHotEncoder
from sklearn import linear_model
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
This project was coded using Google Colab. The data was read directly from Google Drive.
#Mount and connect Google Drive
from google.colab import drive
drive.mount('/content/drive')

#Import dataset "used_cars_data.csv"
data = pd.read_csv('/content/drive/My Drive/Colab Notebooks/used_cars_data.csv')
Exploratory data analysis is a crucial initial step in the machine learning process, aimed at providing a comprehensive understanding of the dataset at hand. By investigating the underlying structure, patterns, and relationships within the data, the analysis allows practitioners to make informed decisions about feature selection, model choice, and preprocessing requirements.
This process often involves techniques such as data visualization, summary statistics, and correlation analysis to identify trends, detect outliers, and assess data quality. Gaining insights through exploratory data analysis not only helps uncover hidden relationships and nuances in the data but also aids in hypothesis generation and model validation. Ultimately, a thorough exploratory analysis sets the stage for building more accurate and reliable machine learning models, ensuring that the data-driven insights derived from them are both meaningful and actionable.
Review the Dataset
#Sample of (10) rows
data.sample(10)
Next, we will look at the shape of the dataset:
#Number of rows and columns
print(f'Number of rows: {data.shape[0]} and Number of columns: {data.shape[1]}')
We see from reviewing the shape that the dataset contains 7,253 rows and 14 columns. Additionally, we see that the index is identical to the S.No. column, so we can drop it as it does not offer any value to our model:
#Drop S.No. column
data.drop(['S.No.'], axis=1, inplace=True)
data.reset_index(inplace=True, drop=True)
Next, review the datatypes:
#Review the datatypes
data.info()
The info() output shows the datatype of each column, and it also reveals which columns are missing data.
We can also conduct a statistical analysis on the dataset by running:
#Statistical analysis of dataset
data.describe().T
The results return summary statistics for the numeric columns: Year, Kilometers_Driven, Seats, New_Price, and Price.
When checking for duplicates, we found there were three duplicated rows in the dataset. Since these do not add any additional value, we will move forward by eliminating these rows.
#Check for duplicates
data.duplicated().sum()

#Dropping duplicated rows
data.drop_duplicates(keep='first', inplace=True)

#Confirm duplicates are removed
data.duplicated().sum()
We are now ready to move to univariate analysis. We will start with the name column. Right off the bat, we noticed that the dataset contains both the make and model names of the cars. For this analysis, we have elected to keep only the make and drop the model name.
#Create a new column of make by separating it from the name
data['Make'] = data['Name'].str.split(' ').str[0]

#Dropping name column
data.drop(['Name'], axis=1, inplace=True)
data.reset_index(inplace=True, drop=True)
Next, we will convert this datatype from an object to a category datatype:
#Convert make column from object to category
data['Make'] = data['Make'].astype('category', errors='raise')

#Confirm datatype
data['Make'].dtype
Let’s evaluate the breakdown of each make by counting each and storing them in a new data frame:
#How many values for each make
pd.DataFrame(data[['Make']].value_counts(ascending=False))
One thing that was noticed is that there are two categories for the make Isuzu. Let’s consolidate this into a single make:
#Consolidate make Isuzu into one category
data.loc[data['Make'] == 'ISUZU', 'Make'] = 'Isuzu'
data['Make'] = data['Make'].cat.remove_categories('ISUZU')
To visualize the make category breakdown:
#Countplot of the make column
plt.figure(figsize=(30, 8))
ax = sns.countplot(x='Make', data=data)
ax.set_xticklabels(ax.get_xticklabels(), rotation=90);
The top five makes based on the results are:
Let’s now explore the price data. The first thing we validated is whether there were NULL values in the price column. After evaluation, we identified 1,233 missing values. To fix this, we replaced the NULL values with the median price of the cars.
#Missing data for price
data['Price'].isnull().sum()

#Replace NaN values in the price column with the median
data['Price'] = data['Price'].fillna(int(data['Price'].median()))
When looking at a frequency dataframe, we see that the most common price identified was 5 lakhs (or approximately $6,115 USD).
#Review the price breakdown
pd.set_option('display.max_rows', 10)
pd.DataFrame(data['Price'].value_counts(ascending=False))
We also conducted a statistical analysis and found that prices range from 0.44 to 160 lakhs, with a mean price of 8.72 lakhs.
#Statistical analysis of price
pd.DataFrame(data['Price']).describe().T
Here is a breakdown of the average price of the cars by make:
#Average price of cars by make
avg_price = data.groupby(['Make'])['Price'].mean().fillna(0).sort_values(ascending=False).index

#Catplot of make and price
sns.catplot(x="Make", y="Price", data=data, kind='bar', height=7, aspect=2,
            order=avg_price).set(title='Price by Make')
plt.xticks(rotation=90);
It is interesting to note the difference between the average cost of new cars of the same make and the used cars available at Cars4U:
#Average new price of cars by make
avg_new_price = data.groupby(['Make'])['New_Price'].mean().fillna(0).sort_values(ascending=False).index

#Catplot of make and new price
sns.catplot(x="Make", y="New_Price", data=data, kind='bar', height=7, aspect=2,
            order=avg_new_price).set(title='New Price by Make')
plt.xticks(rotation=90);
We can see that there is a moderate positive correlation between the price of a new car and the price of the cars at Cars4U:
#Correlation between price and new price
data[['New_Price', 'Price']].corr()
Next, we converted the transmission data to categorical data and reviewed the breakdown between automatic and manual transmission cars:
#Convert Transmission column from object to category
data['Transmission'] = data['Transmission'].astype('category', errors='raise')

#Displot of the transmission column
plt.figure(figsize=(8, 8))
sns.displot(x='Transmission', data=data);

#Specific value counts for each transmission type
pd.DataFrame(data['Transmission'].value_counts(ascending=False))
As we see from the distribution plot below, manual transmission cars account for 71.8% of the cars at Cars4U – far more than automatic transmission cars.
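The percentage share quoted above can be computed directly from the value counts; a minimal sketch:
#Percentage share of each transmission type
data['Transmission'].value_counts(normalize=True).mul(100).round(1)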
When evaluating the average cost of the cars with manual transmissions for new and used cars, we identified a 44.3% difference in prices:
#Subset of cars with manual transmissions (needed for the plots below)
manual = data[data['Transmission'] == 'Manual']

#Average price of cars by make with manual transmissions
man_price = manual.groupby(['Make'])['Price'].mean().fillna(0).sort_values(ascending=False).index

#Catplot of make and price for all manual transmissions
sns.catplot(x="Make", y="Price", data=manual, kind='bar', height=7, aspect=2,
            order=man_price).set(title='Price of Manual Make Cars')
plt.xticks(rotation=90);

#Average new price of cars by make with manual transmissions
man_cars = manual.groupby(['Make'])['New_Price'].mean().fillna(0).sort_values(ascending=False).index

#Catplot of make and new price for all manual transmissions
sns.catplot(x="Make", y="New_Price", data=manual, kind='bar', height=7, aspect=2,
            order=man_cars).set(title='New Price by Manual Make Cars')
plt.xticks(rotation=90);

#Difference between the mean price and mean new price of manual cars
manual['Price'].mean() / manual['New_Price'].mean()
It is interesting to note that there is a smaller difference in price between used and new car prices for cars with automatic transmissions – a difference of only 38.7%.
#Subset of cars with automatic transmissions
automatic = data[data['Transmission'] == 'Automatic']

#Average price of cars by make with automatic transmissions
auto_price = automatic.groupby(['Make'])['Price'].mean().fillna(0).sort_values(ascending=False).index

#Catplot of make and price for all automatic transmissions
sns.catplot(x="Make", y="Price", data=automatic, kind='bar', height=7, aspect=2,
            order=auto_price).set(title='Price of Automatic Make Cars')
plt.xticks(rotation=90);

#Average new price of cars by make with automatic transmissions
new_auto = automatic.groupby(['Make'])['New_Price'].mean().fillna(0).sort_values(ascending=False).index

#Catplot of make and new price for all automatic transmissions
sns.catplot(x="Make", y="New_Price", data=automatic, kind='bar', height=7, aspect=2,
            order=new_auto).set(title='New Price of Automatic Make Cars')
plt.xticks(rotation=90);

#Difference between the mean price and mean new price of automatic cars
automatic['Price'].mean() / automatic['New_Price'].mean()
There are other features that we can explore in our exploratory data analysis (all of which you can view in the GitHub repo found here), but we will now evaluate the correlation between all these features to help identify the strength of their relationships. One thing that is important to keep in mind when completing the data analysis is to ensure that all features containing NaN or no data are either dropped or imputed. It is also important to treat any outliers that could potentially skew your dataset and have an adverse impact on your model metrics. For example, the power feature contained a number of outliers that we treated by first converting them to NaN values with NumPy and then replacing them with the median central tendency:
#Treating the outliers for power
power_outliers = [340., 360., 362.07, 362.9, 364.9, 367., 382., 387.3, 394.3,
                  395., 402., 421., 444., 450., 488.1, 500., 503., 550., 552.,
                  560., 616.]
data['Power_Outliers'] = data['Power']

#Replacing the outlier power values with np.nan
for outlier in power_outliers:
    data.loc[data['Power_Outliers'] == outlier, 'Power_Outliers'] = np.nan
data['Power_Outliers'].isnull().sum()

#Group by Make and impute the missing values with each make's median
#(transform keeps the original row alignment)
data['Power_Outliers'] = data.groupby(['Make'])['Power_Outliers'].transform(lambda fix: fix.fillna(fix.median()))
data['Power_Outliers'].isnull().sum()

#Transfer new data back to original column
data['Power'] = data['Power_Outliers']

#Drop Power_Outliers since it is no longer needed
data.drop(['Power_Outliers'], axis=1, inplace=True)
data.reset_index(inplace=True, drop=True)
You could also choose to drop rows with missing data if the dataset is large enough; however, this should be done with caution so as not to impact the results of your models, as it can lead to underfitting. Underfitting occurs when a machine learning model fails to capture the underlying patterns in the data, resulting in poor performance on both the training set and the test set. This usually happens when the model is too simple or when there is not enough data to train the model effectively. To avoid underfitting, it's important to ensure that your dataset is large and diverse enough to capture the complexities of the problem you're trying to solve. Additionally, use a model complexity that is neither too simple nor too complex for your data. You can also leverage techniques like cross-validation, shown in the sketch below, to get a better estimate of your model's performance on unseen data.
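As a minimal sketch of five-fold cross-validation with scikit-learn (using a synthetic toy dataset for illustration rather than the cars data, which is not yet fully preprocessed at this point):
#Five-fold cross-validation with scikit-learn
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_regression

#Toy regression data for illustration only
X, y = make_regression(n_samples=200, n_features=5, noise=10, random_state=1)

cv_scores = cross_val_score(LinearRegression(), X, y, cv=5,
                            scoring='neg_root_mean_squared_error')
print(-cv_scores.mean())  #average RMSE across the five folds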
Below is a pair plot that highlights the strength of the relationships for all possible bivariate relationships:
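A pair plot like this can be generated with seaborn; a minimal sketch (the column list here is illustrative and may differ from the exact set used in the notebook):
#Pair plot of selected numeric features
sns.pairplot(data[['Price', 'New_Price', 'Kilometers_Driven', 'Power', 'Year']]);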
Here is a heat map of the correlations represented above:
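A heat map of this kind can be produced with seaborn's heatmap function; a minimal sketch:
#Heat map of pairwise correlations between numeric features
plt.figure(figsize=(12, 8))
sns.heatmap(data.select_dtypes(include=np.number).corr(), annot=True, cmap='coolwarm');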
To improve our model, we performed log transformations on our price feature. Log transformations are a common preprocessing technique used in machine learning to modify the distribution of data features. They can be particularly useful when dealing with data that has a skewed distribution, as log transformations can help make the data more normally distributed, which can improve the performance of some machine learning algorithms. The main reasons for using log transformations are to reduce skewness, stabilize variance, and turn multiplicative relationships between variables into additive (linear) ones.
Keep in mind that log transformations are not suitable for all types of data, particularly data with negative values or zero, as the logarithm is undefined for these values. Additionally, it’s essential to consider the specific machine learning algorithm and the nature of the data before deciding whether to apply a log transformation or another preprocessing technique. Below was the log transformation performed on our price feature:
#Create log transformation columns
data['Price_Log'] = np.log(data['Price'])
data['New_Price_Log'] = np.log(data['New_Price'])
data.head()
Notice how the distribution is now much more balanced and closer to normally distributed.
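One practical note: a model trained on Price_Log predicts on the log scale, so its outputs should be mapped back to actual prices with the exponential before being reported. A minimal sketch:
#Map log-scale values back to the original price scale
recovered = np.exp(data['Price_Log'])
print(np.allclose(recovered, data['Price']))  #expected: True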
The last step in our data preprocessing is to use one-hot encoding on our categorical variables.
One-Hot Encoding is a technique used in machine learning to convert categorical variables into a binary representation that can be easily understood and processed by machine learning algorithms. Categorical variables are those that take on a limited number of distinct categories or levels, such as gender, color, or type of car. Most machine learning algorithms require numerical input, so converting categorical variables into a numerical format is a crucial preprocessing step.
The one-hot encoding process involves creating new binary features for each unique category in a categorical variable. Each new binary feature represents a specific category and takes the value 1 if the original variable's value is equal to that category, and 0 otherwise. In short, the process is: (1) identify the unique categories in the variable, (2) create one binary column per category, (3) set each column to 1 or 0 for every row depending on that row's original value, and (4) optionally drop one of the columns to avoid redundancy.
For example, let’s say you have a dataset with a categorical variable ‘Color’ that has three unique categories: Red, Blue, and Green. To apply one-hot encoding, you would create three new binary features: ‘Color_Red’, ‘Color_Blue’, and ‘Color_Green’. If an instance in the dataset has the value ‘Red’ for the original ‘Color’ variable, then the binary features would be set as follows: ‘Color_Red’ = 1, ‘Color_Blue’ = 0, and ‘Color_Green’ = 0.
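A minimal sketch of this with pandas (the Color column is the hypothetical example above, not part of the cars dataset):
#One-hot encoding a toy 'Color' column
toy = pd.DataFrame({'Color': ['Red', 'Blue', 'Green']})
print(pd.get_dummies(toy, columns=['Color']))
#Produces Color_Blue, Color_Green, and Color_Red columns
#(recent pandas versions display True/False rather than 1/0)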
The advantages of using this technique are that it makes categorical data usable by algorithms that require numerical input, and it does so without imposing an artificial order on the categories (unlike simply assigning each category an integer).
There are some drawbacks to one-hot encoding as well. These include a large increase in dimensionality when a variable has many unique categories, sparse features that individually carry little information, and multicollinearity between the new columns (the "dummy variable trap") if one category is not dropped.
To mitigate these drawbacks, you can consider using other encoding techniques, such as target encoding or ordinal encoding, depending on the specific nature of the categorical variables and the machine learning algorithm being used; however, for this model, one-hot encoding is our best option.
#One-hot encoding our variables
data = pd.get_dummies(data, columns=['Location', 'Fuel_Type', 'Transmission', 'Owner_Type', 'Make'],
                      drop_first=True)
#Select independent and dependent variables
#(drop the target and its log transform from the encoded dataset so they
# do not leak into the feature set)
a = data.drop(['Price', 'Price_Log'], axis=1)
b = data['Price']
#Splitting the data in a 70:30 ratio for train to test data
a_train, a_test, b_train, b_test = train_test_split(a, b, test_size=0.30, random_state=1)

#View split
print("Number of rows in train data =", a_train.shape[0])
print("Number of rows in test data =", a_test.shape[0])
#Fit model_one
model_one = LinearRegression()
model_one.fit(a_train, b_train)
We can now evaluate the model performance on both the training and the testing dataset. In evaluating a supervised learning model using linear regression, there are several metrics that can be used to measure its performance. However, the most commonly used and valuable metric is the Root Mean Squared Error (RMSE).
RMSE is calculated as the square root of the mean of the squared differences between the predicted and actual values. It provides an estimate of the average error in the predictions and is particularly useful because it is in the same units as the target variable. A lower RMSE value indicates a better fit of the model to the data.
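Formally, for n observations with actual values yᵢ and predictions ŷᵢ: RMSE = √( (1/n) * Σᵢ (yᵢ − ŷᵢ)² )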
Other metrics that can be used to evaluate a linear regression model include Mean Absolute Error (MAE), Mean Squared Error (MSE), and R-squared (R²), but RMSE is often preferred due to its interpretability and sensitivity to larger errors in the predictions.
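The snippet below calls a helper named model_performance_regression, which is defined in the accompanying notebook. A minimal sketch of what such a helper might look like (assuming it returns RMSE, MAE, and R-squared in a one-row dataframe; the metric functions were imported earlier):
#A possible implementation of the model_performance_regression helper
def model_performance_regression(model, X, y):
    pred = model.predict(X)  #predictions on the given feature set
    return pd.DataFrame({
        'RMSE': [np.sqrt(mean_squared_error(y, pred))],
        'MAE': [mean_absolute_error(y, pred)],
        'R-squared': [r2_score(y, pred)],
    })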
#Checking model performance on train set
print("Training Performance")
print('\n')
training_performance_1 = model_performance_regression(model_one, a_train, b_train)
training_performance_1

#Checking model performance on test set
print("Test Performance")
print("\n")
test_performance_1 = model_performance_regression(model_one, a_test, b_test)
test_performance_1
Next, we will evaluate the coefficients and intercept of our first model. The coefficients and intercepts play a crucial role in understanding the relationship between the input features and the target variable. Evaluating the coefficients and intercepts provides insights into the model’s behavior and helps in interpreting the results. Since the coefficients of a linear regression model represent the strength and direction of the relationship between each independent variable and the dependent variable, a positive coefficient indicates that as the feature value increases, the target variable also increases, while a negative coefficient suggests the opposite. The intercept represents the expected value of the target variable when all the independent variables are zero.
By examining the coefficients and intercept, we can better understand the relationships between the variables and how they contribute to the model’s predictions. Additionally, evaluating the coefficients can help us determine the relative importance of each feature in the model. Features with higher absolute coefficients have a more significant impact on the target variable, while features with lower absolute coefficients have a smaller impact. This can help in feature selection and reducing model complexity by eliminating less important features.
#Coefficients and intercept of model_one
coef_data_1 = pd.DataFrame(np.append(model_one.coef_, model_one.intercept_),
                           index=a_train.columns.tolist() + ["Intercept"],
                           columns=["Coefficients"])
coef_data_1
#Evaluation of feature importance
imp_1 = pd.DataFrame(data={'Attribute': a_train.columns, 'Importance': model_one.coef_})
imp_1 = imp_1.sort_values(by='Importance', ascending=False)
imp_1
The output of a supervised learning linear regression model represents the predicted value of the target variable based on the input features. Linear regression models establish a linear relationship between the input features and the target variable by estimating a coefficient for each input feature and an intercept term.
A linear regression model can be represented by the following equation: y = β0 + β1 * x1 + β2 * x2 + … + βn * xn + ε
Where: y is the predicted value of the target variable; β0 is the intercept (the expected value of y when all features are zero); β1 through βn are the coefficients of the input features x1 through xn; and ε is the error term, capturing the variation the linear model does not explain.
#Equation of linear regression
equation_one = "Price = " + str(model_one.intercept_)
print(equation_one, end=" ")
for i in range(len(a_train.columns)):
    if i != len(a_train.columns) - 1:
        print("+ (", model_one.coef_[i], ")*(", a_train.columns[i], ")", end=" ")
    else:
        print("+ (", model_one.coef_[i], ")*(", a_train.columns[i], ")")
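As a quick sanity check (a minimal sketch), the fitted equation can be verified numerically: the model's predictions should equal the intercept plus the dot product of the features with the coefficients.
#Verify that predictions = intercept + X · coefficients
check = model_one.intercept_ + a_test.to_numpy(dtype=float) @ model_one.coef_
print(np.allclose(model_one.predict(a_test), check))  #expected: True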
Lastly, we will evaluate the PolynomialFeatures transformation to capture non-linear relationships between input features and the target variable. By introducing polynomial features, we can model these non-linear relationships and improve the performance of the linear regression model.
PolynomialFeatures transformation works by generating new features from the original input features through polynomial combinations of the original features up to a specified degree. For example, if the original features are [x1, x2], and the specified degree is 2, the transformed features would be [1, x1, x2, x1^2, x1*x2, x2^2].
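A quick illustration on a two-feature toy input (not the cars data):
#Degree-2 polynomial expansion of a single sample [x1=2, x2=3]
demo = PolynomialFeatures(degree=2)
print(demo.fit_transform([[2, 3]]))
#[[1. 2. 3. 4. 6. 9.]] -> [1, x1, x2, x1^2, x1*x2, x2^2]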
#PolynomialFeatures transformation
#(interaction_only=True keeps the interaction terms but drops the squared terms)
poly = PolynomialFeatures(degree=2, interaction_only=True)
a_train2 = poly.fit_transform(a_train)
a_test2 = poly.transform(a_test)  #use transform (not fit_transform) on the test set

poly_clf = linear_model.LinearRegression()
poly_clf.fit(a_train2, b_train)
print(poly_clf.score(a_train2, b_train))
The polynomial transformation improved the model's R² from 0.79 to 0.97.
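Since higher-degree features can overfit, it is worth scoring the transformed test set as well; a minimal sketch using the arrays created above:
#Score the polynomial model on the held-out test set to check for overfitting
print(poly_clf.score(a_test2, b_test))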
This exercise produced ten models in total (to see the remaining nine, check out my notebook on GitHub), which helped us identify some key takeaways and recommendations for the business.
Lower-end cars had more of a negative impact on price. Dealerships should stock more mid-range cars to have a greater impact on sales.
Another key point is that while the majority of the cars in the dataset are of petrol and diesel fuel types, electric cars had a positive effect on the price model. This is a good opportunity for dealers to start offering more selections in the electric car market – especially since fuel prices continue to rise.
In many of the models built, Location_Kolkata had a negative effect on price. Furthermore, we observed a good correlation between price and new price. Given this relationship, dealerships should understand that as new car prices rise, used car prices can also increase. Secondly, both mileage and kilometers driven have an inverse relationship with price – as they increase, the price drops. This makes sense, as buyers are seeking cars that offer good fuel efficiency (km/kg) and have fewer kilometers on the odometer. Customers should expect to pay more for these cars.
The recommendations are pragmatic. The best-performing model used the log of price; in reality, this will mean nothing to the salespeople, so predicted log values should be converted back into actual prices (by exponentiating them, as shown earlier) before they are shared. Beyond that, dealers should look to stock more mid-range cars, expand their selection of electric vehicles, and set used-car prices with new-car prices, fuel efficiency, and kilometers driven in mind.