


This notebook walks step by step through building a machine learning model to predict house prices. As in a standard machine learning workflow, the process is divided into the following steps:

  1. Understanding the Problem;
  2. Exploratory Data Analysis;
  3. Data Preprocessing;
  4. Feature Selection;
  5. Modeling;
  6. Evaluation.

Understanding the problem

The dataset has 79 explanatory variables describing (almost) every aspect of residential homes. Our task is to predict the sale price of houses the model has not seen, using machine learning.

# basic library import section
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
pd.set_option('display.max_rows', None) # display all rows
%matplotlib inline
import matplotlib_inline
# load the two required datasets
train = pd.read_csv('./house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('./house-prices-advanced-regression-techniques/test.csv')
train.shape, test.shape
((1460, 81), (1459, 80))
Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition SalePrice
0 1 60 RL 65.0 8450 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 208500
1 2 20 RL 80.0 9600 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 5 2007 WD Normal 181500
2 3 60 RL 68.0 11250 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 223500
3 4 70 RL 60.0 9550 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 2 2006 WD Abnorml 140000
4 5 60 RL 84.0 14260 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 12 2008 WD Normal 250000
5 6 50 RL 85.0 14115 Pave NaN IR1 Lvl AllPub ... 0 NaN MnPrv Shed 700 10 2009 WD Normal 143000
6 7 20 RL 75.0 10084 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 8 2007 WD Normal 307000
7 8 60 RL NaN 10382 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN Shed 350 11 2009 WD Normal 200000
8 9 50 RM 51.0 6120 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 4 2008 WD Abnorml 129900
9 10 190 RL 50.0 7420 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 1 2008 WD Normal 118000
10 11 20 RL 70.0 11200 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 2 2008 WD Normal 129500
11 12 60 RL 85.0 11924 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 7 2006 New Partial 345000
12 13 20 RL NaN 12968 Pave NaN IR2 Lvl AllPub ... 0 NaN NaN NaN 0 9 2008 WD Normal 144000
13 14 20 RL 91.0 10652 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN NaN 0 8 2007 New Partial 279500
14 15 20 RL NaN 10920 Pave NaN IR1 Lvl AllPub ... 0 NaN GdWo NaN 0 5 2008 WD Normal 157000
15 16 45 RM 51.0 6120 Pave NaN Reg Lvl AllPub ... 0 NaN GdPrv NaN 0 7 2007 WD Normal 132000
16 17 20 RL NaN 11241 Pave NaN IR1 Lvl AllPub ... 0 NaN NaN Shed 700 3 2010 WD Normal 149000
17 18 90 RL 72.0 10791 Pave NaN Reg Lvl AllPub ... 0 NaN NaN Shed 500 10 2006 WD Normal 90000
18 19 20 RL 66.0 13695 Pave NaN Reg Lvl AllPub ... 0 NaN NaN NaN 0 6 2008 WD Normal 159000
19 20 20 RL 70.0 7560 Pave NaN Reg Lvl AllPub ... 0 NaN MnPrv NaN 0 5 2009 COD Abnorml 139000

20 rows × 81 columns

Let's print a concise summary of our train DataFrame.

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
 #   Column         Non-Null Count  Dtype
---  ------         --------------  -----
 0   Id             1460 non-null   int64
 1   MSSubClass     1460 non-null   int64
 2   MSZoning       1460 non-null   object
 3   LotFrontage    1201 non-null   float64
 4   LotArea        1460 non-null   int64
 5   Street         1460 non-null   object
 6   Alley          91 non-null     object
 7   LotShape       1460 non-null   object
 8   LandContour    1460 non-null   object
 9   Utilities      1460 non-null   object
 10  LotConfig      1460 non-null   object
 11  LandSlope      1460 non-null   object
 12  Neighborhood   1460 non-null   object
 13  Condition1     1460 non-null   object
 14  Condition2     1460 non-null   object
 15  BldgType       1460 non-null   object
 16  HouseStyle     1460 non-null   object
 17  OverallQual    1460 non-null   int64
 18  OverallCond    1460 non-null   int64
 19  YearBuilt      1460 non-null   int64
 20  YearRemodAdd   1460 non-null   int64
 21  RoofStyle      1460 non-null   object
 22  RoofMatl       1460 non-null   object
 23  Exterior1st    1460 non-null   object
 24  Exterior2nd    1460 non-null   object
 25  MasVnrType     1452 non-null   object
 26  MasVnrArea     1452 non-null   float64
 27  ExterQual      1460 non-null   object
 28  ExterCond      1460 non-null   object
 29  Foundation     1460 non-null   object
 30  BsmtQual       1423 non-null   object
 31  BsmtCond       1423 non-null   object
 32  BsmtExposure   1422 non-null   object
 33  BsmtFinType1   1423 non-null   object
 34  BsmtFinSF1     1460 non-null   int64
 35  BsmtFinType2   1422 non-null   object
 36  BsmtFinSF2     1460 non-null   int64
 37  BsmtUnfSF      1460 non-null   int64
 38  TotalBsmtSF    1460 non-null   int64
 39  Heating        1460 non-null   object
 40  HeatingQC      1460 non-null   object
 41  CentralAir     1460 non-null   object
 42  Electrical     1459 non-null   object
 43  1stFlrSF       1460 non-null   int64
 44  2ndFlrSF       1460 non-null   int64
 45  LowQualFinSF   1460 non-null   int64
 46  GrLivArea      1460 non-null   int64
 47  BsmtFullBath   1460 non-null   int64
 48  BsmtHalfBath   1460 non-null   int64
 49  FullBath       1460 non-null   int64
 50  HalfBath       1460 non-null   int64
 51  BedroomAbvGr   1460 non-null   int64
 52  KitchenAbvGr   1460 non-null   int64
 53  KitchenQual    1460 non-null   object
 54  TotRmsAbvGrd   1460 non-null   int64
 55  Functional     1460 non-null   object
 56  Fireplaces     1460 non-null   int64
 57  FireplaceQu    770 non-null    object
 58  GarageType     1379 non-null   object
 59  GarageYrBlt    1379 non-null   float64
 60  GarageFinish   1379 non-null   object
 61  GarageCars     1460 non-null   int64
 62  GarageArea     1460 non-null   int64
 63  GarageQual     1379 non-null   object
 64  GarageCond     1379 non-null   object
 65  PavedDrive     1460 non-null   object
 66  WoodDeckSF     1460 non-null   int64
 67  OpenPorchSF    1460 non-null   int64
 68  EnclosedPorch  1460 non-null   int64
 69  3SsnPorch      1460 non-null   int64
 70  ScreenPorch    1460 non-null   int64
 71  PoolArea       1460 non-null   int64
 72  PoolQC         7 non-null      object
 73  Fence          281 non-null    object
 74  MiscFeature    54 non-null     object
 75  MiscVal        1460 non-null   int64
 76  MoSold         1460 non-null   int64
 77  YrSold         1460 non-null   int64
 78  SaleType       1460 non-null   object
 79  SaleCondition  1460 non-null   object
 80  SalePrice      1460 non-null   int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
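
The summary above has the shape of the output of `DataFrame.info()`. As a minimal sketch on a small made-up frame (the values are illustrative and not read from train.csv), the same call and a per-column null count could look like this:

```python
import numpy as np
import pandas as pd

# Toy frame standing in for `train`; the values are made up for illustration.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0],
    "Alley": [np.nan, np.nan, "Grvl"],
    "SalePrice": [208500, 181500, 223500],
})

toy.info()                        # prints column names, non-null counts and dtypes
null_counts = toy.isnull().sum()  # number of missing values per column
print(null_counts)
```

Columns with very few non-null entries, such as Alley or PoolQC in the real summary, stand out immediately in this view.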

Descriptive statistics of our train set

count mean std min 25% 50% 75% max
Id 1460.0 730.500000 421.610009 1.0 365.75 730.5 1095.25 1460.0
MSSubClass 1460.0 56.897260 42.300571 20.0 20.00 50.0 70.00 190.0
LotFrontage 1201.0 70.049958 24.284752 21.0 59.00 69.0 80.00 313.0
LotArea 1460.0 10516.828082 9981.264932 1300.0 7553.50 9478.5 11601.50 215245.0
OverallQual 1460.0 6.099315 1.382997 1.0 5.00 6.0 7.00 10.0
OverallCond 1460.0 5.575342 1.112799 1.0 5.00 5.0 6.00 9.0
YearBuilt 1460.0 1971.267808 30.202904 1872.0 1954.00 1973.0 2000.00 2010.0
YearRemodAdd 1460.0 1984.865753 20.645407 1950.0 1967.00 1994.0 2004.00 2010.0
MasVnrArea 1452.0 103.685262 181.066207 0.0 0.00 0.0 166.00 1600.0
BsmtFinSF1 1460.0 443.639726 456.098091 0.0 0.00 383.5 712.25 5644.0
BsmtFinSF2 1460.0 46.549315 161.319273 0.0 0.00 0.0 0.00 1474.0
BsmtUnfSF 1460.0 567.240411 441.866955 0.0 223.00 477.5 808.00 2336.0
TotalBsmtSF 1460.0 1057.429452 438.705324 0.0 795.75 991.5 1298.25 6110.0
1stFlrSF 1460.0 1162.626712 386.587738 334.0 882.00 1087.0 1391.25 4692.0
2ndFlrSF 1460.0 346.992466 436.528436 0.0 0.00 0.0 728.00 2065.0
LowQualFinSF 1460.0 5.844521 48.623081 0.0 0.00 0.0 0.00 572.0
GrLivArea 1460.0 1515.463699 525.480383 334.0 1129.50 1464.0 1776.75 5642.0
BsmtFullBath 1460.0 0.425342 0.518911 0.0 0.00 0.0 1.00 3.0
BsmtHalfBath 1460.0 0.057534 0.238753 0.0 0.00 0.0 0.00 2.0
FullBath 1460.0 1.565068 0.550916 0.0 1.00 2.0 2.00 3.0
HalfBath 1460.0 0.382877 0.502885 0.0 0.00 0.0 1.00 2.0
BedroomAbvGr 1460.0 2.866438 0.815778 0.0 2.00 3.0 3.00 8.0
KitchenAbvGr 1460.0 1.046575 0.220338 0.0 1.00 1.0 1.00 3.0
TotRmsAbvGrd 1460.0 6.517808 1.625393 2.0 5.00 6.0 7.00 14.0
Fireplaces 1460.0 0.613014 0.644666 0.0 0.00 1.0 1.00 3.0
GarageYrBlt 1379.0 1978.506164 24.689725 1900.0 1961.00 1980.0 2002.00 2010.0
GarageCars 1460.0 1.767123 0.747315 0.0 1.00 2.0 2.00 4.0
GarageArea 1460.0 472.980137 213.804841 0.0 334.50 480.0 576.00 1418.0
WoodDeckSF 1460.0 94.244521 125.338794 0.0 0.00 0.0 168.00 857.0
OpenPorchSF 1460.0 46.660274 66.256028 0.0 0.00 25.0 68.00 547.0
EnclosedPorch 1460.0 21.954110 61.119149 0.0 0.00 0.0 0.00 552.0
3SsnPorch 1460.0 3.409589 29.317331 0.0 0.00 0.0 0.00 508.0
ScreenPorch 1460.0 15.060959 55.757415 0.0 0.00 0.0 0.00 480.0
PoolArea 1460.0 2.758904 40.177307 0.0 0.00 0.0 0.00 738.0
MiscVal 1460.0 43.489041 496.123024 0.0 0.00 0.0 0.00 15500.0
MoSold 1460.0 6.321918 2.703626 1.0 5.00 6.0 8.00 12.0
YrSold 1460.0 2007.815753 1.328095 2006.0 2007.00 2008.0 2009.00 2010.0
SalePrice 1460.0 180921.195890 79442.502883 34900.0 129975.00 163000.0 214000.00 755000.0

Feature Selection

Separating numeric columns and categorical columns

numeric_cols = train.select_dtypes(exclude=['object'])
categorical_cols = train.select_dtypes(include=['object'])
#finding important features
correlation_num = numeric_cols.corr()
correlation_num.sort_values(["SalePrice"], ascending = False, inplace = True)
SalePrice        1.000000
OverallQual      0.790982
GrLivArea        0.708624
GarageCars       0.640409
GarageArea       0.623431
TotalBsmtSF      0.613581
1stFlrSF         0.605852
FullBath         0.560664
TotRmsAbvGrd     0.533723
YearBuilt        0.522897
YearRemodAdd     0.507101
GarageYrBlt      0.486362
MasVnrArea       0.477493
Fireplaces       0.466929
BsmtFinSF1       0.386420
LotFrontage      0.351799
WoodDeckSF       0.324413
2ndFlrSF         0.319334
OpenPorchSF      0.315856
HalfBath         0.284108
LotArea          0.263843
BsmtFullBath     0.227122
BsmtUnfSF        0.214479
BedroomAbvGr     0.168213
ScreenPorch      0.111447
PoolArea         0.092404
MoSold           0.046432
3SsnPorch        0.044584
BsmtFinSF2      -0.011378
BsmtHalfBath    -0.016844
MiscVal         -0.021190
Id              -0.021917
LowQualFinSF    -0.025606
YrSold          -0.028923
OverallCond     -0.077856
MSSubClass      -0.084284
EnclosedPorch   -0.128578
KitchenAbvGr    -0.135907
Name: SalePrice, dtype: float64
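
To make the ranking step concrete, here is a small sketch on synthetic data. The column names `Quality`, `Noise` and `Price` are hypothetical stand-ins, and the numbers are randomly generated rather than taken from the dataset. It ranks each numeric column by its Pearson correlation with the target:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 200
quality = rng.integers(1, 11, size=n)     # built to be strongly related to price
noise_col = rng.normal(size=n)            # built to be unrelated to price
price = 20000 * quality + rng.normal(scale=5000, size=n)

toy = pd.DataFrame({"Quality": quality, "Noise": noise_col, "Price": price})

# Correlation of every numeric column with the target, strongest first.
ranking = toy.corr()["Price"].sort_values(ascending=False)
print(ranking)
```

Pearson correlation only measures linear association, so a low score does not rule out a non-linear relationship with the target.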
# encode categorical columns
from sklearn.preprocessing import LabelEncoder
cat_le = categorical_cols.apply(LabelEncoder().fit_transform)
cat_le['SalePrice'] = train['SalePrice']
# find important features from categorical values
correlation_cat = cat_le.corr()
correlation_cat.sort_values(["SalePrice"], ascending = False, inplace = True)
SalePrice        1.000000
Foundation       0.382479
CentralAir       0.251328
Electrical       0.234716
PavedDrive       0.231357
RoofStyle        0.222405
SaleCondition    0.213092
Neighborhood     0.210851
HouseStyle       0.180163
Fence            0.140640
Alley            0.139868
RoofMatl         0.132383
ExterCond        0.117303
Functional       0.115328
Exterior2nd      0.103766
Exterior1st      0.103551
Condition1       0.091155
MiscFeature      0.073609
LandSlope        0.051152
Street           0.041036
MasVnrType       0.029658
GarageCond       0.025149
LandContour      0.015453
BsmtCond         0.015058
BsmtFinType2     0.008041
Condition2       0.007513
GarageQual       0.006861
Utilities       -0.014314
SaleType        -0.054911
LotConfig       -0.067396
BldgType        -0.085591
Heating         -0.098812
BsmtFinType1    -0.103114
PoolQC          -0.126070
MSZoning        -0.166872
LotShape        -0.255580
BsmtExposure    -0.309043
HeatingQC       -0.400178
GarageType      -0.415283
FireplaceQu     -0.459605
GarageFinish    -0.549247
KitchenQual     -0.589189
BsmtQual        -0.620886
ExterQual       -0.636884
Name: SalePrice, dtype: float64
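
When reading these scores it helps to remember that `LabelEncoder` assigns integer codes in sorted order of the class labels, so the sign and size of the correlation depend on how the category names happen to sort. The sketch below uses made-up prices and assumes the usual quality scale Fa < TA < Gd < Ex; it compares the alphabetical codes with a hand-written ordinal mapping (`quality_order` is a hypothetical helper, not part of the notebook):

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Made-up prices: better quality grades sell for more.
toy = pd.DataFrame({
    "ExterQual": ["Ex", "Gd", "TA", "Fa", "Ex", "Gd", "TA", "Fa"],
    "SalePrice": [300000, 200000, 150000, 100000, 310000, 210000, 140000, 90000],
})

# LabelEncoder sorts the labels, so the codes follow alphabetical order.
encoder = LabelEncoder()
alphabetical = encoder.fit_transform(toy["ExterQual"])
print(dict(zip(encoder.classes_, range(len(encoder.classes_)))))

# Hypothetical ordinal mapping that follows the assumed quality scale.
quality_order = {"Fa": 1, "TA": 2, "Gd": 3, "Ex": 4}
ordinal = toy["ExterQual"].map(quality_order)

corr_alphabetical = pd.Series(alphabetical).corr(toy["SalePrice"])
corr_ordinal = ordinal.corr(toy["SalePrice"])
print(corr_alphabetical, corr_ordinal)
```

This is one plausible reason why quality columns such as ExterQual and KitchenQual show strongly negative scores in the listing: the best grade, Ex, receives the smallest code.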

Visualizing important features

fig, axarr = plt.subplots(2, 1, figsize = (14, 18))
axarr[0].set_title("Feature importance of numeric columns")
axarr[1].set_title("Feature importance of categorical columns");
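
The cell above creates the two axes and sets their titles, but the plotting calls themselves are not shown. One way the bars might be drawn is sketched here, using a handful of scores copied from the correlation listings; the choice of horizontal bars, the figure size and the `Agg` backend are assumptions made for this sketch:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# A few scores copied from the correlation listings.
num_scores = pd.Series({"OverallQual": 0.790982, "GrLivArea": 0.708624, "GarageCars": 0.640409})
cat_scores = pd.Series({"Foundation": 0.382479, "CentralAir": 0.251328, "ExterQual": -0.636884})

fig, axarr = plt.subplots(2, 1, figsize=(8, 8))
num_scores.sort_values().plot.barh(ax=axarr[0])   # one bar per numeric feature
cat_scores.sort_values().plot.barh(ax=axarr[1])   # one bar per categorical feature
axarr[0].set_title("Feature importance of numeric columns")
axarr[1].set_title("Feature importance of categorical columns")
fig.tight_layout()
```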


Data Preprocessing (Train set)

# drop columns where the number of null values is greater than 500
null_cols = train.columns[train.isnull().sum() > 500]
train.drop(columns=null_cols, inplace=True)
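
The same threshold rule can be tried on a toy frame. In this sketch the values are made up and the threshold of 2 is a scaled-down stand-in for the 500 used on the real data:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "Alley": [np.nan, np.nan, np.nan, "Grvl"],   # 3 missing values
    "LotFrontage": [65.0, np.nan, 68.0, 60.0],   # 1 missing value
    "LotArea": [8450, 9600, 11250, 9550],        # complete
})

threshold = 2  # toy stand-in for the 500 used on the real data
sparse_cols = toy.columns[toy.isnull().sum() > threshold]  # names of mostly-empty columns
reduced = toy.drop(columns=sparse_cols)
print(list(sparse_cols), list(reduced.columns))
```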

Let's remove the less important features. The features listed below are those whose correlation with SalePrice, in the listings above, is either close to zero or negative.

# list of less important features
less_important = ['Id', 'MSSubClass', 'OverallCond', 'BsmtFinSF2', 'LowQualFinSF', 'BsmtHalfBath', 'KitchenAbvGr', 'EnclosedPorch',
 '3SsnPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities',
 'LotConfig', 'LandSlope', 'Condition2', 'BldgType', 'MasVnrType', 'ExterQual', 'BsmtQual', 'BsmtExposure','BsmtFinType1',
 'Heating', 'HeatingQC', 'KitchenQual', 'GarageType', 'GarageFinish','SaleType']

# dropping less important columns
train.drop(less_important, axis = 1, inplace = True)

Let's check for null values.

LotFrontage 259
LotArea 0
Neighborhood 0
Condition1 0
HouseStyle 0
OverallQual 0
YearBuilt 0
YearRemodAdd 0
RoofStyle 0
RoofMatl 0
Exterior1st 0
Exterior2nd 0
MasVnrArea 8
ExterCond 0
Foundation 0
BsmtCond 37
BsmtFinSF1 0
BsmtFinType2 38
BsmtUnfSF 0
TotalBsmtSF 0
CentralAir 0
Electrical 1
1stFlrSF 0
2ndFlrSF 0
GrLivArea 0
BsmtFullBath 0
FullBath 0
HalfBath 0
BedroomAbvGr 0
TotRmsAbvGrd 0
Functional 0
Fireplaces 0
GarageYrBlt 81
GarageCars 0
GarageArea 0
GarageQual 81
GarageCond 81
PavedDrive 0
WoodDeckSF 0
OpenPorchSF 0
ScreenPorch 0
SaleCondition 0
SalePrice 0
# filling null values
# LotFrontage (numeric): fill with the mean of the observed values
train['LotFrontage'] = train['LotFrontage'].fillna(train['LotFrontage'].mean())
# remaining columns: 0 for numeric columns, 'NA' for basement and garage
# categories, and 'SBrkr' (the column's most frequent value) for Electrical
train['MasVnrArea'] = train['MasVnrArea'].fillna(0)
train['BsmtCond'] = train['BsmtCond'].fillna('NA')
train['BsmtFinType2'] = train['BsmtFinType2'].fillna('NA')
train['Electrical'] = train['Electrical'].fillna('SBrkr')
train['GarageYrBlt'] = train['GarageYrBlt'].fillna(0)
train['GarageQual'] = train['GarageQual'].fillna('NA')
train['GarageCond'] = train['GarageCond'].fillna('NA')
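
As a compact alternative, `DataFrame.fillna` also accepts a dictionary that maps column names to fill values. The sketch below applies the same three kinds of fill (mean, zero, and an explicit 'NA' label) to a made-up frame; `fill_values` is a hypothetical name introduced for illustration:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "MasVnrArea": [196.0, np.nan, 0.0],
    "GarageQual": ["TA", np.nan, "Gd"],
})

fill_values = {
    "LotFrontage": toy["LotFrontage"].mean(),  # numeric: mean of the observed values
    "MasVnrArea": 0,                           # numeric: zero
    "GarageQual": "NA",                        # categorical: explicit 'NA' label
}
filled = toy.fillna(fill_values)  # returns a new frame, `toy` is left unchanged
print(filled)
```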


The overall quality of a house is a very important factor in its price, as indicated by the correlation value.

plt.scatter(train.OverallQual, train.SalePrice, marker ="^")


It appears to be strongly correlated with the sale price, and we'll explore it again later.

According to the correlation, as well as common sense, ground living area is one of the major factors in the sale price. Let's first draw a scatter plot of GrLivArea against SalePrice.

plt.scatter(train.GrLivArea, train.SalePrice, c = "lightcoral", s=10)


A few aberrant points have a ground living area larger than 4000. We will treat them as outliers.

# removing outliers
train = train[train.GrLivArea < 4000]

Let's plot another scatter plot visualizing LotArea and SalePrice

plt.scatter(train.LotArea, train.SalePrice, c = "chocolate", s=10)


We will consider LotArea greater than 150000 as outliers.

#removing outliers
train = train[train.LotArea < 150000]
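
Both outlier filters are plain boolean masks on the frame. A minimal sketch with illustrative rows (the values are chosen for the example, not read from the dataset) shows the two conditions combined in a single mask:

```python
import pandas as pd

# Illustrative rows; only the first one passes both thresholds.
toy = pd.DataFrame({
    "GrLivArea": [1710, 1262, 4676, 5642],
    "LotArea": [8450, 215245, 40094, 63887],
    "SalePrice": [208500, 375000, 184750, 160000],
})

# Keep rows that satisfy both thresholds used in the notebook.
mask = (toy["GrLivArea"] < 4000) & (toy["LotArea"] < 150000)
kept = toy[mask]
print(len(toy), "->", len(kept))
```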

Another scatter plot of LotFrontage w.r.t SalePrice

plt.scatter(train.LotFrontage, train.SalePrice, c = "green", s=10)


OverallQual refers to the overall quality of the house. This is an important feature, and SalePrice largely depends on it: a house rated Very Excellent is more likely to sell at a high price. Let's analyse this column.

labels = 'Average', 'Above Average', 'Good', 'Very Good', 'Below Average','Excellent', 'Fair', 'Very Excellent', 'Poor', 'Very Poor'
explode = (0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.2, 0.3, 0.5, 0.7)

fig1, ax1 = plt.subplots()
ax1.pie(train['OverallQual'].value_counts(), explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=30)


27.3% of the houses in the train dataset have Average quality, and 0.1% are Very Poor in quality.

Let's see the SalePrice variation w.r.t OverallQual

fig = sns.barplot(x = 'OverallQual',y = 'SalePrice', data = train)
fig.set_xticklabels(labels=['Very Poor', 'Poor', 'Fair', 'Below Average', 'Average', 'Above Average', 'Good', 'Very Good', 'Excellent', 'Very Excellent'], rotation=90);


Foundation is another important feature. It indicates how strong a building can be, and the building's lifespan depends on it, so this column is worth analysing.

labels = 'Poured Concrete', 'Cinder Block', 'Brick & Tile', 'Slab', 'Stone', 'Wood'
explode = (0, 0.0, 0.0, 0.1, 0.3, 0.5)

fig1, ax1 = plt.subplots()
ax1.pie(train['Foundation'].value_counts(), explode=explode, labels=labels, autopct='%1.1f%%',
        shadow=True, startangle=30)


fig = sns.barplot(x = 'Foundation',y = 'SalePrice', data = train)
fig.set_xticklabels(labels=['Poured Concrete', 'Cinder Block', 'Brick & Tile', 'Wood', 'Slab', 'Stone'], rotation=45)
plt.xlabel("Types of Foundation");


Let's see how SalePrice varies w.r.t GarageCars

fig = sns.barplot(x = 'GarageCars',y = 'SalePrice', data = train)
fig.set_xticklabels(labels=['No car', '1 car', '2 cars', '3 cars', '4 cars'], rotation=45)
plt.xlabel("Number of cars in Garage");


fig = sns.barplot(x = 'Fireplaces',y = 'SalePrice', data = train)
fig.set_xticklabels(labels=['No Fireplace', '1 Fireplace', '2 Fireplaces', '3 Fireplaces'], rotation=45)
plt.xlabel("Number of Fireplaces");


Let's plot the distribution of the YearBuilt column, the year in which a house was built, against SalePrice.

sns.displot(x = 'YearBuilt', y = 'SalePrice', data = train);


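# train still contains object columns, so restrict the correlation to numeric ones
corr = train.corr(numeric_only=True)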
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
            square=True, linewidths=.5, cbar_kws={"shrink": .5});


Feature Engineering

Reducing skewness and kurtosis in the data.

sns.displot(x = 'LotArea', data = train, kde = True)
skewness = str(train["LotArea"].skew())
kurtosis = str(train["LotArea"].kurt())
plt.legend([skewness, kurtosis], title=("skewness and kurtosis"))
plt.title("Before applying transform technique")


# applying log transform
train['LotArea'] = np.log(train['LotArea'])
sns.displot(x = 'LotArea', data = train, kde = True)
skewness = str(train["LotArea"].skew())
kurtosis = str(train["LotArea"].kurt())
plt.legend([skewness, kurtosis],title=("skewness and kurtosis"))
plt.title("After applying transform technique")
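
To see why a log transform helps with right-skewed columns such as LotArea, here is a small sketch on synthetic data. The log-normal sample and its parameters are assumptions chosen to mimic a long right tail; they are not fitted to the dataset:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
# Synthetic right-skewed "area" values; the parameters are arbitrary.
lot_area = pd.Series(rng.lognormal(mean=9.0, sigma=0.5, size=1000))

skew_before = lot_area.skew()          # clearly positive for a long right tail
skew_after = np.log(lot_area).skew()   # close to zero once the tail is compressed
print(skew_before, skew_after)
```

The logarithm compresses large values more than small ones, which pulls the long right tail in towards the bulk of the distribution.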


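sns.displot(x = 'GrLivArea', data = train, kde = True)
skewness = str(train["GrLivArea"].skew())
kurtosis = str(train["GrLivArea"].kurt())
plt.legend([skewness,kurtosis],title=("skewness and kurtosis"))
plt.title("Before applying transform technique")


# applying log transform
train['GrLivArea'] = np.log(train['GrLivArea'])
sns.displot(x = 'GrLivArea', data = train, kde = True)
skewness = str(train["GrLivArea"].skew())
kurtosis = str(train["GrLivArea"].kurt())
plt.legend([skewness,kurtosis],title=("skewness and kurtosis"))
plt.title("After applying transform technique")


sns.displot(x = 'LotFrontage', data = train, kde = True)
skewness = str(train["LotFrontage"].skew())
kurtosis = str(train["LotFrontage"].kurt())
plt.legend([skewness,kurtosis],title=("skewness and kurtosis"))
plt.title("Before applying transform technique")


# applying cube-root transform
train['LotFrontage'] = np.cbrt(train['LotFrontage'])
sns.displot(x = 'LotFrontage', data = train, kde = True)
skewness = str(train["LotFrontage"].skew())
kurtosis = str(train["LotFrontage"].kurt())
plt.legend([skewness,kurtosis],title=("skewness and kurtosis"))
plt.title("After applying transform technique")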



It's time to create our matrix of independent features and our dependent variable.

x = train.drop(['SalePrice'], axis = 1)
y = train['SalePrice']
# label-encode every column of x (numeric columns are mapped to integer codes too)
from sklearn.preprocessing import LabelEncoder
x = x.apply(LabelEncoder().fit_transform)
LotFrontage LotArea Neighborhood Condition1 HouseStyle OverallQual YearBuilt YearRemodAdd RoofStyle RoofMatl ... GarageYrBlt GarageCars GarageArea GarageQual GarageCond PavedDrive WoodDeckSF OpenPorchSF ScreenPorch SaleCondition
0 36 327 5 2 5 6 104 53 1 0 ... 90 2 220 5 5 2 0 49 0 4
1 52 498 24 1 2 5 77 26 1 0 ... 63 2 149 5 5 2 187 0 0 4
2 39 702 5 2 5 6 102 52 1 0 ... 88 2 260 5 5 2 0 30 0 4
3 31 489 6 2 5 6 19 20 1 0 ... 85 3 284 5 5 2 0 24 0 0
4 56 925 15 2 5 7 101 50 1 0 ... 87 3 378 5 5 2 118 70 0 4

5 rows × 42 columns

0    208500
1    181500
2    223500
3    140000
4    250000
Name: SalePrice, dtype: int64
x.shape, y.shape
((1453, 42), (1453,))
#splitting the dataset into train and test set.
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 31)
len(x_train), len(x_test), len(y_train), len(y_test)
(1089, 364, 1089, 364)
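
The split sizes follow from the default behaviour of `train_test_split`, which holds out 25% of the rows when `test_size` is not given. A quick check on placeholder arrays with the same number of rows (the array contents are arbitrary):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Placeholder data with the same number of rows as x and y.
X = np.arange(1453 * 2).reshape(1453, 2)
y = np.arange(1453)

# No test_size given, so the default 25% hold-out applies.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=31)
print(len(X_tr), len(X_te))
```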
# feature scaling with x = (x - mean(x)) / std(x)
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
x_train = scale.fit_transform(x_train)
x_test = scale.transform(x_test)
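
The scaler is fitted on the training split only and then reused on the test split, so the test rows are standardized with the training mean and standard deviation. A minimal sketch on made-up numbers, compared against the formula x = (x - mean(x)) / std(x) computed by hand:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

train_part = np.array([[1.0, 100.0], [2.0, 200.0], [3.0, 300.0]])
test_part = np.array([[4.0, 400.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train_part)  # learns mean and std from the training part
test_scaled = scaler.transform(test_part)        # reuses those statistics, no refitting

# Same computation by hand, using the training statistics.
manual = (test_part - train_part.mean(axis=0)) / train_part.std(axis=0)
print(test_scaled, manual)
```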
# model evaluation function
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.model_selection import cross_val_score

def model_evaluate(result_df, model, name, x_train, y_train, x_test, y_test):
    model.fit(x_train, y_train)
    # R^2 on the data the model was fitted on
    score = model.score(x_train, y_train)
    # R^2 on the held-out split
    r2 = r2_score(y_test, model.predict(x_test))
    # 5-fold cross-validated R^2 and RMSE on the training split
    r2cv = cross_val_score(model, x_train, y_train, cv = 5, n_jobs=-1).mean()
    rmse = -cross_val_score(model, x_train, y_train, cv = 5, scoring="neg_root_mean_squared_error", n_jobs=-1).mean()
    return pd.concat([result_df, pd.DataFrame({
        "Model": [name],
        "Score": [score],
        "R^2 Score": [r2],
        "R^2(CV) Score": [r2cv],
        "RMSE(CV)": [rmse]
    })], ignore_index = True)

# empty frame that collects one row per model
result_df = pd.DataFrame(
    columns = ["Model", "R^2 Score", "R^2(CV) Score", "Score", "RMSE(CV)"]
)


from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1, random_state = 32)
result_df = model_evaluate(result_df, lasso_reg, "LASSO", x_train, y_train, x_test, y_test)


from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=.5)
result_df = model_evaluate(result_df, ridge_reg, "RIDGE", x_train, y_train, x_test, y_test)

Random Forest

from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor(n_estimators=1000)
result_df = model_evaluate(result_df, rf_reg, "Random Forest", x_train, y_train, x_test, y_test)


from sklearn.ensemble import GradientBoostingRegressor
gbr_reg = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.01, max_depth=1, random_state=31)
result_df = model_evaluate(result_df, gbr_reg, "Gradient Boosting", x_train, y_train, x_test, y_test)


import xgboost as XGB
xgb = XGB.XGBRegressor(learning_rate=0.01, n_estimators=1000, objective='reg:squarederror', random_state = 31)
result_df = model_evaluate(result_df, xgb, "XGBoost", x_train, y_train, x_test, y_test)
               Model  R^2 Score  R^2(CV) Score     Score      RMSE(CV)
0              LASSO   0.856342       0.832496  0.846727  31593.946886
1              RIDGE   0.856368       0.832531  0.846727  31590.819760
2      Random Forest   0.895982       0.877693  0.983182  26994.468702
3  Gradient Boosting   0.869701       0.859937  0.895763  28913.192718
4            XGBoost   0.907326       0.890311  0.995235  25477.275078
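
The two cross-validated columns come from `cross_val_score`. For a regressor its default scoring is R^2, and the `"neg_root_mean_squared_error"` scorer returns the negated RMSE, which is why the evaluation function flips the sign. A small sketch on synthetic data (the data-generating coefficients are arbitrary assumptions):

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
# Linear target with a little noise; the coefficients are arbitrary.
y = X @ np.array([3.0, -2.0, 1.0]) + rng.normal(scale=0.5, size=200)

model = Ridge(alpha=0.5)
r2_cv = cross_val_score(model, X, y, cv=5).mean()  # default scoring for regressors is R^2
neg_rmse = cross_val_score(model, X, y, cv=5, scoring="neg_root_mean_squared_error")
rmse_cv = -neg_rmse.mean()                          # flip the sign to get a positive RMSE
print(r2_cv, rmse_cv)
```

scikit-learn negates error metrics so that a larger score is always better, which keeps every scorer consistent for model selection.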

Preparing test set

Id MSSubClass MSZoning LotFrontage LotArea Street Alley LotShape LandContour Utilities ... ScreenPorch PoolArea PoolQC Fence MiscFeature MiscVal MoSold YrSold SaleType SaleCondition
0 1461 20 RH 80.0 11622 Pave NaN Reg Lvl AllPub ... 120 0 NaN MnPrv NaN 0 6 2010 WD Normal
1 1462 20 RL 81.0 14267 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN Gar2 12500 6 2010 WD Normal
2 1463 60 RL 74.0 13830 Pave NaN IR1 Lvl AllPub ... 0 0 NaN MnPrv NaN 0 3 2010 WD Normal
3 1464 60 RL 78.0 9978 Pave NaN IR1 Lvl AllPub ... 0 0 NaN NaN NaN 0 6 2010 WD Normal
4 1465 120 RL 43.0 5005 Pave NaN IR1 HLS AllPub ... 144 0 NaN NaN NaN 0 1 2010 WD Normal

5 rows × 80 columns

A large number of null values reflects poor data quality. As with the train set, we drop the less important features and the columns with more than 500 null values.

null_cols = test.columns[test.isnull().sum() > 500]
test.drop(columns=null_cols, inplace=True)
test.drop(columns=less_important, inplace=True)
# checking for null values in test set
LotFrontage 227
LotArea 0
Neighborhood 0
Condition1 0
HouseStyle 0
OverallQual 0
YearBuilt 0
YearRemodAdd 0
RoofStyle 0
RoofMatl 0
Exterior1st 1
Exterior2nd 1
MasVnrArea 15
ExterCond 0
Foundation 0
BsmtCond 45
BsmtFinSF1 1
BsmtFinType2 42
BsmtUnfSF 1
TotalBsmtSF 1
CentralAir 0
Electrical 0
1stFlrSF 0
2ndFlrSF 0
GrLivArea 0
BsmtFullBath 2
FullBath 0
HalfBath 0
BedroomAbvGr 0
TotRmsAbvGrd 0
Functional 2
Fireplaces 0
GarageYrBlt 78
GarageCars 1
GarageArea 1
GarageQual 78
GarageCond 78
PavedDrive 0
WoodDeckSF 0
OpenPorchSF 0
ScreenPorch 0
SaleCondition 0
# taking care of null values
test['LotFrontage'] = test['LotFrontage'].fillna(test['LotFrontage'].mean())
test['MasVnrArea'] = test['MasVnrArea'].fillna(0)
test['BsmtCond'] = test['BsmtCond'].fillna('NA')
test['BsmtFinType2'] = test['BsmtFinType2'].fillna('NA')
test['Electrical'] = test['Electrical'].fillna('SBrkr')
test['GarageYrBlt'] = test['GarageYrBlt'].fillna(0)
test['GarageQual'] = test['GarageQual'].fillna('NA')
test['GarageCond'] = test['GarageCond'].fillna('NA')
test['Exterior1st'] = test['Exterior1st'].fillna('VinylSd')
test['Exterior2nd'] = test['Exterior2nd'].fillna('VinylSd')
test['BsmtFinSF1'] = test['BsmtFinSF1'].fillna(0)
test['BsmtUnfSF'] = test['BsmtUnfSF'].fillna(0)
test['TotalBsmtSF'] = test['TotalBsmtSF'].fillna(0)
test['BsmtFullBath'] = test['BsmtFullBath'].fillna(0)
test['Functional'] = test['Functional'].fillna('Typ')
test['GarageCars'] = test['GarageCars'].fillna(0)
test['GarageArea'] = test['GarageArea'].fillna(0)
# reducing Skewness and Kurtosis
test['LotFrontage'] = np.cbrt(test['LotFrontage'])
test['GrLivArea'] = np.log(test['GrLivArea'])
test['LotArea'] = np.log(test['LotArea'])
# labelencode test data
test = test.apply(LabelEncoder().fit_transform)
# scale test data
test = scale.transform(test)
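
One caveat about the encoding step: integer codes are only comparable between the train and test sets when the same fitted encoder produces both. The cell above fits a fresh `LabelEncoder` on the test data, so a category can receive a different code than it did in training. A possible alternative, offered as an assumption and not as this notebook's method, is to fit scikit-learn's `OrdinalEncoder` on the training categories and reuse it, sending unseen categories to a sentinel value:

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical columns standing in for the train and test frames.
train_cat = pd.DataFrame({"Foundation": ["PConc", "CBlock", "BrkTil", "PConc"]})
test_cat = pd.DataFrame({"Foundation": ["CBlock", "Stone"]})  # "Stone" is unseen in training

# Fit once on the training categories; unseen test categories become -1.
encoder = OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1)
train_codes = encoder.fit_transform(train_cat)
test_codes = encoder.transform(test_cat)
print(train_codes.ravel(), test_codes.ravel())
```

With this approach "CBlock" maps to the same integer in both frames, which is what a model trained on the encoded training data expects.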

Model Comparison

The lower the root mean squared error (RMSE), the better the model.

               Model  R^2 Score  R^2(CV) Score     Score      RMSE(CV)
4            XGBoost   0.907326       0.890311  0.995235  25477.275078
2      Random Forest   0.895982       0.877693  0.983182  26994.468702
3  Gradient Boosting   0.869701       0.859937  0.895763  28913.192718
1              RIDGE   0.856368       0.832531  0.846727  31590.819760
0              LASSO   0.856342       0.832496  0.846727  31593.946886
sns.barplot(x="Model", y="RMSE(CV)", data=result_df)
plt.title("Models' RMSE Scores (Cross-Validated)", size=15)
plt.xticks(rotation=30, size=12)


ax = result_df.plot(x="Model", y=["Score", "R^2 Score", "R^2(CV) Score"], kind='bar', figsize=(6, 6))


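XGBoost has the lowest cross-validated RMSE (about 25,477) and the highest R^2 on both the held-out split (0.907) and cross-validation (0.890), so we choose it as the best model.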

Creating submission file

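# predict with XGBoost, refitted on the full training data;
# x is scaled with the same fitted scaler that was already applied to test
xgb.fit(scale.transform(x), y)
predictions = xgb.predict(test)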
sample_sub = pd.read_csv("./house-prices-advanced-regression-techniques/sample_submission.csv")
final_data = {'Id': sample_sub.Id, 'SalePrice': predictions}
final_submission = pd.DataFrame(data=final_data)
final_submission.to_csv('submission_file.csv', index =False)