Just to be upfront: I found this notebook somewhere, and I no longer remember where...
Overview¶
This notebook walks through the step-by-step process of building a machine learning model to predict house prices. Following a standard machine learning workflow, the process is divided into the following steps:
- Understanding the Problem;
- Exploratory Data Analysis;
- Data Preprocessing;
- Feature Selection;
- Modeling;
- Evaluation.
Understanding the problem¶
We are given a dataset with 79 explanatory variables describing (almost) every aspect of residential homes. Our task is to predict the prices of unseen houses using machine learning.
# basic library import section
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
pd.set_option('display.max_rows', None) # display all rows
warnings.filterwarnings('ignore')
%matplotlib inline
import matplotlib_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg')
# loading our two required datasets
train = pd.read_csv('./house-prices-advanced-regression-techniques/train.csv')
test = pd.read_csv('./house-prices-advanced-regression-techniques/test.csv')
train.shape, test.shape
((1460, 81), (1459, 80))
train.head(20)
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | SalePrice | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1 | 60 | RL | 65.0 | 8450 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 208500 |
1 | 2 | 20 | RL | 80.0 | 9600 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 5 | 2007 | WD | Normal | 181500 |
2 | 3 | 60 | RL | 68.0 | 11250 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 223500 |
3 | 4 | 70 | RL | 60.0 | 9550 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2006 | WD | Abnorml | 140000 |
4 | 5 | 60 | RL | 84.0 | 14260 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 12 | 2008 | WD | Normal | 250000 |
5 | 6 | 50 | RL | 85.0 | 14115 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | MnPrv | Shed | 700 | 10 | 2009 | WD | Normal | 143000 |
6 | 7 | 20 | RL | 75.0 | 10084 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | WD | Normal | 307000 |
7 | 8 | 60 | RL | NaN | 10382 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | Shed | 350 | 11 | 2009 | WD | Normal | 200000 |
8 | 9 | 50 | RM | 51.0 | 6120 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 4 | 2008 | WD | Abnorml | 129900 |
9 | 10 | 190 | RL | 50.0 | 7420 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 1 | 2008 | WD | Normal | 118000 |
10 | 11 | 20 | RL | 70.0 | 11200 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 2 | 2008 | WD | Normal | 129500 |
11 | 12 | 60 | RL | 85.0 | 11924 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 7 | 2006 | New | Partial | 345000 |
12 | 13 | 20 | RL | NaN | 12968 | Pave | NaN | IR2 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 9 | 2008 | WD | Normal | 144000 |
13 | 14 | 20 | RL | 91.0 | 10652 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 8 | 2007 | New | Partial | 279500 |
14 | 15 | 20 | RL | NaN | 10920 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | GdWo | NaN | 0 | 5 | 2008 | WD | Normal | 157000 |
15 | 16 | 45 | RM | 51.0 | 6120 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | GdPrv | NaN | 0 | 7 | 2007 | WD | Normal | 132000 |
16 | 17 | 20 | RL | NaN | 11241 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | NaN | NaN | Shed | 700 | 3 | 2010 | WD | Normal | 149000 |
17 | 18 | 90 | RL | 72.0 | 10791 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | Shed | 500 | 10 | 2006 | WD | Normal | 90000 |
18 | 19 | 20 | RL | 66.0 | 13695 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | NaN | NaN | 0 | 6 | 2008 | WD | Normal | 159000 |
19 | 20 | 20 | RL | 70.0 | 7560 | Pave | NaN | Reg | Lvl | AllPub | ... | 0 | NaN | MnPrv | NaN | 0 | 5 | 2009 | COD | Abnorml | 139000 |
20 rows × 81 columns
Let's print a concise summary of our train DataFrame.
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 Id 1460 non-null int64
1 MSSubClass 1460 non-null int64
2 MSZoning 1460 non-null object
3 LotFrontage 1201 non-null float64
4 LotArea 1460 non-null int64
5 Street 1460 non-null object
6 Alley 91 non-null object
7 LotShape 1460 non-null object
8 LandContour 1460 non-null object
9 Utilities 1460 non-null object
10 LotConfig 1460 non-null object
11 LandSlope 1460 non-null object
12 Neighborhood 1460 non-null object
13 Condition1 1460 non-null object
14 Condition2 1460 non-null object
15 BldgType 1460 non-null object
16 HouseStyle 1460 non-null object
17 OverallQual 1460 non-null int64
18 OverallCond 1460 non-null int64
19 YearBuilt 1460 non-null int64
20 YearRemodAdd 1460 non-null int64
21 RoofStyle 1460 non-null object
22 RoofMatl 1460 non-null object
23 Exterior1st 1460 non-null object
24 Exterior2nd 1460 non-null object
25 MasVnrType 1452 non-null object
26 MasVnrArea 1452 non-null float64
27 ExterQual 1460 non-null object
28 ExterCond 1460 non-null object
29 Foundation 1460 non-null object
30 BsmtQual 1423 non-null object
31 BsmtCond 1423 non-null object
32 BsmtExposure 1422 non-null object
33 BsmtFinType1 1423 non-null object
34 BsmtFinSF1 1460 non-null int64
35 BsmtFinType2 1422 non-null object
36 BsmtFinSF2 1460 non-null int64
37 BsmtUnfSF 1460 non-null int64
38 TotalBsmtSF 1460 non-null int64
39 Heating 1460 non-null object
40 HeatingQC 1460 non-null object
41 CentralAir 1460 non-null object
42 Electrical 1459 non-null object
43 1stFlrSF 1460 non-null int64
44 2ndFlrSF 1460 non-null int64
45 LowQualFinSF 1460 non-null int64
46 GrLivArea 1460 non-null int64
47 BsmtFullBath 1460 non-null int64
48 BsmtHalfBath 1460 non-null int64
49 FullBath 1460 non-null int64
50 HalfBath 1460 non-null int64
51 BedroomAbvGr 1460 non-null int64
52 KitchenAbvGr 1460 non-null int64
53 KitchenQual 1460 non-null object
54 TotRmsAbvGrd 1460 non-null int64
55 Functional 1460 non-null object
56 Fireplaces 1460 non-null int64
57 FireplaceQu 770 non-null object
58 GarageType 1379 non-null object
59 GarageYrBlt 1379 non-null float64
60 GarageFinish 1379 non-null object
61 GarageCars 1460 non-null int64
62 GarageArea 1460 non-null int64
63 GarageQual 1379 non-null object
64 GarageCond 1379 non-null object
65 PavedDrive 1460 non-null object
66 WoodDeckSF 1460 non-null int64
67 OpenPorchSF 1460 non-null int64
68 EnclosedPorch 1460 non-null int64
69 3SsnPorch 1460 non-null int64
70 ScreenPorch 1460 non-null int64
71 PoolArea 1460 non-null int64
72 PoolQC 7 non-null object
73 Fence 281 non-null object
74 MiscFeature 54 non-null object
75 MiscVal 1460 non-null int64
76 MoSold 1460 non-null int64
77 YrSold 1460 non-null int64
78 SaleType 1460 non-null object
79 SaleCondition 1460 non-null object
80 SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
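Before going further, it helps to rank the columns by missing values, since those counts drive the preprocessing decisions later. A quick sketch:
# columns with missing values, sorted by null count (descending)
missing = train.isnull().sum().sort_values(ascending=False)
missing[missing > 0]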
Descriptive statistics of our train set:
train.describe().T
count | mean | std | min | 25% | 50% | 75% | max | |
---|---|---|---|---|---|---|---|---|
Id | 1460.0 | 730.500000 | 421.610009 | 1.0 | 365.75 | 730.5 | 1095.25 | 1460.0 |
MSSubClass | 1460.0 | 56.897260 | 42.300571 | 20.0 | 20.00 | 50.0 | 70.00 | 190.0 |
LotFrontage | 1201.0 | 70.049958 | 24.284752 | 21.0 | 59.00 | 69.0 | 80.00 | 313.0 |
LotArea | 1460.0 | 10516.828082 | 9981.264932 | 1300.0 | 7553.50 | 9478.5 | 11601.50 | 215245.0 |
OverallQual | 1460.0 | 6.099315 | 1.382997 | 1.0 | 5.00 | 6.0 | 7.00 | 10.0 |
OverallCond | 1460.0 | 5.575342 | 1.112799 | 1.0 | 5.00 | 5.0 | 6.00 | 9.0 |
YearBuilt | 1460.0 | 1971.267808 | 30.202904 | 1872.0 | 1954.00 | 1973.0 | 2000.00 | 2010.0 |
YearRemodAdd | 1460.0 | 1984.865753 | 20.645407 | 1950.0 | 1967.00 | 1994.0 | 2004.00 | 2010.0 |
MasVnrArea | 1452.0 | 103.685262 | 181.066207 | 0.0 | 0.00 | 0.0 | 166.00 | 1600.0 |
BsmtFinSF1 | 1460.0 | 443.639726 | 456.098091 | 0.0 | 0.00 | 383.5 | 712.25 | 5644.0 |
BsmtFinSF2 | 1460.0 | 46.549315 | 161.319273 | 0.0 | 0.00 | 0.0 | 0.00 | 1474.0 |
BsmtUnfSF | 1460.0 | 567.240411 | 441.866955 | 0.0 | 223.00 | 477.5 | 808.00 | 2336.0 |
TotalBsmtSF | 1460.0 | 1057.429452 | 438.705324 | 0.0 | 795.75 | 991.5 | 1298.25 | 6110.0 |
1stFlrSF | 1460.0 | 1162.626712 | 386.587738 | 334.0 | 882.00 | 1087.0 | 1391.25 | 4692.0 |
2ndFlrSF | 1460.0 | 346.992466 | 436.528436 | 0.0 | 0.00 | 0.0 | 728.00 | 2065.0 |
LowQualFinSF | 1460.0 | 5.844521 | 48.623081 | 0.0 | 0.00 | 0.0 | 0.00 | 572.0 |
GrLivArea | 1460.0 | 1515.463699 | 525.480383 | 334.0 | 1129.50 | 1464.0 | 1776.75 | 5642.0 |
BsmtFullBath | 1460.0 | 0.425342 | 0.518911 | 0.0 | 0.00 | 0.0 | 1.00 | 3.0 |
BsmtHalfBath | 1460.0 | 0.057534 | 0.238753 | 0.0 | 0.00 | 0.0 | 0.00 | 2.0 |
FullBath | 1460.0 | 1.565068 | 0.550916 | 0.0 | 1.00 | 2.0 | 2.00 | 3.0 |
HalfBath | 1460.0 | 0.382877 | 0.502885 | 0.0 | 0.00 | 0.0 | 1.00 | 2.0 |
BedroomAbvGr | 1460.0 | 2.866438 | 0.815778 | 0.0 | 2.00 | 3.0 | 3.00 | 8.0 |
KitchenAbvGr | 1460.0 | 1.046575 | 0.220338 | 0.0 | 1.00 | 1.0 | 1.00 | 3.0 |
TotRmsAbvGrd | 1460.0 | 6.517808 | 1.625393 | 2.0 | 5.00 | 6.0 | 7.00 | 14.0 |
Fireplaces | 1460.0 | 0.613014 | 0.644666 | 0.0 | 0.00 | 1.0 | 1.00 | 3.0 |
GarageYrBlt | 1379.0 | 1978.506164 | 24.689725 | 1900.0 | 1961.00 | 1980.0 | 2002.00 | 2010.0 |
GarageCars | 1460.0 | 1.767123 | 0.747315 | 0.0 | 1.00 | 2.0 | 2.00 | 4.0 |
GarageArea | 1460.0 | 472.980137 | 213.804841 | 0.0 | 334.50 | 480.0 | 576.00 | 1418.0 |
WoodDeckSF | 1460.0 | 94.244521 | 125.338794 | 0.0 | 0.00 | 0.0 | 168.00 | 857.0 |
OpenPorchSF | 1460.0 | 46.660274 | 66.256028 | 0.0 | 0.00 | 25.0 | 68.00 | 547.0 |
EnclosedPorch | 1460.0 | 21.954110 | 61.119149 | 0.0 | 0.00 | 0.0 | 0.00 | 552.0 |
3SsnPorch | 1460.0 | 3.409589 | 29.317331 | 0.0 | 0.00 | 0.0 | 0.00 | 508.0 |
ScreenPorch | 1460.0 | 15.060959 | 55.757415 | 0.0 | 0.00 | 0.0 | 0.00 | 480.0 |
PoolArea | 1460.0 | 2.758904 | 40.177307 | 0.0 | 0.00 | 0.0 | 0.00 | 738.0 |
MiscVal | 1460.0 | 43.489041 | 496.123024 | 0.0 | 0.00 | 0.0 | 0.00 | 15500.0 |
MoSold | 1460.0 | 6.321918 | 2.703626 | 1.0 | 5.00 | 6.0 | 8.00 | 12.0 |
YrSold | 1460.0 | 2007.815753 | 1.328095 | 2006.0 | 2007.00 | 2008.0 | 2009.00 | 2010.0 |
SalePrice | 1460.0 | 180921.195890 | 79442.502883 | 34900.0 | 129975.00 | 163000.0 | 214000.00 | 755000.0 |
Feature Selection¶
Separating numeric and categorical columns
numeric_cols = train.select_dtypes(exclude=['object'])
categorical_cols = train.select_dtypes(include=['object'])
#finding important features
correlation_num = numeric_cols.corr()
correlation_num.sort_values(["SalePrice"], ascending = False, inplace = True)
correlation_num.SalePrice
SalePrice 1.000000
OverallQual 0.790982
GrLivArea 0.708624
GarageCars 0.640409
GarageArea 0.623431
TotalBsmtSF 0.613581
1stFlrSF 0.605852
FullBath 0.560664
TotRmsAbvGrd 0.533723
YearBuilt 0.522897
YearRemodAdd 0.507101
GarageYrBlt 0.486362
MasVnrArea 0.477493
Fireplaces 0.466929
BsmtFinSF1 0.386420
LotFrontage 0.351799
WoodDeckSF 0.324413
2ndFlrSF 0.319334
OpenPorchSF 0.315856
HalfBath 0.284108
LotArea 0.263843
BsmtFullBath 0.227122
BsmtUnfSF 0.214479
BedroomAbvGr 0.168213
ScreenPorch 0.111447
PoolArea 0.092404
MoSold 0.046432
3SsnPorch 0.044584
BsmtFinSF2 -0.011378
BsmtHalfBath -0.016844
MiscVal -0.021190
Id -0.021917
LowQualFinSF -0.025606
YrSold -0.028923
OverallCond -0.077856
MSSubClass -0.084284
EnclosedPorch -0.128578
KitchenAbvGr -0.135907
Name: SalePrice, dtype: float64
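For reference, the same ranking can be reduced to a shortlist in one line; the 0.5 cutoff here is an illustrative choice, not the criterion used below:
# numeric features whose |correlation| with SalePrice exceeds an assumed 0.5 cutoff
strong_num = correlation_num.SalePrice[correlation_num.SalePrice.abs() > 0.5].index.tolist()
strong_num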
# encode categorical columns
from sklearn.preprocessing import LabelEncoder
cat_le = categorical_cols.apply(LabelEncoder().fit_transform)
cat_le['SalePrice'] = train['SalePrice']
# find important features from categorical values
correlation_cat = cat_le.corr()
correlation_cat.sort_values(["SalePrice"], ascending = False, inplace = True)
correlation_cat.SalePrice
SalePrice 1.000000
Foundation 0.382479
CentralAir 0.251328
Electrical 0.234716
PavedDrive 0.231357
RoofStyle 0.222405
SaleCondition 0.213092
Neighborhood 0.210851
HouseStyle 0.180163
Fence 0.140640
Alley 0.139868
RoofMatl 0.132383
ExterCond 0.117303
Functional 0.115328
Exterior2nd 0.103766
Exterior1st 0.103551
Condition1 0.091155
MiscFeature 0.073609
LandSlope 0.051152
Street 0.041036
MasVnrType 0.029658
GarageCond 0.025149
LandContour 0.015453
BsmtCond 0.015058
BsmtFinType2 0.008041
Condition2 0.007513
GarageQual 0.006861
Utilities -0.014314
SaleType -0.054911
LotConfig -0.067396
BldgType -0.085591
Heating -0.098812
BsmtFinType1 -0.103114
PoolQC -0.126070
MSZoning -0.166872
LotShape -0.255580
BsmtExposure -0.309043
HeatingQC -0.400178
GarageType -0.415283
FireplaceQu -0.459605
GarageFinish -0.549247
KitchenQual -0.589189
BsmtQual -0.620886
ExterQual -0.636884
Name: SalePrice, dtype: float64
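A caveat: LabelEncoder assigns integer codes in alphabetical order, so these correlations depend on an essentially arbitrary ordering and are only a rough signal. As a sanity check that avoids the encoding altogether, we can compare mean SalePrice across the raw categories; a sketch for one column:
# mean SalePrice per raw category, no encoding involved
train.groupby('Neighborhood')['SalePrice'].mean().sort_values(ascending=False).head()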
Visualizing important features
fig, axarr = plt.subplots(2, 1, figsize = (14, 18))
correlation_num.SalePrice.plot.bar(ax=axarr[0])
correlation_cat.SalePrice.plot.bar(ax=axarr[1])
axarr[0].set_title("Feature importance of numeric columns")
axarr[1].set_title("Feature importance of categorical columns");
Data Preprocessing (Train set)¶
# dropping columns where the number of null values is greater than 500
null_cols = train.columns[train.isnull().sum() > 500]
train.drop(null_cols, axis = 1, inplace = True)
Let's remove the less important features, selected mainly by a small absolute correlation with SalePrice.
# list of less important features
less_important = ['Id', 'MSSubClass', 'OverallCond', 'BsmtFinSF2', 'LowQualFinSF', 'BsmtHalfBath', 'KitchenAbvGr', 'EnclosedPorch',
'3SsnPorch', 'PoolArea', 'MiscVal', 'MoSold', 'YrSold', 'MSZoning', 'Street', 'LotShape', 'LandContour', 'Utilities',
'LotConfig', 'LandSlope', 'Condition2', 'BldgType', 'MasVnrType', 'ExterQual', 'BsmtQual', 'BsmtExposure','BsmtFinType1',
'Heating', 'HeatingQC', 'KitchenQual', 'GarageType', 'GarageFinish','SaleType']
# dropping less important columns
train.drop(less_important, axis = 1, inplace = True)
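As a cross-check on the hand-picked list, a simple threshold rule expresses the same idea; a sketch with 0.25 as an assumed cutoff (it reproduces much of the list above, though the two don't coincide exactly):
# features whose |correlation| with SalePrice falls below an assumed 0.25 cutoff
weak_num = correlation_num.SalePrice[correlation_num.SalePrice.abs() < 0.25].index.tolist()
weak_cat = correlation_cat.SalePrice[correlation_cat.SalePrice.abs() < 0.25].index.tolist()
sorted(weak_num + weak_cat)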
Let's check for null values.
pd.DataFrame(train.isna().sum())
0 | |
---|---|
LotFrontage | 259 |
LotArea | 0 |
Neighborhood | 0 |
Condition1 | 0 |
HouseStyle | 0 |
OverallQual | 0 |
YearBuilt | 0 |
YearRemodAdd | 0 |
RoofStyle | 0 |
RoofMatl | 0 |
Exterior1st | 0 |
Exterior2nd | 0 |
MasVnrArea | 8 |
ExterCond | 0 |
Foundation | 0 |
BsmtCond | 37 |
BsmtFinSF1 | 0 |
BsmtFinType2 | 38 |
BsmtUnfSF | 0 |
TotalBsmtSF | 0 |
CentralAir | 0 |
Electrical | 1 |
1stFlrSF | 0 |
2ndFlrSF | 0 |
GrLivArea | 0 |
BsmtFullBath | 0 |
FullBath | 0 |
HalfBath | 0 |
BedroomAbvGr | 0 |
TotRmsAbvGrd | 0 |
Functional | 0 |
Fireplaces | 0 |
GarageYrBlt | 81 |
GarageCars | 0 |
GarageArea | 0 |
GarageQual | 81 |
GarageCond | 81 |
PavedDrive | 0 |
WoodDeckSF | 0 |
OpenPorchSF | 0 |
ScreenPorch | 0 |
SaleCondition | 0 |
SalePrice | 0 |
# filling null values
# for numerical columns, we will fill null values with mean of the rest of the column
train['LotFrontage'].fillna(train['LotFrontage'].mean(), inplace = True)
# for the remaining columns, we fill with 0 / 'NA' where a missing value means the feature is absent, or with the most frequent value
train['MasVnrArea'].fillna(0 , inplace = True)
train['BsmtCond'].fillna('NA' , inplace = True)
train['BsmtFinType2'].fillna('NA' , inplace = True)
train['Electrical'].fillna('SBrkr' , inplace = True)
train['GarageYrBlt'].fillna(0 , inplace = True)
train['GarageQual'].fillna('NA' , inplace = True)
train['GarageCond'].fillna('NA' , inplace = True)
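The same fill pattern repeats for the test set later, so a generic helper makes the intent explicit. A minimal sketch (fill_missing and its column groups are our own names, not part of the original notebook):
def fill_missing(df, num_mean=(), num_zero=(), cat_na=(), cat_mode=()):
    # numeric: mean for genuinely missing measurements, 0 where absence means "none"
    for col in num_mean:
        df[col] = df[col].fillna(df[col].mean())
    for col in num_zero:
        df[col] = df[col].fillna(0)
    # categorical: 'NA' where the feature simply doesn't exist, else the most frequent value
    for col in cat_na:
        df[col] = df[col].fillna('NA')
    for col in cat_mode:
        df[col] = df[col].fillna(df[col].mode()[0])
    return df
# e.g. train = fill_missing(train, num_mean=['LotFrontage'], num_zero=['MasVnrArea', 'GarageYrBlt'],
#                           cat_na=['BsmtCond', 'BsmtFinType2', 'GarageQual', 'GarageCond'], cat_mode=['Electrical'])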
EDA¶
The overall quality of a house is a very important factor in its price, as indicated by the correlation value.
plt.scatter(train.OverallQual, train.SalePrice, marker ="^")
plt.xlabel("OverallQual")
plt.ylabel("SalePrice")
plt.show()
It appears strongly correlated with the sale price; we'll explore it again later.
According to the correlation values as well as our intuition, ground living area is one of the major factors in the sale price. Let's first draw a scatter plot of GrLivArea against SalePrice.
plt.scatter(train.GrLivArea, train.SalePrice, c = "lightcoral", s=10)
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()
There are a few aberrant points with ground living area greater than 4000 square feet. We will treat them as outliers.
# removing outliers
train = train[train.GrLivArea < 4000]
Let's plot another scatter plot visualizing LotArea against SalePrice.
plt.scatter(train.LotArea, train.SalePrice, c = "chocolate", s=10)
plt.xlabel("LotArea")
plt.ylabel("SalePrice")
plt.show()
We will treat LotArea greater than 150000 as outliers.
#removing outliers
train = train[train.LotArea < 150000]
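Both cutoffs (4000 for GrLivArea, 150000 for LotArea) were read off the scatter plots. A quantile rule can flag similar extremes programmatically; a sketch, with the 99.5th percentile as an assumed threshold:
# flag extreme observations above an assumed 99.5th-percentile threshold
for col in ['GrLivArea', 'LotArea']:
    cutoff = train[col].quantile(0.995)
    print(col, '| cutoff:', cutoff, '| rows above:', (train[col] > cutoff).sum())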
Another scatter plot: LotFrontage w.r.t. SalePrice.
plt.scatter(train.LotFrontage, train.SalePrice, c = "green", s=10)
plt.xlabel("LotFrontage")
plt.ylabel("SalePrice")
plt.show()
OverallQual refers to the overall quality rating of the house. This is an important feature: SalePrice largely depends on it, since a house rated Very Excellent is more likely to sell at a high price. Let's analyse this column.
# NOTE: labels assume this is the descending-frequency order returned by value_counts()
labels = 'Average', 'Above Average', 'Good', 'Very Good', 'Below Average','Excellent', 'Fair', 'Very Excellent', 'Poor', 'Very Poor'
explode = (0, 0.0, 0.0, 0.1, 0.1, 0.1, 0.2, 0.3, 0.5, 0.7)
fig1, ax1 = plt.subplots()
ax1.pie(train['OverallQual'].value_counts(), explode=explode, labels=labels, autopct='%1.1f%%',
shadow=True, startangle=30)
ax1.axis('equal')
plt.show()
27.3% of the houses in the train dataset have Average quality, and only 0.1% are Very Poor.
Let's see how SalePrice varies w.r.t. OverallQual.
fig = sns.barplot(x = 'OverallQual',y = 'SalePrice', data = train)
fig.set_xticklabels(labels=['Very Poor', 'Poor', 'Fair', 'Below Average', 'Average', 'Above Average', 'Good', 'Very Good', 'Excellent', 'Very Excellent'], rotation=90);
Foundation is another important feature: it indicates how strong a building can be, and a building's lifespan depends on it. So this column is worth analysing.
# NOTE: labels assume this is the descending-frequency order returned by value_counts()
labels = 'Poured Concrete', 'Cinder Block', 'Brick & Tile', 'Slab', 'Stone', 'Wood'
explode = (0, 0.0, 0.0, 0.1, 0.3, 0.5)
fig1, ax1 = plt.subplots()
ax1.pie(train['Foundation'].value_counts(), explode=explode, labels=labels, autopct='%1.1f%%',
shadow=True, startangle=30)
ax1.axis('equal')
plt.show()
fig = sns.barplot(x = 'Foundation',y = 'SalePrice', data = train)
fig.set_xticklabels(labels=['Poured Concrete', 'Cinder Block', 'Brick & Tile', 'Wood', 'Slab', 'Stone'], rotation=45)
plt.xlabel("Types of Foundation");
Let's see how SalePrice varies w.r.t. GarageCars.
fig = sns.barplot(x = 'GarageCars',y = 'SalePrice', data = train)
fig.set_xticklabels(labels=['No car', '1 car', '2 cars', '3 cars', '4 cars'], rotation=45)
plt.xlabel("Number of cars in Garage");
fig = sns.barplot(x = 'Fireplaces',y = 'SalePrice', data = train)
fig.set_xticklabels(labels=['No Fireplace', '1 Fireplace', '2 Fireplaces', '3 Fireplaces'], rotation=45)
plt.xlabel("Number of Fireplaces");
Let's plot a distribution of the YearBuilt column (the year each house was built) against SalePrice.
sns.displot(x = 'YearBuilt', y = 'SalePrice', data = train);
# correlation over the numeric columns only (the remaining object columns are excluded explicitly)
corr = train.select_dtypes(exclude=['object']).corr()
mask = np.triu(np.ones_like(corr, dtype=bool))
f, ax = plt.subplots(figsize=(11, 9))
cmap = sns.diverging_palette(230, 20, as_cmap=True)
sns.heatmap(corr, mask=mask, cmap=cmap, vmax=.3, center=0,
square=True, linewidths=.5, cbar_kws={"shrink": .5});
Feature Engineering¶
Reducing skewness and kurtosis in the data¶
sns.displot(x = 'LotArea', data = train, kde = True)
skewness = str(train["LotArea"].skew())
kurtosis = str(train["LotArea"].kurt())
plt.legend([skewness, kurtosis], title=("skewness and kurtosis"))
plt.title("Before applying transform technique")
plt.show()
#applying log transform
train['LotArea']=np.log(train['LotArea'])
sns.displot(x = 'LotArea', data = train, kde = True)
skewness = str(train["LotArea"].skew())
kurtosis = str(train["LotArea"].kurt())
plt.legend([skewness, kurtosis],title=("skewness and kurtosis"))
plt.title("After applying transform technique")
plt.show()
sns.displot(x = 'GrLivArea', data = train, kde = True)
skewness=str(train["GrLivArea"].skew())
kurtosis=str(train["GrLivArea"].kurt())
plt.legend([skewness,kurtosis],title=("skewness and kurtosis"))
plt.title("Before applying transform technique")
plt.show()
train['GrLivArea']=np.log(train['GrLivArea'])
sns.displot(x = 'GrLivArea', data = train, kde = True)
skewness=str(train["GrLivArea"].skew())
kurtosis=str(train["GrLivArea"].kurt())
plt.legend([skewness,kurtosis],title=("skewness and kurtosis"))
plt.title("After applying transform technique")
plt.show()
sns.displot(x = 'LotFrontage', data = train, kde = True)
skewness=str(train["LotFrontage"].skew())
kurtosis=str(train["LotFrontage"].kurt())
plt.legend([skewness,kurtosis],title=("skewness and kurtosis"))
plt.title("Before applying transform technique")
plt.show()
train['LotFrontage'] = np.cbrt(train['LotFrontage'])
sns.displot(x = 'LotFrontage', data = train, kde = True)
skewness=str(train["LotFrontage"].skew())
kurtosis=str(train["LotFrontage"].kurt())
plt.legend([skewness,kurtosis],title=("skewness and kurtosis"))
plt.title("After applying transform technique")
plt.show()
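To see the effect of all three transforms at a glance, we can print the current skewness and kurtosis side by side; a small sketch (the columns are already transformed at this point):
# report skewness/kurtosis of the transformed columns
for col in ['LotArea', 'GrLivArea', 'LotFrontage']:
    print(f"{col}: skew={train[col].skew():.3f}, kurt={train[col].kurt():.3f}")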
Modeling¶
It's time to create our independent feature matrix x and dependent target vector y.
x = train.drop(['SalePrice'], axis = 1)
y = train['SalePrice']
# label-encode every column of x (note this also rank-codes the numeric columns, since apply hits all of them)
from sklearn.preprocessing import LabelEncoder
x = x.apply(LabelEncoder().fit_transform)
x.head()
LotFrontage | LotArea | Neighborhood | Condition1 | HouseStyle | OverallQual | YearBuilt | YearRemodAdd | RoofStyle | RoofMatl | ... | GarageYrBlt | GarageCars | GarageArea | GarageQual | GarageCond | PavedDrive | WoodDeckSF | OpenPorchSF | ScreenPorch | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 36 | 327 | 5 | 2 | 5 | 6 | 104 | 53 | 1 | 0 | ... | 90 | 2 | 220 | 5 | 5 | 2 | 0 | 49 | 0 | 4 |
1 | 52 | 498 | 24 | 1 | 2 | 5 | 77 | 26 | 1 | 0 | ... | 63 | 2 | 149 | 5 | 5 | 2 | 187 | 0 | 0 | 4 |
2 | 39 | 702 | 5 | 2 | 5 | 6 | 102 | 52 | 1 | 0 | ... | 88 | 2 | 260 | 5 | 5 | 2 | 0 | 30 | 0 | 4 |
3 | 31 | 489 | 6 | 2 | 5 | 6 | 19 | 20 | 1 | 0 | ... | 85 | 3 | 284 | 5 | 5 | 2 | 0 | 24 | 0 | 0 |
4 | 56 | 925 | 15 | 2 | 5 | 7 | 101 | 50 | 1 | 0 | ... | 87 | 3 | 378 | 5 | 5 | 2 | 118 | 70 | 0 | 4 |
5 rows × 42 columns
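Note that label encoding imposes an arbitrary integer order on nominal categories. Tree-based models tolerate this, but linear models such as Lasso and Ridge usually do better with one-hot encoding; an alternative sketch (not used in the pipeline below):
# one-hot alternative: expand each categorical column into 0/1 indicator columns
x_onehot = pd.get_dummies(train.drop(['SalePrice'], axis = 1), drop_first=True)
x_onehot.shape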
y.head()
0 208500
1 181500
2 223500
3 140000
4 250000
Name: SalePrice, dtype: int64
x.shape, y.shape
((1453, 42), (1453,))
# splitting the dataset into train and test sets
from sklearn.model_selection import train_test_split
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state = 31)
len(x_train), len(x_test), len(y_train), len(y_test)
(1089, 364, 1089, 364)
# feature scaling with x = (x - mean(x)) / std(x)
from sklearn.preprocessing import StandardScaler
scale = StandardScaler()
x_train = scale.fit_transform(x_train)
x_test = scale.transform(x_test)
#model evaluation function
from sklearn.metrics import r2_score, mean_absolute_error
from sklearn.model_selection import cross_val_score
def model_evaluate(result_df, model, name, x_train, y_train, x_test, y_test):
    # fit on the training split and collect train score, held-out R^2, and cross-validated metrics
    model.fit(x_train, y_train)
    score = model.score(x_train, y_train)
    r2 = r2_score(y_test, model.predict(x_test))
    r2cv = cross_val_score(model, x_train, y_train, cv = 5, n_jobs=-1).mean()
    rmse = -cross_val_score(model, x_train, y_train, cv = 5, scoring="neg_root_mean_squared_error", n_jobs=-1).mean()
    return pd.concat([result_df, pd.DataFrame({
        "Model": [name],
        "Score": [score],
        "R^2 Score": [r2],
        "R^2(CV) Score": [r2cv],
        "RMSE(CV)": [rmse]
    })], ignore_index = True)
result_df = pd.DataFrame(
    columns = ["Model", "R^2 Score", "R^2(CV) Score", "Score", "RMSE(CV)"]
)
Lasso¶
from sklearn.linear_model import Lasso
lasso_reg = Lasso(alpha=0.1, random_state = 32)
result_df = model_evaluate(result_df, lasso_reg, "LASSO", x_train, y_train, x_test, y_test)
Ridge¶
from sklearn.linear_model import Ridge
ridge_reg = Ridge(alpha=.5)
result_df = model_evaluate(result_df, ridge_reg, "RIDGE", x_train, y_train, x_test, y_test)
Random Forest¶
from sklearn.ensemble import RandomForestRegressor
rf_reg = RandomForestRegressor(n_estimators=1000)
result_df = model_evaluate(result_df, rf_reg, "Random Forest", x_train, y_train, x_test, y_test)
GradientBoostingRegressor¶
from sklearn.ensemble import GradientBoostingRegressor
gbr_reg = GradientBoostingRegressor(n_estimators=1000, learning_rate=0.01, max_depth=1, random_state=31)
result_df = model_evaluate(result_df, gbr_reg, "Gradient Boosting", x_train, y_train, x_test, y_test)
Xgboost¶
import xgboost as XGB
xgb = XGB.XGBRegressor(learning_rate=0.01, n_estimators=1000, objective='reg:squarederror', random_state = 31)
result_df = model_evaluate(result_df, xgb, "XGBoost", x_train, y_train, x_test, y_test)
result_df
Model | R^2 Score | R^2(CV) Score | Score | RMSE(CV) | |
---|---|---|---|---|---|
0 | LASSO | 0.856342 | 0.832496 | 0.846727 | 31593.946886 |
1 | RIDGE | 0.856368 | 0.832531 | 0.846727 | 31590.819760 |
2 | Random Forest | 0.895982 | 0.877693 | 0.983182 | 26994.468702 |
3 | Gradient Boosting | 0.869701 | 0.859937 | 0.895763 | 28913.192718 |
4 | XGBoost | 0.907326 | 0.890311 | 0.995235 | 25477.275078 |
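All models above were run with fixed hyperparameters, so a small grid search could likely improve the boosted models further; a sketch with an illustrative grid (the parameter values are assumptions, not tuned choices):
from sklearn.model_selection import GridSearchCV
param_grid = {'n_estimators': [500, 1000], 'learning_rate': [0.01, 0.05], 'max_depth': [3, 5]}
search = GridSearchCV(XGB.XGBRegressor(objective='reg:squarederror', random_state=31),
                      param_grid, cv=5, scoring='neg_root_mean_squared_error', n_jobs=-1)
search.fit(x_train, y_train)
search.best_params_, -search.best_score_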
Preparing test set¶
test.head()
Id | MSSubClass | MSZoning | LotFrontage | LotArea | Street | Alley | LotShape | LandContour | Utilities | ... | ScreenPorch | PoolArea | PoolQC | Fence | MiscFeature | MiscVal | MoSold | YrSold | SaleType | SaleCondition | |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
0 | 1461 | 20 | RH | 80.0 | 11622 | Pave | NaN | Reg | Lvl | AllPub | ... | 120 | 0 | NaN | MnPrv | NaN | 0 | 6 | 2010 | WD | Normal |
1 | 1462 | 20 | RL | 81.0 | 14267 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | Gar2 | 12500 | 6 | 2010 | WD | Normal |
2 | 1463 | 60 | RL | 74.0 | 13830 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | MnPrv | NaN | 0 | 3 | 2010 | WD | Normal |
3 | 1464 | 60 | RL | 78.0 | 9978 | Pave | NaN | IR1 | Lvl | AllPub | ... | 0 | 0 | NaN | NaN | NaN | 0 | 6 | 2010 | WD | Normal |
4 | 1465 | 120 | RL | 43.0 | 5005 | Pave | NaN | IR1 | HLS | AllPub | ... | 144 | 0 | NaN | NaN | NaN | 0 | 1 | 2010 | WD | Normal |
5 rows × 80 columns
Many null values reflect poor data quality. As with the train set, drop the less important features and the columns with more than 500 null values.
null_cols = test.columns[test.isnull().sum() > 500]
test.drop(null_cols, axis = 1, inplace = True)
test.drop(less_important, axis = 1, inplace = True)
# checking for null values in test set
pd.DataFrame(test.isna().sum())
0 | |
---|---|
LotFrontage | 227 |
LotArea | 0 |
Neighborhood | 0 |
Condition1 | 0 |
HouseStyle | 0 |
OverallQual | 0 |
YearBuilt | 0 |
YearRemodAdd | 0 |
RoofStyle | 0 |
RoofMatl | 0 |
Exterior1st | 1 |
Exterior2nd | 1 |
MasVnrArea | 15 |
ExterCond | 0 |
Foundation | 0 |
BsmtCond | 45 |
BsmtFinSF1 | 1 |
BsmtFinType2 | 42 |
BsmtUnfSF | 1 |
TotalBsmtSF | 1 |
CentralAir | 0 |
Electrical | 0 |
1stFlrSF | 0 |
2ndFlrSF | 0 |
GrLivArea | 0 |
BsmtFullBath | 2 |
FullBath | 0 |
HalfBath | 0 |
BedroomAbvGr | 0 |
TotRmsAbvGrd | 0 |
Functional | 2 |
Fireplaces | 0 |
GarageYrBlt | 78 |
GarageCars | 1 |
GarageArea | 1 |
GarageQual | 78 |
GarageCond | 78 |
PavedDrive | 0 |
WoodDeckSF | 0 |
OpenPorchSF | 0 |
ScreenPorch | 0 |
SaleCondition | 0 |
# taking care of null values
test['LotFrontage'].fillna(test['LotFrontage'].mean(), inplace = True)
test['MasVnrArea'].fillna(0 , inplace = True)
test['BsmtCond'].fillna('NA' , inplace = True)
test['BsmtFinType2'].fillna('NA' , inplace = True)
test['Electrical'].fillna('SBrkr' , inplace = True)
test['GarageYrBlt'].fillna(0 , inplace = True)
test['GarageQual'].fillna('NA' , inplace = True)
test['GarageCond'].fillna('NA' , inplace = True)
test['Exterior1st'].fillna('VinylSd' , inplace = True)
test['Exterior2nd'].fillna('VinylSd' , inplace = True)
test['BsmtFinSF1'].fillna(0 , inplace = True)
test['BsmtUnfSF'].fillna(0 , inplace = True)
test['TotalBsmtSF'].fillna(0 , inplace = True)
test['BsmtFullBath'].fillna(0 , inplace = True)
test['Functional'].fillna('Typ' , inplace = True)
test['GarageCars'].fillna(0 , inplace = True)
test['GarageArea'].fillna(0, inplace = True)
# reducing Skewness and Kurtosis
test['LotFrontage'] = np.cbrt(test['LotFrontage'])
test['GrLivArea'] = np.log(test['GrLivArea'])
test['LotArea'] = np.log(test['LotArea'])
# label-encode test data
# CAVEAT: the encoders are refit on the test set alone, so category-to-integer codes may not
# match the train encoding exactly; fitting encoders on the combined data would be more robust
test = test.apply(LabelEncoder().fit_transform)
# scale test data
test = scale.transform(test)
Model Comparison¶
The lower the Root Mean Squared Error (RMSE), the better the model.
result_df.sort_values(by="RMSE(CV)")
Model | R^2 Score | R^2(CV) Score | Score | RMSE(CV) | |
---|---|---|---|---|---|
4 | XGBoost | 0.907326 | 0.890311 | 0.995235 | 25477.275078 |
2 | Random Forest | 0.895982 | 0.877693 | 0.983182 | 26994.468702 |
3 | Gradient Boosting | 0.869701 | 0.859937 | 0.895763 | 28913.192718 |
1 | RIDGE | 0.856368 | 0.832531 | 0.846727 | 31590.819760 |
0 | LASSO | 0.856342 | 0.832496 | 0.846727 | 31593.946886 |
plt.figure(figsize=(12,8))
sns.barplot(x="Model", y="RMSE(CV)", data=result_df)
plt.title("Models' RMSE Scores (Cross-Validated)", size=15)
plt.xticks(rotation=30, size=12)
plt.show()
ax = result_df.plot(x="Model", y=["Score", "R^2 Score", "R^2(CV) Score"], kind='bar', figsize=(6, 6))
As a result, we think XGBoost is the best model.
Creating submission file¶
# predict with XGBoost: refit on the full training data, scaled with the fitted scaler
# so it matches the representation of the (already scaled) test set
xgb.fit(scale.transform(x), y)
predictions = xgb.predict(test)
sample_sub = pd.read_csv("./house-prices-advanced-regression-techniques/sample_submission.csv")
final_data = {'Id': sample_sub.Id, 'SalePrice': predictions}
final_submission = pd.DataFrame(data=final_data)
final_submission.to_csv('submission_file.csv', index=False)
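A quick sanity check before uploading; a sketch:
# the submission should have one row per test Id and plausible price values
final_submission.shape, final_submission['SalePrice'].describe()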