Ames Housing Data

First, show system information and load the packages we need.

In [1]:
versioninfo()
using StatsBase, Plots, DataFrames, FreqTables, ConstrainedLasso, Mosek, MultivariateStats, StatsFuns;
Julia Version 0.6.0
Commit 903644385b (2017-06-19 13:05 UTC)
Platform Info:
  OS: macOS (x86_64-apple-darwin13.4.0)
  CPU: Intel(R) Core(TM) i5-6267U CPU @ 2.90GHz
  WORD_SIZE: 64
  BLAS: libopenblas (USE64BITINT DYNAMIC_ARCH NO_AFFINITY Haswell)
  LAPACK: libopenblas64_
  LIBM: libopenlibm
  LLVM: libLLVM-3.9.1 (ORCJIT, skylake)

Let's load the data. The Ames dataset is available from the Journal of Statistics Education (JSE) Data Archive.

In [2]:
raw_ames, raw_colnames = readcsv("ames.csv", header=true);

There are 82 columns, which include 23 nominal, 23 ordinal, 14 discrete, and 20 continuous variables (plus 2 observation identifiers).

In [3]:
raw_colnames
Out[3]:
1×82 Array{AbstractString,2}:
 "Order"  "PID"  "MS.SubClass"  "MS.Zoning"  …  "Sale.Condition"  "SalePrice"

In the column names, replace "." with "".

In [4]:
colnames = similar(raw_colnames[3:end])

for i in eachindex(raw_colnames[3:end]) 
    colnames[i] = replace(raw_colnames[3:end][i], ".", "")
end

The variables Order and PID are observation identifiers, so they are not included. We then convert the data into a data frame.

In [5]:
ames_df = convert(DataFrame, raw_ames[:, 3:end])
rename!(ames_df, [(Symbol("x$i") => Symbol("$(colnames[i])")) for i in eachindex(colnames)])
head(ames_df)
Out[5]:
[Table output omitted: head(ames_df) shows the first 6 rows of all 80 columns, from MSSubClass through SalePrice.]

Now we identify the nominal and ordinal variables. In addition to the 46 nominal/ordinal variables specified in the data documentation, we also treat MoSold and YrSold (indicated as discrete in the documentation) as categorical variables, since it makes more sense to treat each month/year sold as a level without regard to order. Hence, we have a total of 48 nominal and ordinal variables.

In [6]:
factor_vars = [:MSSubClass, :MSZoning, :Street, :Alley, :LotShape, 
         :LandContour, :Utilities, :LotConfig, :LandSlope, :Neighborhood,
        :Condition1, :Condition2, :BldgType, :HouseStyle, :OverallQual,
        :OverallCond, :RoofStyle, :RoofMatl, :Exterior1st, :Exterior2nd,
        :MasVnrType, :ExterQual, :ExterCond, :Foundation, :BsmtQual, 
        :BsmtCond, :BsmtExposure, :BsmtFinType1, :BsmtFinType2, 
        :Heating, :HeatingQC, :CentralAir, :Electrical, 
        :KitchenQual, :Functional, :FireplaceQu, :GarageType, :GarageFinish,
        :GarageQual, :GarageCond, :PavedDrive, :PoolQC, :Fence, :MiscFeature, 
        :SaleType, :SaleCondition, :MoSold, :YrSold]
length(factor_vars)
Out[6]:
48

Closer look at the data

According to the data documentation, "NA" of some factor variables (such as Alley, BsmtQual, and GarageType) implies absence. Hence, we convert "NA" to "None" in those variables.

In [7]:
cols_None = [:Alley, :BsmtQual, :BsmtCond, :BsmtExposure, :BsmtFinType1, 
    :BsmtFinType2, :FireplaceQu, :GarageType, :GarageFinish, :GarageQual, 
    :GarageCond, :PoolQC, :Fence, :MiscFeature] 

ames_df2 = deepcopy(ames_df)
for col in eachcol(ames_df2)
    var_name = col[1]
    if var_name in cols_None
         for row in 1:size(ames_df, 1)
             if ames_df2[row, var_name] == "NA"
                ames_df2[row, var_name] = "None"
            end
        end
    end
end
In [8]:
size(ames_df2)
Out[8]:
(2930, 80)

Some factors have very low variance. For example, $99.5\%$ of houses have "Pave" for the Street variable and $99.9\%$ of houses have "AllPub" for Utilities.

Since near-zero variance predictors provide little information, we eliminate the variables in which more than $98\%$ of the values are identical: Street, Utilities, Condition2, RoofMatl, Heating, and PoolQC.
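As a side note, a helper along these lines (a minimal sketch, not part of the original analysis; it assumes the countmap function from StatsBase and the ames_df2 data frame defined above) could flag such near-zero-variance columns automatically:

function near_zero_variance(df, threshold = 0.98)
    flagged = Symbol[]
    n = size(df, 1)
    for name in names(df)
        counts = countmap(df[name])                  # frequency of each value in the column
        if maximum(values(counts)) / n > threshold   # dominant value covers > threshold
            push!(flagged, name)
        end
    end
    return flagged
end

With a threshold of 0.98, near_zero_variance(ames_df2) should recover the six variables examined below.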

In [9]:
vars_low_variance = [:Street, :Utilities, :Condition2, :RoofMatl, :Heating, :PoolQC]

for i in vars_low_variance
    display(i)
    display(sort(freqtable(ames_df2[:, i]), rev=true) / 2930) 
end
:Street
2-element Named Array{Float64,1}
Dim1   │ 
───────┼───────────
"Pave" │   0.995904
"Grvl" │ 0.00409556
:Utilities
3-element Named Array{Float64,1}
Dim1     │ 
─────────┼────────────
"AllPub" │    0.998976
"NoSewr" │ 0.000682594
"NoSeWa" │ 0.000341297
:Condition2
8-element Named Array{Float64,1}
Dim1     │ 
─────────┼────────────
"Norm"   │    0.989761
"Feedr"  │  0.00443686
"Artery" │  0.00170648
"PosA"   │  0.00136519
"PosN"   │  0.00136519
"RRNn"   │ 0.000682594
"RRAe"   │ 0.000341297
"RRAn"   │ 0.000341297
:RoofMatl
8-element Named Array{Float64,1}
Dim1      │ 
──────────┼────────────
"CompShg" │    0.985324
"Tar&Grv" │  0.00784983
"WdShake" │  0.00307167
"WdShngl" │  0.00238908
"ClyTile" │ 0.000341297
"Membran" │ 0.000341297
"Metal"   │ 0.000341297
"Roll"    │ 0.000341297
:Heating
6-element Named Array{Float64,1}
Dim1    │ 
────────┼────────────
"GasA"  │    0.984642
"GasW"  │  0.00921502
"Grav"  │  0.00307167
"Wall"  │  0.00204778
"OthW"  │ 0.000682594
"Floor" │ 0.000341297
:PoolQC
5-element Named Array{Float64,1}
Dim1  │ 
──────┼────────────
None  │    0.995563
"Ex"  │  0.00136519
"Gd"  │  0.00136519
"TA"  │  0.00102389
"Fa"  │ 0.000682594

Here we visualize near-zero variance predictors in bar graphs.

In [10]:
using Plots, StatsBase
gr(size=(300,300), leg=false)

@userplot MyCount
@recipe function f(mc::MyCount)
    # get the array from the args field
    arr = mc.args[1]

    T = typeof(arr)
    if T.parameters[1] == Float64
        seriestype := :histogram
        arr
    else
        seriestype := :bar
        cm = countmap(arr)
        x = sort!(collect(keys(cm)))
        y = [cm[xi] for xi=x]
        x, y
    end
end
In [11]:
for var in vars_low_variance 
    
    if var == :RoofMatl || var == :Condition2 || var == :Heating
       expr = parse("p$var = mycount(convert(Array{String}, ames_df2[:$var]), xlabel=\"$var\", 
            ylabel=\"Frequency\", xrotation=60)")
    else
       expr = parse("p$var = mycount(convert(Array{String}, ames_df2[:$var]), xlabel=\"$var\", 
            ylabel=\"Frequency\")")
    end
    eval(expr)
        
end

plot(pStreet, pUtilities, pPoolQC, pCondition2, pRoofMatl, pHeating, layout=(2,3), size=(800,400))
Out[11]:
[Plot: bar charts of Street, Utilities, PoolQC, Condition2, RoofMatl, and Heating frequencies; each variable is dominated by a single level.]
In [12]:
for var in vars_low_variance 
   delete!(ames_df2, var) 
end

After eliminating those 6 variables, we are left with 73 features and the response variable SalePrice.

In [13]:
size(ames_df2)
Out[13]:
(2930, 74)
In [14]:
head(ames_df2)
Out[14]:
[Table output omitted: head(ames_df2) shows the first 6 rows of the remaining 74 columns after dropping the six low-variance variables.]

Missing values

We find that Electrical has one missing value. We will replace the missing value with its mode "SBrkr".

In [15]:
by(ames_df2, :Electrical, nrow)
Out[15]:
   Electrical     x1
1  SBrkr        2682
2  FuseA         188
3  FuseF          50
4  FuseP           8
5  (empty)         1
6  Mix             1
In [16]:
ames_df3 = deepcopy(ames_df2)
nrows, ncols = size(ames_df3)
for row in 1:nrows
    if isempty(ames_df3[row, :Electrical])
        ames_df3[:Electrical][row] = mode(ames_df3[:Electrical])
    end 
end
In [17]:
by(ames_df3, :Electrical, nrow)
Out[17]:
   Electrical     x1
1  SBrkr        2683
2  FuseA         188
3  FuseF          50
4  FuseP           8
5  Mix             1

Now let's inspect MasVnrType. There are 23 missing values.

In [18]:
by(ames_df3, :MasVnrType, nrow)
Out[18]:
   MasVnrType     x1
1  Stone         249
2  None         1752
3  BrkFace       880
4  (empty)        23
5  BrkCmn         25
6  CBlock          1

Since MasVnrType (masonry veneer type) and MasVnrArea (masonry veneer area in square feet) are related, we check if there's any discrepancy.

If MasVnrType is empty and MasVnrArea is 0, we assign MasVnrType to "None". If MasVnrType is empty but MasVnrArea is greater than 0, we assign MasVnrType to its mode "BrkFace".

In [19]:
for i in 1:nrows
    if isempty(ames_df3[:MasVnrType][i]) && ames_df3[:MasVnrArea][i] != "NA"
        if ames_df3[:MasVnrArea][i] == 0.0
            ames_df3[:MasVnrType][i] = "None"
            println("change1")
        elseif ames_df3[:MasVnrArea][i] > 0.0
            ames_df3[:MasVnrType][i] = "BrkFace"
            println("change2")
        end
    end
end

No observation meets the above criteria.

However, there are still some discrepancies between MasVnrArea and MasVnrType. Some houses have an area equal to 0 but a type other than "None". Other houses have type "None" but an area greater than 0.

In [20]:
for i in 1:nrows
       if ames_df3[:MasVnrArea][i] == 0.0
            if ames_df3[:MasVnrType][i] != "None"
                display("house no. $i: area is 0, but type is $(ames_df3[:MasVnrType][i])")
            end
       elseif ames_df3[:MasVnrArea][i] != "NA" && ames_df3[:MasVnrArea][i] > 0.0
            if ames_df3[:MasVnrType][i] == "None"
                display("house no. $i: area is $(ames_df3[:MasVnrArea][i]), but type is none")
            end
        end
end
"house no. 364: area is 344, but type is none"
"house no. 404: area is 312, but type is none"
"house no. 442: area is 285, but type is none"
"house no. 1641: area is 0, but type is BrkFace"
"house no. 1741: area is 0, but type is BrkFace"
"house no. 1786: area is 0, but type is Stone"
"house no. 1862: area is 1, but type is none"
"house no. 1914: area is 1, but type is none"
"house no. 2004: area is 1, but type is none"
"house no. 2529: area is 288, but type is none"

As a remedy,

  • if MasVnrArea > 1 and MasVnrType = "None", we will replace MasVnrType with its mode "BrkFace".

  • if MasVnrArea = 1 and MasVnrType = "None", we will replace MasVnrArea with 0 since veneer area of 1 square foot does not really make sense.

In [21]:
for i in 1:size(ames_df3, 1)
    if ames_df3[:MasVnrArea][i] == 1 && ames_df3[:MasVnrType][i] == "None"
        ames_df3[:MasVnrArea][i] = 0
    elseif ames_df3[:MasVnrType][i] == "None" && ames_df3[:MasVnrArea][i] != "NA" && ames_df3[:MasVnrArea][i] > 1 
        ames_df3[:MasVnrType][i] = "BrkFace"       
    end
end

What do we do with houses whose MasVnrArea is 0 but whose MasVnrType is other than "None"? Let's find them first.

In [22]:
for i in 1:nrows
       if ames_df3[:MasVnrArea][i] == 0.0
            if ames_df3[:MasVnrType][i] != "None"
                display("house no. $i: area is 0, but type is $(ames_df3[:MasVnrType][i])")
            end
       elseif ames_df3[:MasVnrArea][i] != "NA" && ames_df3[:MasVnrArea][i] > 0.0
            if ames_df3[:MasVnrType][i] == "None"
                display("house no. $i: area is $(ames_df3[:MasVnrArea][i]), but type is none")
            end
        end
    end
"house no. 1641: area is 0, but type is BrkFace"
"house no. 1741: area is 0, but type is BrkFace"
"house no. 1786: area is 0, but type is Stone"

We will replace zero area with the median of each type. The median MasVnrArea is 203.5 and 200.0 for "BrkFace" and "Stone", respectively.

In [23]:
[median(ames_df3[ames_df3[:MasVnrType] .== "BrkFace", :MasVnrArea]) 
    median(ames_df3[ames_df3[:MasVnrType] .== "Stone", :MasVnrArea])]
Out[23]:
2-element Array{Float64,1}:
 203.5
 200.0
In [24]:
ames_df3[[1641, 1741], :MasVnrArea] = 203.5
ames_df3[1786, :MasVnrArea] = 200.0;

Let's inspect MasVnrType again. There are still 23 missing values.

In [25]:
by(ames_df3, :MasVnrType, nrow)
Out[25]:
   MasVnrType     x1
1  Stone         249
2  None         1748
3  BrkFace       884
4  (empty)        23
5  BrkCmn         25
6  CBlock          1

We replace the missing MasVnrType values with the mode "None" and set the corresponding MasVnrArea to 0 as well.

In [26]:
for i in 1:nrows
    if isempty(ames_df3[:MasVnrType][i])
        ames_df3[:MasVnrType][i] = "None"
        ames_df3[:MasVnrArea][i] = 0
    end
end
In [27]:
by(ames_df3, :MasVnrType, nrow)
Out[27]:
   MasVnrType     x1
1  Stone         249
2  None         1771
3  BrkFace       884
4  BrkCmn         25
5  CBlock          1
In [28]:
size(ames_df3)
Out[28]:
(2930, 74)

Now, let's deal with the rest of the missing values: "NA" or "" values in continuous variables and "" values in factor variables.

In [29]:
ames_df4 = deepcopy(ames_df3)
nrows, ncols = size(ames_df4)
Out[29]:
(2930, 74)
In [30]:
vars_with_missing = String[]
for j in 1:ncols
    for i in 1:nrows
        if (isempty(ames_df4[i, j]) || ames_df4[i, j] == "NA")
            push!(vars_with_missing, names(ames_df4)[j])
        end
    end
end
unique(vars_with_missing)
Out[30]:
18-element Array{String,1}:
 "LotFrontage" 
 "BsmtQual"    
 "BsmtCond"    
 "BsmtExposure"
 "BsmtFinType1"
 "BsmtFinSF1"  
 "BsmtFinType2"
 "BsmtFinSF2"  
 "BsmtUnfSF"   
 "TotalBsmtSF" 
 "BsmtFullBath"
 "BsmtHalfBath"
 "GarageYrBlt" 
 "GarageFinish"
 "GarageCars"  
 "GarageArea"  
 "GarageQual"  
 "GarageCond"  

18 variables have at least one missing value. Note that approximately $17\%$ of LotFrontage is missing.

In [31]:
[sort(freqtable(vars_with_missing), rev=true) sort(freqtable(vars_with_missing) / 2930, rev=true)]
Out[31]:
18×2 Named Array{Float64,2}
 Dim1 ╲ hcat │           1            2
─────────────┼─────────────────────────
LotFrontage  │       490.0     0.167235
GarageYrBlt  │       159.0    0.0542662
BsmtExposure │         4.0   0.00136519
BsmtFinType2 │         2.0  0.000682594
BsmtFullBath │         2.0  0.000682594
BsmtHalfBath │         2.0  0.000682594
GarageFinish │         2.0  0.000682594
BsmtCond     │         1.0  0.000341297
BsmtFinSF1   │         1.0  0.000341297
BsmtFinSF2   │         1.0  0.000341297
BsmtFinType1 │         1.0  0.000341297
BsmtQual     │         1.0  0.000341297
BsmtUnfSF    │         1.0  0.000341297
GarageArea   │         1.0  0.000341297
GarageCars   │         1.0  0.000341297
GarageCond   │         1.0  0.000341297
GarageQual   │         1.0  0.000341297
TotalBsmtSF  │         1.0  0.000341297

While looking at the GarageYrBlt column, we encountered an odd value: for house no. 2261, GarageYrBlt was recorded as 2207. We assume 2007 was mistakenly entered as 2207; the value 2007 makes sense since the YearBuilt of the house is 2006.

In [32]:
maximum(ames_df4[find(ames_df4[:GarageYrBlt] .!= "NA"), :GarageYrBlt])
Out[32]:
2207
In [33]:
ames_df4[2259:2265, [:YearBuilt, :GarageType, :GarageYrBlt, :GarageFinish]]
Out[33]:
   YearBuilt  GarageType  GarageYrBlt  GarageFinish
1  1958       Attchd      1958         RFn
2  2006       Attchd      2007         Fin
3  2006       Attchd      2207         RFn
4  2006       Attchd      2006         RFn
5  2007       Attchd      2007         Fin
6  2006       Attchd      2007         RFn
7  1985       Attchd      1985         Unf

Replace the value 2207 with 2007.

In [34]:
find(ames_df4[:GarageYrBlt] .== 2207)
ames_df4[2261, :GarageYrBlt] = 2007
Out[34]:
2007

Now let's try to impute GarageYrBlt values by looking at garage-related variables. We may gain some insights from fields such as GarageType, GarageFinish, GarageQual, GarageArea or GarageCond.

If GarageYrBlt="NA" given

  • GarageType = "None" or
  • GarageFinish = "None" or
  • GarageQual = "None" or
  • GarageArea = 0 or
  • GarageCond="None",

we will replace GarageYrBlt with 0.

In [35]:
for i in 1:nrows
    if ames_df4[:GarageYrBlt][i] == "NA"
        if (ames_df4[:GarageQual][i] == "None" || 
            ames_df4[:GarageType][i] == "None" || 
            ames_df4[:GarageFinish][i] == "None" ||
            ames_df4[:GarageArea][i] == 0.0 ||
            ames_df4[:GarageCond][i] == "None")
            ames_df4[:GarageYrBlt][i] = 0.0
        end
    end
end

We still have one "NA" value for GarageYrBlt. Let's look at the house with "NA" in GarageYrBlt field.

In [36]:
find(ames_df4[:GarageYrBlt] .== "NA")
Out[36]:
1-element Array{Int64,1}:
 2237
In [37]:
ames_df4[2237, :]
Out[37]:
[Table output omitted: row 2237 of the data frame. GarageType is "Detchd", while GarageYrBlt, GarageCars, and GarageArea are "NA" and GarageFinish, GarageQual, and GarageCond are empty.]

While GarageType is "Detchd", all other garage-related information is missing. In particular, GarageArea="NA" and GarageCars="NA" suggest that there is no garage. Hence, we assume there is no garage for this house.

We will make following changes for house no. 2237:

  • GarageType = "None"
  • GarageYrBlt = 0
  • GarageFinish = "None"
  • GarageCars = 0
  • GarageArea = 0
  • GarageQual = "None"
  • GarageCond = "None"
In [38]:
ames_df4[2237, :GarageType] = "None"
ames_df4[2237, :GarageYrBlt] = 0
ames_df4[2237, :GarageFinish] = "None"
ames_df4[2237, :GarageCars] = 0
ames_df4[2237, :GarageArea] = 0
ames_df4[2237, :GarageQual] = "None"
ames_df4[2237, :GarageCond] = "None";

House no. 1357 has no value for GarageFinish. Let's inspect the other garage-related variables. Although GarageQual and GarageCond suggest the absence of a garage, we see non-zero values in GarageCars and GarageArea. Hence, we assume a garage exists for house no. 1357 and impute values for GarageYrBlt, GarageFinish, GarageQual, and GarageCond.

In [39]:
display("House no. $(find(ames_df4[:GarageFinish] .== ""))")
ames_df4[find(ames_df4[:GarageFinish] .== ""), 
    [:YearBuilt, :GarageType, :GarageYrBlt, :GarageFinish, :GarageCars, :GarageArea, :GarageQual, :GarageCond]]
"House no. [1357]"
Out[39]:
   YearBuilt  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond
1  1910       Detchd      0.0          (empty)       1           360         None        None

Since the correlation between YearBuilt and GarageYrBlt is fairly high (0.84), we fill in the GarageYrBlt value with the YearBuilt value. The missing GarageFinish, GarageQual, and GarageCond values are replaced with the mode of each feature.

In [40]:
tmp = find(ames_df4[:GarageYrBlt] .> 0)
cor(ames_df4[tmp, :YearBuilt], ames_df4[tmp, :GarageYrBlt])
Out[40]:
0.8438093399467884
In [41]:
ames_df4[1357, :GarageYrBlt] = ames_df4[1357, :YearBuilt]
ames_df4[1357, :GarageFinish] = mode(ames_df4[:GarageFinish]) # mode in the neighborhood "OldTown" 
ames_df4[1357, :GarageQual] = mode(ames_df4[:GarageQual])
ames_df4[1357, :GarageCond] = mode(ames_df4[:GarageCond])
Out[41]:
"TA"
In [42]:
ames_df4[1357, [:YearBuilt, :GarageType, :GarageYrBlt, :GarageFinish, :GarageCars, :GarageArea, :GarageQual, :GarageCond]]
Out[42]:
   YearBuilt  GarageType  GarageYrBlt  GarageFinish  GarageCars  GarageArea  GarageQual  GarageCond
1  1910       Detchd      1910         Unf           1           360         TA          TA

Now that we have taken care of all the garage-related variables, we move on to the basement-related variables.

In [43]:
vars_with_missing = String[]
for j in 1:ncols
    for i in 1:nrows
        if (isempty(ames_df4[i, j]) || ames_df4[i, j] == "NA")
            push!(vars_with_missing, names(ames_df4)[j])
        end
    end
end

freqtable(vars_with_missing)
Out[43]:
12-element Named Array{Int64,1}
Dim1         │ 
─────────────┼────
BsmtCond     │   1
BsmtExposure │   4
BsmtFinSF1   │   1
BsmtFinSF2   │   1
BsmtFinType1 │   1
BsmtFinType2 │   2
BsmtFullBath │   2
BsmtHalfBath │   2
BsmtQual     │   1
BsmtUnfSF    │   1
LotFrontage  │ 490
TotalBsmtSF  │   1

Let's explore BsmtExposure variable.

In [44]:
vars_bsmt = [:BsmtCond, :BsmtExposure, :BsmtFinSF1, :BsmtFinSF2, 
        :BsmtFinType1, :BsmtFinType2, :BsmtFullBath, 
        :BsmtHalfBath, :BsmtQual, :BsmtUnfSF, :TotalBsmtSF]
missing_bsmt = sort(unique([find(isempty.(ames_df4[:, :BsmtExposure])); 
            find(ames_df4[:, :BsmtFullBath] .== "NA")]))
Out[44]:
5-element Array{Int64,1}:
   67
 1342
 1498
 1797
 2780
In [45]:
ames_df4[missing_bsmt, vars_bsmt]
Out[45]:
   BsmtCond  BsmtExposure  BsmtFinSF1  BsmtFinSF2  BsmtFinType1  BsmtFinType2  BsmtFullBath  BsmtHalfBath  BsmtQual  BsmtUnfSF  TotalBsmtSF
1  TA        (empty)       0           0           Unf           Unf           0             0             Gd        1595       1595
2  (empty)   (empty)       NA          NA          (empty)       (empty)       NA            NA            (empty)   NA         NA
3  None      None          0           0           None          None          NA            NA            None      0          0
4  TA        (empty)       0           0           Unf           Unf           0             0             Gd        725        725
5  TA        (empty)       0           0           Unf           Unf           0             0             Gd        936        936

Based on what we see, obs 1342 (the second row in the table above) has no basement information, so we assume it has no basement. For obs 1498, we change the "NA" values in BsmtFullBath and BsmtHalfBath to 0.

In [46]:
ames_df4[1342, [:BsmtQual, :BsmtCond, :BsmtExposure, :BsmtFinType1, :BsmtFinType2]] = "None"
ames_df4[1342, [:BsmtFinSF1, :BsmtFinSF2, :BsmtUnfSF, :TotalBsmtSF, :BsmtFullBath, :BsmtHalfBath]] = 0
ames_df4[1498, [:BsmtFullBath, :BsmtHalfBath]] = 0;

Since the mode of BsmtExposure is "No" (no exposure), we will replace the 3 missing entries of BsmtExposure with "No".

In [47]:
by(ames_df4, :BsmtExposure, nrow)
Out[47]:
   BsmtExposure     x1
1  Gd              284
2  No             1906
3  Mn              239
4  Av              418
5  (empty)           3
6  None             80
In [48]:
ames_df4[[67, 1797, 2780], :BsmtExposure] = "No"
Out[48]:
"No"
In [49]:
ames_df4[missing_bsmt, vars_bsmt]
Out[49]:
   BsmtCond  BsmtExposure  BsmtFinSF1  BsmtFinSF2  BsmtFinType1  BsmtFinType2  BsmtFullBath  BsmtHalfBath  BsmtQual  BsmtUnfSF  TotalBsmtSF
1  TA        No            0           0           Unf           Unf           0             0             Gd        1595       1595
2  None      None          0           0           None          None          0             0             None      0          0
3  None      None          0           0           None          None          0             0             None      0          0
4  TA        No            0           0           Unf           Unf           0             0             Gd        725        725
5  TA        No            0           0           Unf           Unf           0             0             Gd        936        936

For house no. 445, BsmtFinType2 is missing, so we will fill the entry with its mode.

In [50]:
ames_df4[445, :BsmtFinType2] = mode(ames_df4[:BsmtFinType2])
ames_df4[445, vars_bsmt]
Out[50]:
   BsmtCond  BsmtExposure  BsmtFinSF1  BsmtFinSF2  BsmtFinType1  BsmtFinType2  BsmtFullBath  BsmtHalfBath  BsmtQual  BsmtUnfSF  TotalBsmtSF
1  TA        No            1124        479         GLQ           Unf           1             0             Gd        1603       3206

Now we are left with 490 missing values in LotFrontage.

In [51]:
vars_with_missing = String[]
for j in 1:ncols
    for i in 1:nrows
        if (isempty(ames_df4[i, j]) || ames_df4[i, j] == "NA")
            push!(vars_with_missing, names(ames_df4)[j])
        end
    end
end

freqtable(vars_with_missing)
Out[51]:
1-element Named Array{Int64,1}
Dim1        │ 
────────────┼────
LotFrontage │ 490

For LotFrontage, we replace each missing value with the median LotFrontage of its neighborhood, since it is unlikely for a lot not to be connected to a street.

We have a problem with the neighborhoods "GrnHill" and "Landmrk". Only two houses are in "GrnHill" and only one house is in "Landmrk", and all three of these observations have "NA" as their LotFrontage. Since LotFrontage and LotArea are related, we will use the median LotFrontage of the neighborhood whose median LotArea is closest to the median LotArea of "GrnHill" or "Landmrk", respectively.

In [52]:
ames_df4[ames_df4[:Neighborhood] .== "GrnHill", :LotFrontage]
Out[52]:
2-element Array{Any,1}:
 "NA"
 "NA"
In [53]:
ames_df4[ames_df4[:Neighborhood] .== "Landmrk", :LotFrontage]
Out[53]:
1-element Array{Any,1}:
 "NA"
In [54]:
median(ames_df4[ames_df4[:Neighborhood] .== "GrnHill", :LotArea])
Out[54]:
9001.0
In [55]:
median(ames_df4[ames_df4[:Neighborhood] .== "Landmrk", :LotArea])
Out[55]:
3612.0

The median LotArea of the neighborhood "Edwards" is closest to the median LotArea of "GrnHill". The median LotArea of the neighborhood "Greens" is closest to the median LotArea of "Landmrk".
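The commented-out lines in the next cell compute these distances; as a minimal sketch (reusing the medians 9001 and 3612 computed above and the by construct already used in this notebook), the closest neighborhood can also be picked programmatically:

med = by(ames_df4, :Neighborhood, df -> median(df[:LotArea]))
for (nbhd, target) in [("GrnHill", 9001.0), ("Landmrk", 3612.0)]
    best, bestdist = "", Inf
    for i in 1:size(med, 1)
        d = abs(med[i, :x1] - target)                     # distance to the target median LotArea
        if med[i, :Neighborhood] != nbhd && d < bestdist  # skip the neighborhood itself
            best, bestdist = med[i, :Neighborhood], d
        end
    end
    println(nbhd, " -> ", best)   # expected: GrnHill -> Edwards, Landmrk -> Greens
end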

In [56]:
## For GrnHill
# by(ames_df4, :Neighborhood, df -> abs.(median(df[:LotArea]) .- 9001))
## For Landmrk
# by(ames_df4, :Neighborhood, df -> abs.(median(df[:LotArea]) .- 3612))

by(ames_df4, :Neighborhood, df -> median(df[:LotArea]))
Out[56]:
    Neighborhood       x1
1   NAmes          9500.0
2   Gilbert        9729.0
3   StoneBr        8118.0
4   NWAmes        10928.0
5   Somerst        8282.5
6   BrDale         1680.0
7   NPkVill        2308.0
8   NridgHt       12092.0
9   Blmngtn        3189.0
10  NoRidge       11400.0
11  SawyerW        9370.0
12  Sawyer         9488.0
13  Greens         3857.0
14  BrkSide        6167.5
15  OldTown        7800.0
16  IDOTRR         8250.0
17  ClearCr       15588.5
18  SWISU          7141.0
19  Edwards        9345.0
20  CollgCr        9675.0
21  Crawfor       11275.0
22  Blueste        1830.5
23  Mitchel       10148.5
24  Timber        11848.0
25  MeadowV        1890.0
26  Veenker       14512.0
27  GrnHill        9001.0
28  Landmrk        3612.0

Let's create a dictionary that maps each neighborhood to its median lot frontage.

In [57]:
# median lot frontage by neighborhood
median_lot = by(ames_df4[ames_df4[:LotFrontage] .!= "NA", :], 
    [:Neighborhood], df -> median(df[:LotFrontage]))
dict_median_lot = Dict(median_lot[i, 1] => median_lot[i, 2] for i = 1:size(median_lot, 1))
dict_median_lot["GrnHill"] = dict_median_lot["Edwards"]
dict_median_lot["Landmrk"] = dict_median_lot["Greens"]
dict_median_lot
Out[57]:
Dict{SubString{String},Float64} with 28 entries:
  "ClearCr" => 80.5
  "Blmngtn" => 43.0
  "Gilbert" => 64.0
  "Greens"  => 40.0
  "OldTown" => 60.0
  "CollgCr" => 70.0
  "GrnHill" => 65.0
  "BrDale"  => 21.0
  "IDOTRR"  => 60.0
  "Blueste" => 24.0
  "Mitchel" => 74.0
  "MeadowV" => 21.0
  "NPkVill" => 24.0
  "Edwards" => 65.0
  "StoneBr" => 60.0
  "Veenker" => 80.0
  "NWAmes"  => 80.0
  "SawyerW" => 67.0
  "Landmrk" => 40.0
  "BrkSide" => 51.0
  "Sawyer"  => 72.0
  "NAmes"   => 73.0
  "Timber"  => 82.0
  "SWISU"   => 60.0
  "NridgHt" => 92.0
  ⋮         => ⋮

Here we replace the missing LotFrontage values ("NA" values) using the dictionary defined above.

In [58]:
for i in 1:size(ames_df4, 1)
    if ames_df4[i, :LotFrontage] == "NA"
        ames_df4[i, :LotFrontage] = dict_median_lot[ames_df4[i, :Neighborhood]]
    end
end

One last check to see if there is any missing or "NA" value in the data.

In [59]:
vars_with_missing = String[]
for j in 1:ncols
    for i in 1:nrows
        if (isempty(ames_df4[i, j]) || ames_df4[i, j] == "NA")
            push!(vars_with_missing, names(ames_df4)[j])
        end
    end
end

freqtable(vars_with_missing)
Out[59]:
0-element Named Array{Int64,1}

Finally, no missing or "NA" values remain. Moving on to outliers.

Outliers

In [60]:
ames_df5 = deepcopy(ames_df4)
size(ames_df5)
Out[60]:
(2930, 74)

As suggested by De Cock, we will remove the 5 houses with more than 4000 square feet of living area from the data set.

Potential Pitfalls (Outliers): Although all known errors were corrected in the data, no observations have been removed due to unusual values and all final residential sales from the initial data set are included in the data presented with this article. There are five observations that an instructor may wish to remove from the data set before giving it to students (a plot of SALE PRICE versus GR LIV AREA will quickly indicate these points). Three of them are true outliers (Partial Sales that likely don’t represent actual market values) and two of them are simply unusual sales (very large houses priced relatively appropriately). I would recommend removing any houses with more than 4000 square feet from the data set (which eliminates these five unusual observations) before assigning it to students.

In [61]:
scatter(ames_df5[:GrLivArea], ames_df5[:SalePrice], 
    label="", xaxis="GrLivArea", yaxis="SalePrice")
Out[61]:
[Plot: scatter of GrLivArea vs. SalePrice.]
In [61]:
deleterows!(ames_df5, find(ames_df5[:GrLivArea] .> 4000))
size(ames_df5)
Out[61]:
(2925, 74)

Transformations

Let's see how the sale prices are distributed. Clearly, the distribution is positively skewed.

In [64]:
histogram(ames_df5[:SalePrice], label="", title="Sale prices of houses in AMES dataset", 
    xaxis="sale price", yaxis="frequency", size=(600,400))
Out[64]:
[Plot: histogram of sale prices of houses in the Ames dataset.]

When plotted against SalePrice, some variables show cone shapes: LotFrontage, LotArea, BsmtUnfSF, TotalBsmtSF, GrLivArea, GarageArea, 1stFlrSF, 2ndFlrSF, MasVnrArea, and OverallQual.

We will log-transform these variables ("log1p" transformation since some variables contain 0 values).
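Concretely, log1p applies $x \mapsto \log(1 + x)$, which is finite at $x = 0$ and close to $\log x$ for large $x$.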

In [62]:
vars = [:LotFrontage, :LotArea, :BsmtUnfSF, :TotalBsmtSF, :GrLivArea, 
    :GarageArea, :X1stFlrSF, :X2ndFlrSF, :MasVnrArea, :OverallQual]
for i in eachindex(vars)
    expr1 = parse("p$i = scatter(ames_df5[:$(vars[i])], ames_df5[:SalePrice], 
        xlabel=\"$(vars[i])\", ylabel=\"Sale Price\", label=\"\")") 
    eval(expr1)
    expr2 = parse("plog$i = scatter(log1p.(ames_df5[:$(vars[i])]), log1p.(ames_df5[:SalePrice]), 
         xlabel=\"log1p($(vars[i]))\", ylabel=\"log1p(Sale Price)\", label=\"\")") 
    eval(expr2)
end
In [183]:
plot(p1, plog1, size=(600,400))
Out[183]:
[Plot: LotFrontage vs. Sale Price and log1p(LotFrontage) vs. log1p(Sale Price).]
In [67]:
plot(p2, plog2, size=(600,400))
Out[67]:
[Plot: LotArea vs. Sale Price and log1p(LotArea) vs. log1p(Sale Price).]
In [68]:
plot(p3, plog3, size=(600,400))
Out[68]:
[Plot: BsmtUnfSF vs. Sale Price and log1p(BsmtUnfSF) vs. log1p(Sale Price).]
In [69]:
plot(p4, plog4, size=(600,400))
Out[69]:
[Plot: TotalBsmtSF vs. Sale Price and log1p(TotalBsmtSF) vs. log1p(Sale Price).]
In [70]:
plot(p5, plog5, size=(600,400))
Out[70]:
[Plot: GrLivArea vs. Sale Price and log1p(GrLivArea) vs. log1p(Sale Price).]
In [71]:
plot(p6, plog6, size=(600,400))
Out[71]:
[Plot: GarageArea vs. Sale Price and log1p(GarageArea) vs. log1p(Sale Price).]
In [72]:
plot(p7, plog7, size=(600,400))
Out[72]:
[Plot: X1stFlrSF vs. Sale Price and log1p(X1stFlrSF) vs. log1p(Sale Price).]
In [73]:
plot(p8, plog8, size=(600,400))
Out[73]:
[Plot: X2ndFlrSF vs. Sale Price and log1p(X2ndFlrSF) vs. log1p(Sale Price).]
In [74]:
plot(p9, plog9, size=(600,400))
Out[74]:
[Plot: MasVnrArea vs. Sale Price and log1p(MasVnrArea) vs. log1p(Sale Price).]
In [75]:
plot(p10, plog10, size=(600,400))
Out[75]:
[Plot: OverallQual vs. Sale Price and log1p(OverallQual) vs. log1p(Sale Price).]

Since BsmtFinSF1 shows a non-linear relationship with SalePrice, we add a quadratic term.

In [76]:
p10 = scatter(ames_df5[:BsmtFinSF1], ames_df5[:SalePrice], 
        xlabel="BsmtFinSF1", ylabel="Sale Price", size=(600,400))
psq = scatter((ames_df5[:BsmtFinSF1]).^2, log1p.(ames_df5[:SalePrice]), 
        xlabel="BsmtFinSF1^2", ylabel="log1p(Sale Price)")
plot(p10, psq, size=(600,400))
Out[76]:
[Plot: BsmtFinSF1 vs. Sale Price and BsmtFinSF1^2 vs. log1p(Sale Price).]
In [63]:
ames_df6 = deepcopy(ames_df5)
vars_skewed = [:LotFrontage, :LotArea, :BsmtUnfSF, :TotalBsmtSF, :GrLivArea, 
    :GarageArea, :X1stFlrSF, :X2ndFlrSF, :MasVnrArea, :OverallQual, :SalePrice]
vars_log = [:logLotFrontage, :logLotArea, :logBsmtUnfSF, :logTotalBsmtSF, :logGrLivArea, 
    :logGarageArea, :logX1stFlrSF, :X2ndFlrSF, :logMasVnrArea, :logOverallQual, :logSalePrice]

for var in vars_skewed
    ames_df6[Symbol(var)] = map(x -> log1p(x), ames_df6[Symbol(var)])
end 
rename!(ames_df6, [f => t for (f, t) = zip(vars_skewed, vars_log)])

ames_df6[:BsmtFinSF1sq] = map(x -> x^2, ames_df6[:BsmtFinSF1])
head(ames_df6)
Out[63]:
[Table output omitted: head(ames_df6) shows the first 6 rows, with the log1p-transformed columns (e.g., logLotFrontage, logGrLivArea, logSalePrice) and the new BsmtFinSF1sq column in place of the originals.]

Encode categorical variables using dummy variables

Let's create a design matrix. First, we remove the response variable.

In [64]:
ames_df7 = deepcopy(ames_df6)
logsaleprice = ames_df7[:, :logSalePrice]
delete!(ames_df7, :logSalePrice)
size(ames_df7)
Out[64]:
(2925, 74)

We encode factor variables as dummy variables and standardize all other variables.
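As a toy illustration of the encoding performed in the next cell (lotshape below is a hypothetical 4-observation factor, not a column of the data), each level of a factor gets its own indicator column and no reference level is dropped:

lotshape = ["Reg", "IR1", "Reg", "IR2"]
vals = sort(unique(lotshape))                          # ["IR1", "IR2", "Reg"]
namedict = Dict(vals[i] => i for i in eachindex(vals))
arr = zeros(length(lotshape), length(vals))
for k in eachindex(lotshape)
    arr[k, namedict[lotshape[k]]] = 1.0                # indicator for the observed level
end
arr                                                    # 4×3 matrix of dummy columns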

In [65]:
nrows, ncols = size(ames_df7)
X = zeros(nrows, 0)
levels = zeros(Int, ncols)
level_iter = 1
beta_names = String[]

for col in eachcol(ames_df7)
    var_name = col[1]
    darr = col[2]
    if var_name in factor_vars
        vals = sort(unique(darr))
        levels[level_iter] = length(vals)
        namedict = Dict(vals[i] => i for i = 1:length(vals))
        for idx in eachindex(vals)
            push!(beta_names, string(var_name)*string("_")*string(vals[idx]))
        end
        arr = zeros(length(darr), length(namedict))
        for k in eachindex(darr)
            if haskey(namedict, darr[k])
               arr[k, namedict[darr[k]]] = 1
            end 
        end
    else
        arr = zscore(convert(Array{Float64}, darr))
        push!(beta_names, string(var_name))
    end
    X = hcat(X, arr)
    level_iter += 1
end
X
Out[65]:
2925×324 Array{Float64,2}:
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  1.0  0.0   0.0420837 
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0  -0.280228  
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0   0.797406  
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0   1.27807   
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  1.0  0.0   0.412178  
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  …  0.0  0.0  1.0  0.0  -0.0360983 
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0  -0.00706419
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0  -0.535384  
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0   1.71765   
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  1.0  0.0  -0.653156  
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0  …  0.0  0.0  1.0  0.0  -0.653156  
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0   0.835369  
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  1.0  0.0  -0.653156  
 ⋮                        ⋮              ⋱       ⋮                         
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0  -0.653156  
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0  -0.653156  
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  -0.545029  
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0   0.838555  
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  1.0  0.0  0.0   3.73846   
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0   0.838555  
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0   1.89776   
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  1.0  0.0   0.488935  
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0  -0.498892  
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0  -0.459785  
 1.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  1.0  0.0   1.29989   
 0.0  0.0  0.0  0.0  0.0  1.0  0.0  0.0     0.0  0.0  1.0  0.0   0.325142  

Get equality constraints

Now we create a matrix for the equality constraint. Here we impose a sum-to-zero constraint within each factor, which alleviates the need to choose a reference level.
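For a factor with $L_k$ levels and dummy coefficients $\beta_{k1}, \dots, \beta_{kL_k}$, the constraint is $$\sum_{j=1}^{L_k} \beta_{kj} = 0, \qquad k = 1, \dots, 48,$$ so Aeq has one row per factor, with ones in the columns corresponding to that factor's dummy variables, and beq is the zero vector.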

In [66]:
nrowsX, ncolsX = size(X)
Aeq = zeros(eltype(X), length(factor_vars), ncolsX)
beq = zeros(eltype(X), length(factor_vars))
prev = 0
row_Aeq = 1
for i in eachindex(levels)
   if levels[i] > 0
       for j in 1:levels[i] 
            Aeq[row_Aeq, prev + j] = one(eltype(X))
       end 
       prev += levels[i]
       row_Aeq += 1
    else 
       prev += 1
    end 
end

Aeq
Out[66]:
48×324 Array{Float64,2}:
 1.0  1.0  1.0  1.0  1.0  1.0  1.0  1.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 ⋮                        ⋮              ⋱                 ⋮                 
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  1.0  1.0  1.0  1.0  1.0  1.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0  …  0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0
 0.0  0.0  0.0  0.0  0.0  0.0  0.0  0.0     0.0  0.0  0.0  0.0  0.0  0.0  0.0

We also standardize the log-transformed response vector.

In [67]:
y = zscore(logsaleprice)
Out[67]:
2925-element Array{Float64,1}:
  0.636705 
 -1.12844  
  0.0871112
  0.948344 
  0.330952 
  0.402532 
  0.619461 
  0.351617 
  0.87145  
  0.319251 
  0.142334 
  0.266566 
  0.20455  
  ⋮        
 -1.81364  
 -1.49446  
 -2.09214  
 -0.235234 
  0.306185 
 -0.0910116
 -0.583545 
 -0.3763   
 -0.583545 
 -0.564815 
  0.0583043
  0.306185 

Path algorithm

Now, let's run the path algorithm to obtain the solution path.

In [69]:
betapath, rhopath, objpath, _, _, dfpath,  = lsq_classopath(X, y; Aeq = Aeq, beq = beq)
WARNING: Adding a small ridge penalty (default is 1e-4) since X is rank deficient
WARNING: ρridge must be positive, switching to default value (1e-4)
WARNING: Problem status Suboptimal; solution may be inaccurate.
Out[69]:
([0.0 0.0 … 0.222799 0.221748; 0.0 0.0 … 0.173881 0.172829; … ; 0.0 0.0 … -0.0461544 -0.0461528; 0.0 0.0 … 0.0345262 0.0345262], [2385.41, 1759.5, 1294.33, 1140.49, 978.208, 743.805, 715.701, 556.868, 504.487, 420.266  …  0.00316321, 0.00253488, 0.000890262, 0.000398577, 0.000310671, 0.000144612, 0.00013702, 0.00013636, 0.00010229, 0.0], [1462.0, 1395.01, 1248.63, 1178.97, 1092.28, 940.081, 919.561, 792.832, 746.944, 668.266  …  80.9425, 80.9132, 80.8354, 80.8119, 80.8077, 80.7997, 80.7993, 80.7993, 80.7976, 80.7973], [-0.448505 -0.448505 … 0.0 0.0; 0.23278 0.23278 … 0.0 0.0; … ; 0.0 0.0 … 0.0 0.0; 0.0 0.0 … 0.0 0.0], Array{Float64}(0,1620), [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  274.0, 275.0, 276.0, 277.0, 276.0, 278.0, 279.0, 280.0, 279.0, 278.0], [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0  …  Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf, Inf])

Here is the path of coefficients. Column i has coefficient estimates at $\rho=$rhopath[i].

In [70]:
betapath
Out[70]:
324×504 Array{Float64,2}:
 0.0  0.0  0.0  0.0  0.0  0.0  0.0         …   0.222799     0.221748  
 0.0  0.0  0.0  0.0  0.0  0.0  0.0             0.173881     0.172829  
 0.0  0.0  0.0  0.0  0.0  0.0  0.0             0.243122     0.242073  
 0.0  0.0  0.0  0.0  0.0  0.0  0.0             0.334988     0.33394   
 0.0  0.0  0.0  0.0  0.0  0.0  0.0             0.272055     0.271005  
 0.0  0.0  0.0  0.0  0.0  0.0  0.0         …   0.128966     0.127915  
 0.0  0.0  0.0  0.0  0.0  0.0  0.0             0.280608     0.279555  
 0.0  0.0  0.0  0.0  0.0  0.0  0.0             0.303391     0.30234   
 0.0  0.0  0.0  0.0  0.0  0.0  0.0             0.0645725    0.0635211 
 0.0  0.0  0.0  0.0  0.0  0.0  0.0             0.195319     0.194266  
 0.0  0.0  0.0  0.0  0.0  0.0  0.0         …  -2.57985     -2.56409   
 0.0  0.0  0.0  0.0  0.0  0.0  0.0             0.192231     0.191185  
 0.0  0.0  0.0  0.0  0.0  0.0  0.0            -0.13708     -0.138153  
 ⋮                        ⋮                ⋱                          
 0.0  0.0  0.0  0.0  0.0  0.0  0.0             0.0517739    0.0517746 
 0.0  0.0  0.0  0.0  0.0  0.0  0.0             0.0957349    0.0957392 
 0.0  0.0  0.0  0.0  0.0  0.0  0.0             0.00968892   0.00968882
 0.0  0.0  0.0  0.0  0.0  0.0  0.0         …  -0.145747    -0.145772  
 0.0  0.0  0.0  0.0  0.0  0.0  0.0            -0.0444471   -0.0444424 
 0.0  0.0  0.0  0.0  0.0  0.0  0.0            -0.174146    -0.174144  
 0.0  0.0  0.0  0.0  0.0  0.0  0.0             0.289564     0.289562  
 0.0  0.0  0.0  0.0  0.0  0.0  0.0            -0.00712951  -0.00713198
 0.0  0.0  0.0  0.0  0.0  0.0  0.0         …  -0.0910395   -0.0910384 
 0.0  0.0  0.0  0.0  0.0  0.0  0.0             0.0289046    0.0289056 
 0.0  0.0  0.0  0.0  0.0  0.0  0.0            -0.0461544   -0.0461528 
 0.0  0.0  0.0  0.0  0.0  0.0  0.00552364      0.0345262    0.0345262 
In [71]:
rhopath
Out[71]:
504-element Array{Float64,1}:
 2385.41       
 1759.5        
 1294.33       
 1140.49       
  978.208      
  743.805      
  715.701      
  556.868      
  504.487      
  420.266      
  241.298      
  192.725      
  167.438      
    ⋮          
    0.00371971 
    0.00317288 
    0.00316321 
    0.00253488 
    0.000890262
    0.000398577
    0.000310671
    0.000144612
    0.00013702 
    0.00013636 
    0.00010229 
    0.0        

In order to find the optimal tuning parameter $\rho$, we calculate the BIC at each $\rho$ using the degrees of freedom and objective values obtained from lsq_classopath.
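In formula form, with $n$ denoting the number of observations, $$\mathrm{BIC}(\rho) = \log(n) \, \mathrm{df}(\rho) + 2 \, \mathrm{obj}(\rho),$$ where $\mathrm{df}(\rho)$ and $\mathrm{obj}(\rho)$ are the degrees of freedom and the objective value returned by lsq_classopath at each $\rho$; this is exactly what the next cell computes.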

In [72]:
BICpath = log(nrowsX) * dfpath + 2 * objpath
Out[72]:
504-element Array{Float64,1}:
 2924.0  
 2790.02 
 2497.25 
 2357.94 
 2184.55 
 1880.16 
 1839.12 
 1585.66 
 1493.89 
 1336.53 
  958.154
  844.729
  783.583
    ⋮    
 2348.74 
 2340.71 
 2348.69 
 2356.62 
 2364.44 
 2372.37 
 2364.39 
 2380.33 
 2388.31 
 2396.29 
 2388.31 
 2380.33 

The minimum BIC is 408.65, which is obtained at $\rho = 28.853.$

In [73]:
_, optidx = findmin(BICpath)
Out[73]:
(408.65314600541274, 41)
In [74]:
rhopath[optidx]
Out[74]:
28.85335904855635

Now let's plot coefficients against $\log(\rho)$. The dotted line indicates coefficients at the optimal $\rho$ we found above.

In [205]:
using Plots;  pyplot();
log_rho = [log.(rhopath[1:end-1]); -10] 
plot(log_rho, betapath', xaxis = "log(ρ)", yaxis = "β̂(ρ)", width=0.9, label="") 
title!("AMES Housing Data Solution Path using classopath")
plot!(fill(log_rho[optidx], 20), linspace(minimum(betapath), maximum(betapath), 20), 
    color="black", label="ρ=28.853", linestyle=:dot, linewidth=1.3, size=(600,400))
Out[205]:

This time we plot the coefficients against their $\ell_1$ norm. As before, the dotted line indicates the coefficients at the optimal $\rho$.

In [206]:
L1_beta = mapslices(sum, abs.(betapath'), 2)
plot(L1_beta, betapath', xaxis = "||β̂(ρ)||₁ ", 
    yaxis = "β̂(ρ)", label="", flip=true, width=0.9) 
title!("AMES Housing Data Solution Path using classopath")
plot!(fill(L1_beta[optidx], 20), linspace(minimum(betapath), maximum(betapath), 20), 
    color="black", label="", linestyle=:dot, size=(600,400), width=1.3)
Out[206]:

Optimal model by BIC

Now, let's inspect the variables selected at $\rho=28.853.$ A total of 38 features have non-zero coefficients.

In [75]:
idx_selected = find(x -> x != 0, betapath[:, optidx])
selectedCovs = [beta_names[idx_selected] betapath[idx_selected, optidx]]
p = sortperm(selectedCovs[:, 2], by=abs, rev=true)
selectedCovs = [convert(Array{String}, selectedCovs[p, 1]) convert(Array{Float64}, selectedCovs[p, 2])]
Out[75]:
38×2 Array{Any,2}:
 "logOverallQual"          0.31048    
 "logGrLivArea"            0.28171    
 "YearBuilt"               0.134894   
 "BsmtFinSF1"              0.101268   
 "logLotArea"              0.0957966  
 "YearRemodAdd"            0.0930297  
 "SaleCondition_Abnorml"  -0.0858377  
 "SaleCondition_Partial"   0.0732245  
 "GarageCars"              0.0722392  
 "logX1stFlrSF"            0.0667617  
 "Fireplaces"              0.0443423  
 "KitchenQual_Ex"          0.0430633  
 "KitchenQual_TA"         -0.0430633  
 ⋮                                    
 "SaleCondition_Normal"    0.0126133  
 "logLotFrontage"          0.0117567  
 "CentralAir_N"           -0.0101404  
 "CentralAir_Y"            0.0101404  
 "OverallCond_5"          -0.00924272 
 "OpenPorchSF"             0.00715994 
 "OverallCond_3"          -0.00668742 
 "MSZoning_RL"             0.00426535 
 "MSZoning_RM"            -0.00426535 
 "BsmtExposure_Gd"         0.00266953 
 "BsmtExposure_No"        -0.00266953 
 "BedroomAbvGr"           -0.000566359
In [208]:
using ColorSchemes, Colors; gr();
In [209]:
col1 = get(ColorSchemes.rainbow, 0.9)
col2 = get(ColorSchemes.rainbow, 0.31)
col = fill(col2, 38)
idx_neg = find(selectedCovs[:, 2] .< 0)
col[idx_neg] = col1;
bar(convert(Array{String}, selectedCovs[:, 1]), convert(Array{Float64}, selectedCovs[:, 2]), size=(600, 400), 
    xrotation=60, tickfont=7, xticks=(0:1:38, convert(Array{String}, selectedCovs[:, 1])), fillcolor=col,
    title="Selected coefficient estimates at rho=28.853")
Out[209]:
[Plot: bar chart of the 38 selected coefficient estimates at rho = 28.853, ordered by absolute magnitude from logOverallQual down to BedroomAbvGr.]

Looking at the bar chart above, the most influential variables include the overall quality score (logOverallQual), the living area (logGrLivArea), and the year the house was built (YearBuilt). The orange bars indicate variables with negative coefficient estimates. It makes sense that an abnormal sale condition such as a trade, foreclosure, or short sale (SaleCondition_Abnorml) is associated with a lower sale price. One interesting variable with a negative coefficient is KitchenAbvGr, a discrete variable that counts the number of kitchens above grade, i.e., above the ground. In general, one would expect one kitchen in a house, and more kitchens in a larger (and potentially more expensive) house, so it is odd that the coefficient estimate for KitchenAbvGr is negative. Looking at the distribution of KitchenAbvGr, most houses have 1 kitchen; two houses have the largest count, 3.

In [210]:
sort(countmap(ames_df5[:KitchenAbvGr]), rev=true)
Out[210]:
DataStructures.OrderedDict{Any,Int64} with 4 entries:
  3 => 2
  2 => 129
  1 => 2791
  0 => 3

When we inspect the sale price of houses with respect to the number of kitchens above ground, the two houses with the largest number of kitchens have an unusually low sale price.

In [211]:
scatter1 = scatter(ames_df5[:KitchenAbvGr], ames_df5[:SalePrice], 
    xlabel="KitchenAbvGr", ylabel="SalePrice", size=(500,300))
Out[211]:
[Plot: scatter of KitchenAbvGr vs. SalePrice.]

These two outliers (3 kitchens but very low sale prices) can explain why KitchenAbvGr has a negative coefficient estimate.

In [212]:
ames_df5[ames_df5[:KitchenAbvGr] .== 3, :]
Out[212]:
[Table output omitted: the two houses with KitchenAbvGr = 3, both in OldTown, with sale prices of 113000 and 106000.]

Below is the observed vs. fitted values plot. The fitted values appear close to the observed values, except for two observations that deviate far from the line.

In [76]:
fitted = X * betapath[:, optidx];
scatter(fitted, y, label="", xaxis = ("Fitted"), yaxis = ("Observed"), alpha=0.5) 
title!("Observed-Fitted Plot")
plot!(linspace(minimum(fitted), maximum(fitted), 20), linspace(minimum(fitted), maximum(fitted), 20), 
    label="", linestyle=:dot, linecolor=:red, size=(600,400), linewidth=3)
Out[76]:
[Plot: observed vs. fitted values with a 45-degree reference line.]

Let's inspect the two outliers.

In [214]:
ames_df6[find(y .< -6), :]
Out[214]:
[Table output omitted: the two outlying houses; both have SaleCondition "Abnorml", with logSalePrice of about 9.46 and 9.48.]
In [215]:
ames_df6[find(y .< -6), :SaleCondition]
Out[215]:
2-element Array{Any,1}:
 "Abnorml"
 "Abnorml"

Among other variables, it is noteworthy that the sale condition of both houses is "Abnorml". According to the data documentation, abnormal conditions include trades, foreclosures, and short sales, so it makes sense that the prices of these two houses are so low.

Below is the residual vs. fitted values plot, where we do not see a distinct pattern in the scatter of the residuals. As in the observed vs. fitted values plot, we see two residuals that deviate far from zero.

In [216]:
scatter(fitted, y - fitted, label="", xaxis = ("Fitted Value"), 
    yaxis = ("Residual"), alpha=0.5) 
title!("Residual-Fitted Plot")
plot!(linspace(minimum(fitted), maximum(fitted), 20), fill(0, 20), 
    linecolor=:red, linestyle=:dot, size=(600,400), linewidth=3)
Out[216]:
[Plot: residuals vs. fitted values with a horizontal reference line at zero.]

Finally, we calculate the predictive $R^2$, which depends on the PRESS (predicted residual sum of squares) statistic. The PRESS statistic is defined as $$\sum_{i=1}^n (y_i - \widehat{y}_{i, -i})^2,$$ where $y_i$ is the outcome of the $i$-th data point and $\widehat{y}_{i, -i}$ is the prediction for $y_i$ from a model trained on all the data except the $i$-th data point.

In [ ]:
PRESS = 0
for i in 1:size(X, 1)
    # build a model using all but i-th row 
    beta, = lsq_constrsparsereg(X[1:end .!= i, :], y[1:end .!= i], rhopath[optidx]; 
        Aeq = Aeq, beq = beq);
    # predict outcome 
    pred = dot(X[i, :], beta)
    # sum the squared deviations 
    PRESS += (pred - y[i])^2    
end
In [102]:
PRESS
Out[102]:
312.9013042952766

The predictive $R^2$ is one minus PRESS divided by the total sum of squares. We obtain $0.893$ for the predictive $R^2.$
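In symbols, $$R^2_{\text{pred}} = 1 - \frac{\mathrm{PRESS}}{\mathrm{SST}}, \qquad \mathrm{SST} = \sum_{i=1}^n y_i^2,$$ where the total sum of squares uses the standardized response $y$ (which has mean zero), matching the computation below.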

In [106]:
SST = sum(abs2.(y))
predR2 = 1 - PRESS / SST
Out[106]:
0.8929886100221353

Now we output the data in four text files.

In [93]:
writedlm("data4beta.txt", betapath)
writedlm("data4rho.txt", rhopath)
writedlm("data4fitted.txt", [fitted y])
writedlm("data4selected.txt", selectedCovs)

Refitted Lasso

We refit the linear model using the variables with non-zero coefficients at the optimal $\rho=28.853$. The $R^2$ from unregularized least squares equals 0.91.

In [107]:
# run regular least squares using nonzero covariates 
a = llsq(X[:, idx_selected], y; bias=false)
n, p = size(X[:, idx_selected])

# do prediction 
yp = X[:, idx_selected] * a

# sum of squares 
SS_tot = sum(abs2.(y))
SS_reg = sum(abs2.(yp))
SS_res = sum(abs2.(y - yp))

# deviance 
null_dev = SS_tot
deviance = SS_res

# R-squared from unregularized least squares 
1 - SS_res / SS_tot
Out[107]:
0.9105641248189718
In [108]:
null_dev, deviance
Out[108]:
(2924.0, 261.5104990293267)

When we conduct an F test based on the values obtained above, we get a p-value close to 0.
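With $n = 2925$ observations and $p = 38$ selected covariates, the statistic computed below is $$F = \frac{SS_{\mathrm{reg}}/(p-1)}{SS_{\mathrm{res}}/(n-p)},$$ which is compared against an $F_{p-1,\, n-p}$ distribution.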

In [221]:
# F statistic with df (p-1, n-p)
F_stat = (SS_reg / (p - 1)) / (SS_res / (n - p))

# p-value from F test
fdistccdf(p - 1, n - p, F_stat)
Out[221]:
0.0

However, this F-test is problematic as a selection process preceded the inference. Post-selection inference should follow a valid procedure (Taylor and Tibshirani, 2018).

References

De Cock, D. (2011). "Ames, Iowa: Alternative to the Boston housing data as an end of semester regression project." Journal of Statistics Education, 19(3).

Taylor, J. and Tibshirani, R. (2018). "Post-selection inference for $\ell_1$-penalized likelihood models." Canadian Journal of Statistics, 46, 41–61.