Prediction with classification

Author: Daniel Martinez

Project 4 provides a dataset of houses built in the state of Colorado. The goal is to predict whether a house was built before 1980. This is a classification problem, and the model used is GradientBoostingClassifier().

Elevator Pitch

The objective of the project is to predict whether a house was built before 1980. The classification model reaches an accuracy of 91%; the features explored for their relationship with the target are the number of bathrooms and square footage.

The evaluation of the model gives an AUC (area under the ROC curve) score of 96%.

Technical Details

Install the scikit-learn library in your Python environment with the following command:

pip install -U scikit-learn

Note:

Installing scikit-learn gives access to both classification and regression models. The classification models include the following:

  1. DecisionTreeClassifier()
  2. GaussianNB()
  3. GradientBoostingClassifier()
  4. Others

Splitting the data into training and test sets is important: the model is fit on the training data and evaluated on the test data, which is normally about 30% of the total. This project holds out 34% of the data for testing, as in the following example:

X_train, X_test, y_train, y_test = train_test_split(X_pred, y_pred, test_size=0.34, random_state=76)

By convention, "X" holds the feature data and "y" holds the target variable, that is, the value to predict.
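The split described above can be tried on stand-in data. This is a minimal sketch, not the project's code: `X_demo` and `y_demo` are hypothetical random arrays, and only `test_size` and `random_state` are taken from the example above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical stand-in data: 100 rows, 3 features, binary target.
rng = np.random.default_rng(76)
X_demo = rng.normal(size=(100, 3))
y_demo = rng.integers(0, 2, size=100)

# Same test_size and random_state as the report's split.
X_train, X_test, y_train, y_test = train_test_split(
    X_demo, y_demo, test_size=0.34, random_state=76
)

print("train rows:", len(X_train))
print("test rows:", len(X_test))
print("test share:", len(X_test) / len(X_demo))
```

Fixing `random_state` makes the split reproducible, so accuracy figures can be compared between runs.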


Grand Questions


  1. Create 2-3 charts that evaluate potential relationships between the home variables and before1980.

In the graph, we see the relationship between the year built and the number of bathrooms. This information can help build a model that predicts whether a house was built before 1980: houses built before 1980 have fewer bathrooms than houses built after 1980.

In the graph, we see the relationship between the year built and square footage. This information can also help build the model: the graph shows that houses built before 1980 are bigger than houses built after 1980.

In the graph, we see the relationship between the year built and the number of bathrooms, which can help build a model that predicts whether a house was built before 1980.

  2. Can you build a classification model (before or after 1980) that has at least 90% accuracy for the state of Colorado to use (explain your model choice and which models you tried)?

X_train, X_test, y_train, y_test = train_test_split(X_pred, y_pred, test_size=0.34, random_state=76)
# initialize and fit the gradient boosting classifier
clf_df = GradientBoostingClassifier()
clf_df = clf_df.fit(X_train, y_train)
# predict the target for the test data
y_pred = clf_df.predict(X_test)

I built a model that predicts with 91% accuracy. To reach this value I tried several classification models. The target, before1980, takes only two values: 0 means the house was not built before 1980 and 1 means it was. The first model I tried was DecisionTreeClassifier(), which reached 90% accuracy, but I was unable to plot the decision tree, so I decided to try another model. The second model was GradientBoostingClassifier(), which gives an accuracy of 91%. The last model I tried was GaussianNB(), but it did not perform as well as GradientBoostingClassifier().

GradientBoostingClassifier() was my final choice.
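The comparison described above can be sketched as a loop over candidate models. This is an illustration under stated assumptions: the data comes from `make_classification` as a hypothetical stand-in for the dwellings data, so the printed accuracies will not match the figures in this report.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Hypothetical stand-in for the dwellings data.
X_demo, y_demo = make_classification(
    n_samples=500, n_features=8, n_informative=4, random_state=76
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.34, random_state=76
)

# The three model families tried in this report.
candidates = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=70),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=70),
    "GaussianNB": GaussianNB(),
}

# Fit each model on the same training data and score it on the same test data.
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, y_tr)
    scores[name] = accuracy_score(y_te, model.predict(X_te))

for name, acc in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name}: {acc:.3f}")
```

Scoring every candidate on the same held-out split keeps the comparison fair, since no model sees the test rows during fitting.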

              precision    recall  f1-score   support

           0       0.89      0.85      0.87      2572
           1       0.91      0.94      0.92      4302

    accuracy                           0.90      6874
   macro avg       0.90      0.89      0.89      6874
weighted avg       0.90      0.90      0.90      6874

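To see how the rows of the classification report relate to each other, the summary figures can be recomputed from the per-class values. This is a sketch using only the rounded numbers printed in the report, so the recomputed values can differ from the report in the last digit.

```python
# Values copied from the classification report (already rounded to 2 decimals).
report = {
    0: {"precision": 0.89, "recall": 0.85, "support": 2572},
    1: {"precision": 0.91, "recall": 0.94, "support": 4302},
}


def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


total = sum(row["support"] for row in report.values())
f1_scores = {label: f1(row["precision"], row["recall"]) for label, row in report.items()}

# Macro average: plain mean over classes, ignoring class size.
macro_precision = sum(row["precision"] for row in report.values()) / len(report)
# Weighted average: mean over classes weighted by support.
# The support-weighted recall is the same quantity as overall accuracy.
weighted_recall = sum(row["recall"] * row["support"] for row in report.values()) / total

for label, value in f1_scores.items():
    print(f"class {label} f1: {value:.2f}")
print(f"macro precision: {macro_precision:.2f}")
print(f"weighted recall: {weighted_recall:.2f}")
```

Precision for class 1 answers "of the houses predicted as built before 1980, what share really were?", while recall answers "of the houses really built before 1980, what share did the model find?".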
  3. Will you justify your classification model by detailing the most important features in your model (a chart and a description are a must)?

One of the most important steps is finding the features related to the prediction target, before1980 (derived from yrbuilt). In the heatmap, the features with the highest correlation with the target variable are numbdrm and finbsmnt, which is why they were used for the graphs above.

The pairplot provides useful information about the correlation of the features with the target variable.
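The correlation ranking that the heatmap and pairplot show visually can also be read off as numbers. This is a minimal sketch on a tiny hypothetical table: the column names follow the dataset, but the values are made up for illustration.

```python
import pandas as pd

# Tiny hypothetical sample; column names follow the dataset, values are made up.
demo = pd.DataFrame({
    "numbaths": [1, 1, 2, 2, 3, 3],
    "livearea": [900, 1500, 1200, 1100, 1600, 1000],
    "before1980": [1, 1, 1, 0, 0, 0],
})

# Pearson correlation of every feature with the target column.
target_corr = demo.corr()["before1980"].drop("before1980")

# Rank by absolute value: a strong negative correlation is as useful as a strong positive one.
ranked_corr = target_corr.abs().sort_values(ascending=False)
print(ranked_corr)
```

On the real data, the same two lines applied to the sampled DataFrame would give the ranking behind the heatmap.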

4. Can you describe the quality of your classification model using 2-3 evaluation metrics? You need to provide an interpretation of each evaluation metric when you provide the value.

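The performance of the model is evaluated with the AUC, which is 96%. AUC is the area under the ROC curve: a value of 1.0 means the model ranks every house built before 1980 above every later house, and 0.5 is no better than chance.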
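One way to interpret the AUC is as the share of (positive, negative) pairs in which the model gives the positive case the higher score. The sketch below computes that share directly; the function name `pairwise_auc` and the scores are hypothetical and are not taken from the project's model.

```python
from itertools import product


def pairwise_auc(pos_scores, neg_scores):
    """Share of (positive, negative) pairs where the positive scores higher; ties count half."""
    wins = 0.0
    for p, n in product(pos_scores, neg_scores):
        if p > n:
            wins += 1.0
        elif p == n:
            wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Hypothetical predicted probabilities of "built before 1980".
before_1980 = [0.9, 0.8, 0.4]  # houses that really were built before 1980
after_1980 = [0.7, 0.3, 0.2]   # houses that were not

auc = pairwise_auc(before_1980, after_1980)
print(f"AUC: {auc:.3f}")
```

In scikit-learn, the same quantity should be available through `roc_auc_score` from `sklearn.metrics`, called with the true labels and the predicted probabilities of the positive class.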


This graph shows the most important features in the model.
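The ranking behind such a chart comes from the fitted model's `feature_importances_` attribute. The following is a sketch on hypothetical data: `feature_names` and the `make_classification` data are placeholders, not the dwellings columns.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Hypothetical data and feature names.
feature_names = [f"feature_{i}" for i in range(6)]
X_demo, y_demo = make_classification(
    n_samples=300, n_features=6, n_informative=3, random_state=76
)

model = GradientBoostingClassifier(random_state=70).fit(X_demo, y_demo)

# Pair each name with its importance and sort from most to least important.
ranked = sorted(
    zip(feature_names, model.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, importance in ranked:
    print(f"{name}: {importance:.3f}")
```

The importances are normalized to sum to 1, so each value can be read as that feature's share of the total.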


Appendix: Python code

# %% 
#import sys
#!{sys.executable} -m pip install seaborn scikit-learn
# installation of seaborn and scikit-learn

# %%
import pandas as pd
import numpy as np 
import altair as alt
#import matplotlib 
import seaborn as sns
# %%
# from sklearn
from sklearn.model_selection import train_test_split # to split the test and training data and run (creation of model selection)
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
from sklearn.metrics import r2_score

from sklearn.tree import DecisionTreeClassifier # to build a classification tree
from sklearn.tree import plot_tree
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
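# plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
# the Display classes below replace them
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay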
# %%
#
dwellings_denver = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_denver/dwellings_denver.csv")
dwellings_ml = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_ml/dwellings_ml.csv")
dwellings_neighborhoods_ml = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_neighborhoods_ml/dwellings_neighborhoods_ml.csv")   
alt.data_transformers.enable('json') 
#df = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_denver/dwellings_denver.csv")
# %%
dwellings_ml.head(5)

# %% 
# create a 1000-row sample with the columns of interest
df = dwellings_ml.filter(['livearea', 'finbsmnt', 
    'basement', 'yearbuilt', 'nocars', 'numbdrm', 'numbaths', 
    'stories', 'yrbuilt', 'before1980']).sample(1000)

# %% 
df
# %%
dwellings_ml.info()
# %%
dwellings_ml['livearea'].unique() 
#this dataset did not have any object datatype column. 
# %%
#to review if I have any missing values 
df.isnull().sum()
# %%
df.isnull().values.any()
# the data did not have any null values
# this confirms that the filtered dataset has no missing values
# %%
len(dwellings_ml)
# %%

df_duplicate= df.duplicated()
sum(df_duplicate)
# %%
df.shape
# %%
df.info()

# %%
df.describe()
# %%
df.skew()
# skewness of each column
# %%
sns.pairplot(df, hue = 'before1980')
corr = df.drop(columns = 'before1980').corr()
#heatmap.save("screenshot/heatmap.png")
# %%
sns.heatmap(corr, annot = True)

# %%
dwellings_ml.nunique()
# number of unique values in each column
#%%
df.columns
# %%
#alt.Chart(df).mark_circle(size=50).encode(
#    x="yrbuilt", 
#    y="before1980",
#    color ="Original"#,
    #tooltip = []
#)#.interactive()
# %%
graph1= (alt.Chart(df)
    .encode(
        x = alt.X('yrbuilt', scale=alt.Scale(zero=False), axis=alt.Axis(format='.0f')), 
        y = alt.Y('finbsmnt', scale=alt.Scale(zero=False)), 
        color = 'before1980:O')
    .mark_circle(size=50)
    .properties(width=800,
        height=600,
        title="Year built compared with finished basement square footage")
    )
graph1.save("screenshot/graph1.png")
graph1
# %%
dat_count = (df
    .groupby(['yrbuilt', 'numbaths'])
    .agg(count = ('nocars', 'size'))
    .reset_index())
    
# %%
graph3=(alt.Chart(dat_count)
    .encode(
        alt.X('yrbuilt:O',
            scale = alt.Scale(zero=False),
            axis=alt.Axis(format='.0f')), 
        alt.Y('numbaths:O',scale = alt.Scale(zero=False)), 
        color = alt.Color('count', 
            scale=alt.Scale(type='log')))
    .mark_rect()
    .properties(width=800,
        height=600,
        title="Year built by number of bathrooms")

)
# removed an unused rule layer and facet that referenced columns
# ('a', 'site') not present in dat_count
graph3.save("screenshot/graph3.png")
graph3
# %%
boxplot = (alt.Chart(dat_count)
    .encode(
        alt.X('yrbuilt:O',
            scale = alt.Scale(zero=False),
            axis=alt.Axis(format='.0f')), 
        alt.Y('numbaths',scale = alt.Scale(zero=False)))
    .mark_boxplot(size = 3)
    .properties(width=650, 
        height=600,
        title="year built and number of baths")
        
)

boxplot.save("screenshot/boxplot.png")
boxplot
# %%
#for col in df.select_dtypes(include=['int']).columns:
#    print('We have {} unique values in  {} column : {}'.format(len(df[col].unique()),col, df[col].unique()))
#    print('---'*30)
# %%
# correlation of every column with the target
df.corr()['before1980']
# %%
# formatting the data so it is ready for the classifier
X_pred = dwellings_ml.drop(['yrbuilt','before1980'], axis=1)
y_pred = dwellings_ml.before1980
# %%
X_pred.head().T
# %%
y_pred= dwellings_ml['before1980'].copy()
y_pred.head()
#data that we want to predict 
# %%
# One-hot encoding: the dataset is already one-hot encoded and
# has no object columns, so there is nothing to convert to
# categorical data with get_dummies().
# Preliminary classifier.

X_train, X_test, y_train, y_test = train_test_split(X_pred, y_pred, test_size=0.34, random_state=76)
# initialize the gradient boosting classifier
# (earlier attempt: DecisionTreeClassifier(random_state=70))
clf_df = GradientBoostingClassifier()
clf_df = clf_df.fit(X_train, y_train)
#clf = GradientBoostingClassifier()
#clf = clf.fit(X_train, y_train)
#predict_p =  clf.predict(X_test)
#clf_df = classifier
# %%
print(y_test.head(10))

# %%
print(X_train.head(10))
# %%
y_pred =clf_df.predict(X_test)
# %%
# classification report: accuracy of about 90%
print(metrics.classification_report(y_test, y_pred))

#df4 = pd.DataFrame(get_classification_report)
#print(report.to_markdown())
#y_pred
# %%
print(metrics.confusion_matrix(y_test, y_pred))
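# ConfusionMatrixDisplay.from_estimator replaces plot_confusion_matrix (removed in scikit-learn 1.2)
metrics.ConfusionMatrixDisplay.from_estimator(clf_df, X_test, y_test, display_labels=['Built 1980 or later', 'Built before 1980'])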
# %% 
#plot the roc curve
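# RocCurveDisplay.from_estimator replaces plot_roc_curve (removed in scikit-learn 1.2)
metrics.RocCurveDisplay.from_estimator(clf_df, X_test, y_test)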

# %%
plot_df_features = pd.DataFrame(
    {'f_names': X_train.columns, 
    'f_values': clf_df.feature_importances_}).sort_values('f_values', ascending = False)
# named feature_chart so the chart does not overwrite the df DataFrame
feature_chart = (alt.Chart(plot_df_features.query('f_values > .011'))
    .encode(
        alt.X('f_values'),
        alt.Y('f_names', sort = '-x'))
    .mark_bar())

feature_chart.save("screenshot/features.png")
# %% 
df_features = (pd.DataFrame(
        {'f_names': X_train.columns, 
        'f_values': clf_df.feature_importances_})
    .sort_values('f_values', ascending = False))
df_features
# %%
#import matplotlib.pyplot as plt

# %%
#plt.figure(figsize=(15,7.5))
#plot_tree(clf_df, 
#        filled=True, 
#        rounded=True, 
#        class_names=['NO','YES'], 
#        feature_names = X_pred.columns);
# %%
#this decision tree needs optimization due to 

#print(classification_report(y_test, predictions))
# %%
#path = clf_df.cost_complexity_pruning_path(X_train, y_train)
#ccp_alphas = path.ccp_alphas
#ccp_alphas= ccp_alphas[:-1]

#clf_dts= []

#for ccp_alpha in ccp_alphas:
#    clf_dt = DecisionTreeClassifier(random_state= 0, ccp_alpha=ccp_alpha)
#    clf_dt.fit(X_train, y_train)
#    clf_dts.append(clf_dt)
# %%
#train_score = [clf_dt.score(X_train, y_train) for clf_dt in clf_dts]
#test_score = [clf_dt.score(X_test, y_test) for clf_dt in clf_dts]

#fig, ax = plt.subplots()
#ax.set_xlabel('alpha')
#ax.set_ylabel('accuracy')
#ax.set_title('Accuracy vs alpha for train and test')
#ax.plot(ccp_alphas, train_score, marker='o', label="train", drawstyle="steps-post")
#ax.plot(ccp_alphas, test_score, marker='o', label="test", drawstyle="steps-post")
#ax.legend()
#plt.show()
# %%
#cross Validation
#clf_dt = DecisionTreeClassifier(random_state=42, ccp_alpha=0.014)

#scores = cross_val_score(clf_dt, X_train, y_train, cv=5)
#df1 = pd.DataFrame(data = {'tree':range(5), 'accuracy':scores})

#df1.plot(x='tree', y='accuracy', marker='o', linestyle='--')
# %%