Author: Daniel Martinez
Project 4 provides a dataset of houses built in the state of Colorado. The task is to predict whether a house was built before 1980. This is a classification problem, and the model used is a GradientBoostingClassifier().
The objective of the project is to predict which houses were built before 1980. The classification model reaches an accuracy of 91%, and the features most correlated with the target are the number of bathrooms and the square footage.
Install the scikit-learn library in your Python environment with the following command:
pip install -U scikit-learn
Note:
Installing scikit-learn provides access to both classification and regression models. The classification part of the workflow is as follows:
Splitting the data into training and test sets is important: the model is fitted on the training data and evaluated on the test data, which is normally about 30% of the total. We use the following example:
X_train, X_test, y_train, y_test = train_test_split(X_pred, y_pred, test_size=0.34, random_state=76)
By convention, "X" is the feature data and "y" is the target variable, that is, the value to predict.
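The split described above can be tried on a small synthetic dataset. This is a minimal sketch, not the project's data: the array shapes and the `stratify=y` option are assumptions added for illustration, while `test_size=0.34` and `random_state=76` match the call shown above.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the dwellings data (assumption): 100 rows, 3 numeric features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))      # feature matrix
y = rng.integers(0, 2, size=100)   # binary target, like before1980

# Hold out roughly a third of the rows for evaluation.
# stratify=y keeps the class balance similar in both parts (an addition for illustration).
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.34, random_state=76, stratify=y
)

print(X_train.shape, X_test.shape)
```

Fixing `random_state` makes the split reproducible, so accuracy values can be compared between runs.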
Grand Questions:
In the graph, we see the relationship between the year built and the number of bathrooms. That information can help create a model to predict whether a house was built before 1980: houses built before 1980 have fewer bathrooms than houses built after 1980.
In the graph, we see the relationship between the year built and the square footage. That information can help create a model to predict whether a house was built before 1980: the graph shows that houses built before 1980 are bigger than houses built after 1980.
In the graph, we see the relationship between the year built and the number of bathrooms, which can help create a model to predict whether a house was built before 1980.
X_train, X_test, y_train, y_test = train_test_split(X_pred, y_pred, test_size=0.34, random_state=76)
clf_df = GradientBoostingClassifier()  # earlier attempt: DecisionTreeClassifier(random_state=70)
clf_df = clf_df.fit(X_train, y_train)
y_pred = clf_df.predict(X_test)
I built a model that predicts with 91% accuracy. To reach this value I tried several classification models. The target has only two values: 0 (not built before 1980) and 1 (built before 1980). The first model I tried was DecisionTreeClassifier(), which reached 90% accuracy, but I was unable to plot the decision tree, so I decided to try another model. The second was GradientBoostingClassifier(), which provides an accuracy of 91%. The last was GaussianNB(), which did not perform as well as GradientBoostingClassifier().
GradientBoostingClassifier() was my final choice.
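A comparison like the one described can be sketched by fitting the three classifiers on the same split and scoring each on the test set. The data here comes from `make_classification`, which is an assumption for illustration only, so the printed accuracies will not match the project's 90% and 91%.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier

# Synthetic binary classification data (assumption, not the dwellings dataset).
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.34, random_state=76)

# The three model families mentioned in the text.
models = {
    "DecisionTreeClassifier": DecisionTreeClassifier(random_state=70),
    "GradientBoostingClassifier": GradientBoostingClassifier(random_state=0),
    "GaussianNB": GaussianNB(),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    # score() returns mean accuracy on the held-out test data.
    scores[name] = model.score(X_test, y_test)
    print(f"{name}: {scores[name]:.3f}")
```

Using one shared split keeps the comparison fair, because every model is evaluated on exactly the same rows.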
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.89 | 0.85 | 0.87 | 2572 |
| 1 | 0.91 | 0.94 | 0.92 | 4302 |
| accuracy | | | 0.90 | 6874 |
| macro avg | 0.90 | 0.89 | 0.89 | 6874 |
| weighted avg | 0.90 | 0.90 | 0.90 | 6874 |
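To make the columns of the report easier to interpret, the sketch below recomputes precision, recall, F1 and accuracy for class 1 from a confusion matrix. The eight labels are made up for illustration and are not taken from the project's predictions.

```python
from sklearn.metrics import confusion_matrix, f1_score, precision_score, recall_score

# Toy labels (assumption): 1 = built before 1980, 0 = not before 1980.
y_true = [0, 0, 0, 1, 1, 1, 1, 1]
y_hat = [0, 0, 1, 1, 1, 1, 1, 0]

# For binary labels, ravel() yields the counts in the order tn, fp, fn, tp.
tn, fp, fn, tp = confusion_matrix(y_true, y_hat).ravel()

precision = tp / (tp + fp)   # of the houses predicted "before 1980", how many really are
recall = tp / (tp + fn)      # of the houses really "before 1980", how many were found
f1 = 2 * precision * recall / (precision + recall)   # harmonic mean of the two
accuracy = (tp + tn) / (tp + tn + fp + fn)           # share of all predictions that are right

print(precision, recall, f1, accuracy)
# The hand-computed values agree with scikit-learn's helpers.
print(precision_score(y_true, y_hat), recall_score(y_true, y_hat), f1_score(y_true, y_hat))
```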
The pairplot provides crucial information about how the features correlate with the target variable.
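The correlations that the pairplot shows visually can also be listed as numbers. This is a minimal sketch on a made-up frame: the column names follow the dataset, but the values are generated so that `numbaths` is related to `before1980` and `livearea` is not.

```python
import numpy as np
import pandas as pd

# Toy data (assumption): fewer bathrooms when before1980 == 1, living area unrelated.
rng = np.random.default_rng(1)
n = 200
before1980 = rng.integers(0, 2, size=n)
toy = pd.DataFrame({
    "numbaths": 3 - before1980 + rng.integers(0, 2, size=n),
    "livearea": rng.normal(1800, 300, size=n),
    "before1980": before1980,
})

# Pearson correlation of every feature with the target, strongest first.
target_corr = (
    toy.corr()["before1980"]
    .drop("before1980")
    .sort_values(key=abs, ascending=False)
)
print(target_corr)
```

Sorting by absolute value puts strong negative correlations next to strong positive ones, since both are useful to a classifier.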
4. Can you describe the quality of your classification model using 2-3 evaluation metrics? You need to provide an interpretation of each evaluation metric when you provide the value.
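The performance of the model is evaluated with the AUC (area under the ROC curve), which is 96%. The AUC measures how well the model ranks houses built before 1980 above the others: 50% corresponds to random guessing and 100% to a perfect ranking.
This graph shows the features that the fitted model ranks as most important.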
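An AUC value can be computed from the predicted probabilities with `roc_auc_score`. The sketch below uses synthetic data from `make_classification` (an assumption), so its result is unrelated to the 96% reported for the project.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic binary data (assumption, not the dwellings dataset).
X, y = make_classification(n_samples=600, n_features=8, n_informative=4, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.34, random_state=76)

clf = GradientBoostingClassifier(random_state=0).fit(X_train, y_train)

# AUC is computed from scores or probabilities, not from hard 0/1 predictions.
proba_pos = clf.predict_proba(X_test)[:, 1]   # probability of class 1
auc = roc_auc_score(y_test, proba_pos)
print(f"AUC: {auc:.3f}")
```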
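The values behind such a chart come from the fitted model's `feature_importances_` attribute. Below is a minimal sketch on synthetic data; the feature names `feature_0` to `feature_4` are hypothetical placeholders, while `f_names` and `f_values` mirror the column names used in the project script.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data with hypothetical feature names (assumption).
X, y = make_classification(
    n_samples=400, n_features=5, n_informative=3, n_redundant=0, random_state=0
)
feature_names = [f"feature_{i}" for i in range(X.shape[1])]

clf = GradientBoostingClassifier(random_state=0).fit(X, y)

# One importance per feature; sort so the most important come first.
importances = (
    pd.DataFrame({"f_names": feature_names, "f_values": clf.feature_importances_})
    .sort_values("f_values", ascending=False)
    .reset_index(drop=True)
)
print(importances)
```

The importances are learned during fitting, so they describe the training data and the model rather than the test set.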
# %%
# Optional: install seaborn and scikit-learn from inside the notebook.
# import sys
# !{sys.executable} -m pip install seaborn scikit-learn

# %%
import pandas as pd
import numpy as np
import altair as alt
# import matplotlib
import seaborn as sns

# %%
# scikit-learn imports
from sklearn.model_selection import train_test_split  # split the data into training and test sets
from sklearn import tree
from sklearn.naive_bayes import GaussianNB
from sklearn.ensemble import GradientBoostingClassifier
from sklearn import metrics
from sklearn.metrics import r2_score
from sklearn.tree import DecisionTreeClassifier  # to build a classification tree
from sklearn.tree import plot_tree
from sklearn.model_selection import cross_val_score
from sklearn.metrics import confusion_matrix
# plot_confusion_matrix and plot_roc_curve were removed in scikit-learn 1.2;
# the Display classes replace them.
from sklearn.metrics import ConfusionMatrixDisplay, RocCurveDisplay

# %%
# dwellings_denver = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_denver/dwellings_denver.csv")
dwellings_ml = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_ml/dwellings_ml.csv")
dwellings_neighborhoods_ml = pd.read_csv("https://github.com/byuidatascience/data4dwellings/raw/master/data-raw/dwellings_neighborhoods_ml/dwellings_neighborhoods_ml.csv")
alt.data_transformers.enable('json')

# %%
dwellings_ml.head(5)

# %%
# Create a smaller dataframe for exploration (random sample of 1000 rows).
df = dwellings_ml.filter(['livearea', 'finbsmnt', 'basement', 'yearbuilt', 'nocars', 'numbdrm', 'numbaths', 'stories', 'yrbuilt', 'before1980']).sample(1000)

# %%
df

# %%
dwellings_ml.info()

# %%
dwellings_ml['livearea'].unique()
# This dataset did not have any object datatype column.

# %%
# Review whether there are any missing values.
df.isnull().sum()

# %%
df.isnull().values.any()
# The filtered dataset did not have any null values.

# %%
len(dwellings_ml)

# %%
df_duplicate = df.duplicated()
sum(df_duplicate)

# %%
df.shape

# %%
df.info()

# %%
df.describe()

# %%
df.skew()  # skewness of each column

# %%
sns.pairplot(df, hue='before1980')
corr = df.drop(columns='before1980').corr()
# heatmap.save("screenshot/heatmap.png")

# %%
sns.heatmap(corr, annot=True)

# %%
dwellings_ml.nunique()

# %%
df.columns

# %%
# alt.Chart(df).mark_circle(size=50).encode(
#     x="yrbuilt",
#     y="before1980",
#     color="Original"
# )

# %%
graph1 = (alt.Chart(df)
    .encode(
        x=alt.X('yrbuilt', scale=alt.Scale(zero=False), axis=alt.Axis(format='.0f')),
        y=alt.Y('finbsmnt', scale=alt.Scale(zero=False)),
        color='before1980:O')
    .mark_circle(size=50)
    .properties(width=800, height=600, title="Year built compared with square footage")
)
graph1.save("screenshot/graph1.png")
graph1

# %%
# Count the houses for each combination of year built and number of baths.
dat_count = (df
    .groupby(['yrbuilt', 'numbaths'])
    .agg(count=('nocars', 'size'))
    .reset_index())

# %%
graph3 = (alt.Chart(dat_count)
    .encode(
        alt.X('yrbuilt:O', scale=alt.Scale(zero=False), axis=alt.Axis(format='.0f')),
        alt.Y('numbaths:O', scale=alt.Scale(zero=False)),
        color=alt.Color('count', scale=alt.Scale(type='log')))
    .mark_rect()
    .properties(width=800, height=600, title="Year built by number of baths")
)
# Unused layered/faceted variant; 'a' and 'site' are not columns of dat_count.
# hart_two = alt.Chart().mark_rule().encode(x='a')
# (graph3 + hart_two).facet(row='site', data=dat_count)
graph3.save("screenshot/graph3.png")
graph3

# %%
boxplot = (alt.Chart(dat_count)
    .encode(
        alt.X('yrbuilt:O', scale=alt.Scale(zero=False), axis=alt.Axis(format='.0f')),
        alt.Y('numbaths', scale=alt.Scale(zero=False)))
    .mark_boxplot(size=3)
    .properties(width=650, height=600, title="Year built and number of baths")
)
boxplot.save("screenshot/boxplot.png")
boxplot

# %%
# for col in df.select_dtypes(include=['int']).columns:
#     print('We have {} unique values in {} column : {}'.format(len(df[col].unique()), col, df[col].unique()))
#     print('---' * 30)

# %%
# Correlation of every column with the target.
df[df.columns[:]].corr()['before1980'][:]

# %%
# Format the data for the classifier: drop the target and the year it is derived from.
X_pred = dwellings_ml.drop(['yrbuilt', 'before1980'], axis=1)
y_pred = dwellings_ml.before1980

# %%
X_pred.head().T

# %%
y_pred = dwellings_ml['before1980'].copy()
y_pred.head()  # data that we want to predict

# %%
# One-hot encoding: the dataset is already one-hot encoded and has no
# object columns, so get_dummies() is not needed.
# Preliminary classifier.
X_train, X_test, y_train, y_test = train_test_split(X_pred, y_pred, test_size=0.34, random_state=76)
clf_df = GradientBoostingClassifier()  # earlier attempt: DecisionTreeClassifier(random_state=70)
clf_df = clf_df.fit(X_train, y_train)

# %%
print(y_test.head(10))

# %%
print(X_train.head(10))

# %%
y_pred = clf_df.predict(X_test)

# %%
# accuracy of about 90%
print(metrics.classification_report(y_test, y_pred))

# %%
print(metrics.confusion_matrix(y_test, y_pred))
ConfusionMatrixDisplay.from_estimator(clf_df, X_test, y_test, display_labels=['not before 1980', 'before 1980'])

# %%
# plot the ROC curve
RocCurveDisplay.from_estimator(clf_df, X_test, y_test)

# %%
plot_df_features = pd.DataFrame(
    {'f_names': X_train.columns,
     'f_values': clf_df.feature_importances_}).sort_values('f_values', ascending=False)
# Use a separate name so the exploration dataframe df is not overwritten.
features_chart = (alt.Chart(plot_df_features.query('f_values > .011'))
    .encode(
        alt.X('f_values'),
        alt.Y('f_names', sort='-x'))
    .mark_bar())
features_chart.save("screenshot/features.png")

# %%
df_features = (pd.DataFrame(
    {'f_names': X_train.columns,
     'f_values': clf_df.feature_importances_})
    .sort_values('f_values', ascending=False))
df_features

# %%
# import matplotlib.pyplot as plt

# %%
# Decision tree plot (only applies when clf_df is a DecisionTreeClassifier).
# plt.figure(figsize=(15, 7.5))
# plot_tree(clf_df,
#           filled=True,
#           rounded=True,
#           class_names=['NO', 'YES'],
#           feature_names=X_pred.columns);

# %%
# This decision tree needs optimization.
# print(classification_report(y_test, predictions))

# %%
# Cost-complexity pruning of the decision tree.
# path = clf_df.cost_complexity_pruning_path(X_train, y_train)
# ccp_alphas = path.ccp_alphas
# ccp_alphas = ccp_alphas[:-1]
# clf_dts = []
# for ccp_alpha in ccp_alphas:
#     clf_dt = DecisionTreeClassifier(random_state=0, ccp_alpha=ccp_alpha)
#     clf_dt.fit(X_train, y_train)
#     clf_dts.append(clf_dt)

# %%
# train_score = [clf_dt.score(X_train, y_train) for clf_dt in clf_dts]
# test_score = [clf_dt.score(X_test, y_test) for clf_dt in clf_dts]
# fig, ax = plt.subplots()
# ax.set_xlabel('alpha')
# ax.set_ylabel('accuracy')
# ax.set_title('Accuracy vs alpha for train and test')
# ax.plot(ccp_alphas, train_score, marker='o', label="train", drawstyle="steps-post")
# ax.plot(ccp_alphas, test_score, marker='o', label="test", drawstyle="steps-post")
# ax.legend()
# plt.show()

# %%
# Cross-validation
# clf_dt = DecisionTreeClassifier(random_state=42, ccp_alpha=0.014)
# scores = cross_val_score(clf_dt, X_train, y_train, cv=5)
# df1 = pd.DataFrame(data={'tree': range(5), 'accuracy': scores})
# df1.plot(x='tree', y='accuracy', marker='o', linestyle='--')