In this notebook we analyse the diamonds dataset, a collection of almost 54,000 diamonds.
To carry out interesting analyses and predictions on this dataset, we chose the column named "price" as the target variable. This lets us predict it with several different methods and compare the results obtained.
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
### Keeping safe from any disaster :) ###
%autosave 30
Image source: Diamond Education Learning
All the basic information about the diamond cut is given above. The side view of a diamond shows the terminology and the meaning of its features. Thus table length, table width, diamond depth, depth%, and table% are very important features for determining the price of a diamond.
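As a quick, hypothetical sanity check (not part of the original analysis), the depth column should roughly match the usual definition depth% = 2·z / (x + y) · 100. Below is a minimal sketch, loading diamonds.csv on its own since the dataframe is only created later in the notebook:
# Sketch: verify that depth is approximately 2*z/(x+y)*100 (standard definition for this dataset)
import pandas as pd
df_check = pd.read_csv("diamonds.csv")
recomputed_depth = 2 * df_check["z"] / (df_check["x"] + df_check["y"]) * 100
print("Median absolute difference:", (df_check["depth"] - recomputed_depth).abs().median())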
For these analyses we used some of the libraries covered during the course, plus a few others used following their documentation. Below is the list:
ml_visualization
seaborn
### Imports ###
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly
import plotly.offline as py
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from IPython.display import Markdown, display
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
import seaborn as sns
import scipy.stats as st
import ml_utilities
import ml_visualization
### Reading .csv dataset ###
dfName = "diamonds.csv"
df = pd.read_csv(dfName)
### Describing the df characteristics ###
df.describe()
### How many non-null fields are there? ###
df.info()
df.head()
### Dropping unnecessary column. Additional index ###
df.drop('Unnamed: 0', axis=1, inplace=True)
df_mapped=df.copy()
### Remapping all necessary columns ###
df_mapped['color'] = df['color'].replace(['J','I','H','G','F','E','D'],[1,2,3,4,5,6,7])
df_mapped['cut'] = df['cut'].replace(['Fair','Good','Very Good','Premium','Ideal'],[1,2,3,4,5])
df_mapped['clarity'] = df['clarity'].replace(['I3','I2','I1','SI2','SI1','VS2','VS1','VVS2','VVS1','IF'],[1,2,3,4,5,6,7,8,9,10])
# Function to print text as Markdown
def printmd(string):
display(Markdown(string))
# # Scatter plot of the carat and price
# fig=make_subplots(rows=1, cols=1)
# fig.add_trace(go.Scatter(x=df_mapped["carat"], y=df_mapped["price"], mode="markers", name="Diamonds price basing on carats"))
# fig.update_layout(
# title="Diamonds price basing on carats",
# xaxis_title="Carat",
# yaxis_title="Price",
# legend_title="Legend Title",
# )
plt.plot(df_mapped["carat"], df_mapped["price"], 'o', color='blue');
In this plot we can see an alternative and colourful representation of how each feature influences the price.
It is useful to look at the legend on the right-hand side, which explains how to interpret each element of the plot.
# showing how the features could influence the price
plt.figure(figsize=(16,8))
sns.scatterplot(data=df, x="carat", y="price", hue="color",size="cut", style="clarity", sizes=(10,250))
plt.title("Influence of attributes over price")
plt.show()
In the following set of charts we consider how the categorical attributes influence the price variable. They give an alternative view of how the price varies for each categorical variable. We expected the price quantiles to increase as the quality of the categorical variables increases, and indeed, despite the numerous outliers, this pattern is easily identifiable in the boxplots.
categoricalAttributes=["clarity", "cut", "color"]
targetVariable="price"
for attribute in categoricalAttributes:
fig = px.box(data_frame=df_mapped,orientation="h",y=attribute,x="price",title=f"{targetVariable} related to {attribute}")
fig.show()
This set of plots shows the distribution of each column. Clarity, color and cut take values in a discrete space due to their categorical nature, while the remaining columns, including price, are distributed over a continuous range.
df_mapped.hist(bins=50, figsize=(20,15))
plt.suptitle("Features distributions")
As said before, our analyses focus on the price target variable, since it says a lot about the diamond. Let's do a pre-analysis to see, from the correlation values, which features are most closely correlated with price.
### calculation of the correlation coefficient between price and all other fields ###
corr_matrix = df_mapped.corr()
print(corr_matrix["price"].sort_values(ascending = False))
To keep the computation time reasonable, we use a sample of the dataset when drawing the plots of each linear regression.
rows=1
cols=0
#split dataset in features and target variable
feature_cols = ['carat', 'table', 'x', 'y', 'z', 'depth']
# feature_cols = ['depth']
valid_portion=0.30
fig = make_subplots(rows=2, cols=3, subplot_titles=("Linear Regression for Carat feature", "Linear Regression for Table feature","Linear Regression for X feature",
"Linear Regression for Y feature", "Linear Regression for Z feature", "Linear Regression for Depth feature"))
fig['layout']['xaxis1']['title']=feature_cols[0]
fig['layout']['xaxis2']['title']=feature_cols[1]
fig['layout']['xaxis3']['title']=feature_cols[2]
fig['layout']['xaxis4']['title']=feature_cols[3]
fig['layout']['xaxis5']['title']=feature_cols[4]
fig['layout']['xaxis6']['title']=feature_cols[5]
for i in range(1, 7):
fig['layout']['yaxis{}'.format(i)]['title']="price"
for i in feature_cols:
X = df_mapped[i] # Features
y = df_mapped.price.astype('float64') # Target variable
train_x, validation_x, train_y, validation_y = train_test_split(X, y, test_size=valid_portion, random_state=1)
train_x=train_x.values.reshape(-1, 1)
validation_x=validation_x.values.reshape(-1, 1)
train_y=train_y.values.reshape(-1, 1)
validation_y=validation_y.values.reshape(-1, 1)
# Training a LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(train_x, train_y)
# Obtaining the predictions
train_y_predicted = lin_reg.predict(train_x)
# Computing the RMSE
rmse = np.sqrt(mean_squared_error(train_y, train_y_predicted))
printmd(f"**Data Information on {i}**")
print('Train RMSE: ', rmse)
# Obtaining the predictions (validation) and computing the RMSE
validation_y_predicted = lin_reg.predict(validation_x)
rmse = np.sqrt(mean_squared_error(validation_y, validation_y_predicted))
print('Validation RMSE: ', rmse)
print('R2 score:', lin_reg.score(validation_x, validation_y))
w=lin_reg.coef_[0][0]
b=lin_reg.intercept_
y_hat = b +w*X
x_line = np.array([np.min(X), np.max(X)]) # Try changing the endpoints to extrapolate the line
y_line = b + w*x_line
if cols==3:
cols=0
rows=2
cols=cols+1
fig.add_trace(go.Scatter(x=pd.Series(X).sample(n=500, random_state=132), y=pd.Series(y).sample(n=500, random_state=132), mode="markers", name="data"), row=rows, col=cols)
fig.add_trace(go.Scatter(x=pd.Series(X).sample(n=500, random_state=132), y=pd.Series(y_hat).sample(n=500, random_state=132), mode="markers", name="estimate"),row=rows, col=cols)
fig.add_trace(go.Scatter(x=pd.Series(X).sample(n=500, random_state=132), y=pd.Series(y_hat).sample(n=500, random_state=132), mode="lines", name="regression line"),row=rows, col=cols)
# # Draw the vertical lines corresponding to the errors (residuals)
# for i in range(len(X)):
# data.append(go.Scatter(x=[X[i], X[i]], y=[y[i], y_hat[i]], mode="lines",
# showlegend=False, line=dict(color="gray", width=0.5)),)
fig.show()
# Correlation heatmap of all features, including the price column
fig=plt.figure(figsize=(15,8))
sns.heatmap(df_mapped.corr(), linewidths=3, annot=True)
plt.title("Correlation matrix", size=20)
#split dataset in features and target variable
feature_cols = ['clarity', 'carat', 'cut', 'color','depth','table','x', 'y', 'z']
X = df_mapped[feature_cols] # Features
y = df_mapped.price.astype('float64') # Target variable
# split X and y into training and testing sets
valid_portion = 0.5
train_x, validation_x, train_y, validation_y = train_test_split(X, y, test_size=valid_portion, random_state=1)
We can see that with RandomForest both the RMSE and the R2 look very good. But, as we will see later on, to check how reliable these numbers are we need to do a K-Fold cross-validation.
# Training a RandomForestRegressor
rndfor_reg = RandomForestRegressor()
rndfor_reg.fit(train_x, train_y)
# Obtaining the predictions
train_y_predicted = rndfor_reg.predict(train_x)
# Computing the RMSE
rmse = np.sqrt(mean_squared_error(train_y, train_y_predicted))
printmd("**Regressor Information**")
print('Train RMSE: ', rmse)
# Obtaining the predictions (validation) and computing the RMSE
validation_y_predicted = rndfor_reg.predict(validation_x)
rmse = np.sqrt(mean_squared_error(validation_y, validation_y_predicted))
print('Validation RMSE: ', rmse)
print('R2 score:', rndfor_reg.score(validation_x, validation_y))
printmd("**Information about feature importance according to the Regressor**")
for i, j in zip(feature_cols, rndfor_reg.feature_importances_):
print(i, j)
Compared to the RandomForest regressor, we can see that the linear one only makes things worse, both in terms of score and RMSE. This is to be expected.
# Training a LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(train_x, train_y)
# Obtaining the predictions
train_y_predicted = lin_reg.predict(train_x)
# Computing the RMSE
rmse = np.sqrt(mean_squared_error(train_y, train_y_predicted))
print('Train RMSE: ', rmse)
# Obtaining the predictions (validation) and computing the RMSE
validation_y_predicted = lin_reg.predict(validation_x)
rmse = np.sqrt(mean_squared_error(validation_y, validation_y_predicted))
print('Validation RMSE: ', rmse)
print('R2 score:', lin_reg.score(validation_x, validation_y))
By doing K-Fold cross-validation we can see that the values we previously obtained with the RandomForestRegressor were overly optimistic. The cross-validated RMSE is about 1770.62.
# Hyperparameter optimization with cross-validation
reg = RandomForestRegressor()
param_grid = [{'n_estimators': [10, 50, 100], 'max_depth': [None, 2, 3], 'random_state': [1234]}]
# With shuffle=False the folds are temporally ordered
# (if the patterns have not been shuffled beforehand)
experiment_gscv = GridSearchCV(reg, param_grid, \
cv=KFold(n_splits=4), \
scoring='neg_mean_squared_error')
experiment_gscv.fit(X, y)
# Print results
printmd('**Parameter combinations:**')
print(experiment_gscv.cv_results_['params'])
printmd('**Mean RMSE per combination:**')
print( np.sqrt(-experiment_gscv.cv_results_['mean_test_score']))
printmd('**Best combination:**')
print(experiment_gscv.best_params_)
printmd("**Mean RMSE of the best combination:**")
print('Mean RMSE of the best combination: %.3f' % np.sqrt(-experiment_gscv.best_score_))
The basic idea is to remove the outliers from the validation set and see whether they actually worsen the R2 (although the R2 obtained so far is very good). But first...
Q1 = df_mapped.quantile(0.25)
Q3 = df_mapped.quantile(0.75)
IQR=Q3-Q1
df_clean=df_mapped[~((df_mapped<(Q1-1.5*IQR))|(df_mapped>(Q3+1.5*IQR))).any(axis=1)]
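As a small, hypothetical check (not in the original notebook), we can count how many rows the IQR rule above removes:
# Sketch: number of rows flagged as outliers by the IQR rule
removed_rows = len(df_mapped) - len(df_clean)
print(f"Rows removed by the IQR filter: {removed_rows} ({removed_rows / len(df_mapped):.1%} of the dataset)")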
cols = df_mapped.columns
all_z_scores=pd.DataFrame()
for col in cols:
col_zscore = col + '_zscore'
all_z_scores[col_zscore] = (df_mapped[col] - df_mapped[col].mean())/df_mapped[col].std(ddof=0)
all_outliers={}
arr=[]
for col in all_z_scores.columns:
all_outliers[col+"_outlier"]=np.sum(all_z_scores[col]>3)+np.sum(all_z_scores[col]<-3)
outliers=pd.DataFrame(all_outliers, index=[0])
outliers=outliers.sort_values(by=0, axis=1, ascending=True)
px.histogram(y=outliers.columns, x=outliers.loc[0], title="Number of outliers by every column in the dataframe",labels={ # replaces default labels by column name
"sum of x": "Number of outliers", "y": "Column"
})
feature_cols = ['clarity', 'carat', 'cut', 'color','depth','table','x', 'y', 'z']
X_outliers = df_clean[feature_cols] # Features
y_outliers = df_clean.price.astype('float64') # Target variable
# split X and y into training and testing sets
valid_portion = 0.5
# Split random to train, test set
train_x_outliers, validation_x_outliers, train_y_outliers, validation_y_outliers = train_test_split(X_outliers, y_outliers, test_size=valid_portion, random_state=1)
print('R2 score:', rndfor_reg.score(validation_x_outliers, validation_y_outliers)) # the R2 score improves only marginally
We realised that some variables have a correlation with the price that is not strictly linear.
In fact, we can see from these scatter plots that the variables x, y, z and depth do not follow a linear trend.
allGraphs=[]
fig, ax = plt.subplots(1, 4, figsize=(70, 20))
ax[0].scatter(x = df_mapped["depth"], y = df_mapped["price"])
ax[0].set_xlabel("Depth")
ax[0].set_ylabel("Price")
ax[0].set_title('Correlation between the depth and the price')
ax[1].scatter(x =df_mapped["x"], y=df_mapped["price"])
ax[1].set_xlabel("X")
ax[1].set_ylabel("Price")
ax[1].set_title('Correlation between the x and the price')
ax[2].scatter(x = df_mapped["y"], y = df_mapped["price"])
ax[2].set_xlabel("Y")
ax[2].set_ylabel("Price")
ax[2].set_title('Correlation between the y and the price')
ax[3].scatter(y =df_mapped["price"], x=df_mapped["z"])
ax[3].set_xlabel("Z")
ax[3].set_ylabel("Price")
ax[3].set_title('Correlation between the z and the price')
plt.show()
cols=0
def linear_regression(data, power, models_to_plot, col, fig):
#initialize predictors:
global cols
predictors=[col]
if power>=2:
predictors.extend([col+'_{}'.format(i) for i in range(2,power+1)])
spaced_value_feature=np.linspace(np.min(data[col]), np.max(data[col]), 500)
df_spaced=pd.DataFrame({col:spaced_value_feature})
for i in range(2,16): #power of 1 is already there
colname = col+'_{}'.format(i) #new var will be x_power
df_spaced[colname] = df_spaced[col]**i
data=data.sort_values(col)
#Fit the model
linreg = LinearRegression(normalize=True)
linreg.fit(data[predictors],data['price'])
y_pred = linreg.predict(data[predictors])
#Check if a plot is to be made for the entered power
if power in models_to_plot:
cols=cols+1
sampled=data.sample(n=500, random_state=1234)
spaced=linreg.predict(df_spaced[predictors])
fig.add_trace(go.Scatter(x=sampled[col], y=sampled['price'], mode="markers", name="data"), row=1, col=cols)
fig.add_trace(go.Scatter(x=df_spaced[col], y=spaced, mode="lines", name="estimate"), row=1, col=cols)
fig.update_yaxes(range=[np.min(sampled['price']), np.max(sampled['price'])])
fig.update_xaxes(range=[np.min(sampled[col]), np.max(sampled[col])])
#Return the result in pre-defined format
rss = sum((y_pred-data['price'])**2)
ret = [rss]
ret.extend([linreg.intercept_])
ret.extend(linreg.coef_)
return ret
df_copy=df_mapped.copy()
col_features = ['x', 'y', 'z']
for col_feature in col_features:
for i in range(2,16): #power of 1 is already there
colname = col_feature+'_{}'.format(i) #new var will be x_power
df_copy[colname] = df_copy[col_feature]**i
global cols
cols=0
#Initialize a dataframe to store the results:
col = ['rss','intercept'] + ['coef'+col_feature+'_{}'.format(i) for i in range(1,16)]
ind = ['model_pow_{}'.format(i) for i in range(1,16)]
coef_matrix_simple = pd.DataFrame(index=ind, columns=col)
#Define the powers for which a plot is required:
models_to_plot = {2:231,6:233,15:236}
fig = make_subplots(rows=1, cols=3, subplot_titles=("Power 2", "Power 6", "Power 15"))
#Iterate through all powers and assimilate results
for i in range(1,16):
coef_matrix_simple.iloc[i-1,0:i+2] =linear_regression(df_copy, power=i, models_to_plot=models_to_plot,col=col_feature, fig=fig)
fig.update_layout(height=600, width=800, title_text=f"Non Linear Regression for feature {col_feature}")
for i in range(1, 4):
fig['layout']['xaxis{}'.format(i)]['title']=col_feature
fig['layout']['yaxis']['title']="price"
fig.show()
display(coef_matrix_simple)
From the coefficient table we can see that the RSS is so big because of the large price values: an error on huge numbers makes the RSS a huge number too.
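To put these RSS values back on the price scale, one could divide by the number of rows and take the square root; a minimal, hypothetical sketch reusing df_copy and the coef_matrix_simple of the last fitted feature:
# Sketch: express the RSS of each polynomial model as an RMSE-like value in price units
n_rows = len(df_copy)
rmse_per_model = np.sqrt(coef_matrix_simple['rss'].astype('float64') / n_rows)
print(rmse_per_model)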
def ridge_regression(data, predictors, alpha,col,fig, models_to_plot={}):
global cols
spaced_value_feature=np.linspace(np.min(data[col]), np.max(data[col]), 500)
df_spaced=pd.DataFrame({col:spaced_value_feature})
for i in range(2,6): #power of 1 is already there
colname = col+'_{}'.format(i) #new var will be x_power
df_spaced[colname] = df_spaced[col]**i
ridgereg = Ridge(alpha=alpha,normalize=True)
ridgereg.fit(data[predictors],data['price'])
y_pred = ridgereg.predict(data[predictors])
#Check if a plot is to be made for the entered alpha
if alpha in models_to_plot:
cols=cols+1
sampled=data.sample(n=500, random_state=1234)
spaced=ridgereg.predict(df_spaced[predictors])
fig.add_trace(go.Scatter(x=sampled[col], y=sampled['price'], mode="markers", name="data"), row=1, col=cols)
fig.add_trace(go.Scatter(x=df_spaced[col], y=spaced, mode="lines", name="estimate"), row=1, col=cols)
fig.update_yaxes(range=[np.min(sampled['price']), np.max(sampled['price'])])
fig.update_xaxes(range=[np.min(sampled[col]), np.max(sampled[col])])
#Return the result in pre-defined format
rss = sum((y_pred-data['price'])**2)
ret = [rss]
ret.extend([ridgereg.intercept_])
ret.extend(ridgereg.coef_)
return ret
#Initialize the predictors to the first 5 powers of the feature
for col_feature in col_features:
global cols
cols=0
predictors=[col_feature]
predictors.extend([col_feature+'_{}'.format(i) for i in range(2,6)])
#Set the different values of alpha to be tested
alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20]
#Initialize the dataframe for storing coefficients.
col = ['rss','intercept'] + ['coef_'+col_feature+'_{}'.format(i) for i in range(1,6)]
ind = ['alpha_{}'.format(alpha_ridge[i]) for i in range(0,10)]
coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)
models_to_plot = {1e-15:231, 1e-10:232, 5:236}
fig = make_subplots(rows=1, cols=3, subplot_titles=("Alpha 1e-15", "Alpha 1e-10", "Alpha 5"))
for i in range(10):
coef_matrix_ridge.iloc[i,] = ridge_regression(df_copy, predictors, alpha_ridge[i],col_feature,fig, models_to_plot)
fig.update_layout(height=600, width=800, title_text=f"Ridge Regression for feature {col_feature}")
for i in range(1, 4):
fig['layout']['xaxis{}'.format(i)]['title']=col_feature
fig['layout']['yaxis']['title']="price"
fig.show()
display(coef_matrix_ridge)
def lasso_regression(data, predictors, alpha,col,fig, models_to_plot={}):
global cols
spaced_value_feature=np.linspace(np.min(data[col]), np.max(data[col]), 500)
df_spaced=pd.DataFrame({col:spaced_value_feature})
for i in range(2,16): #power of 1 is already there
colname = col+'_{}'.format(i) #new var will be x_power
df_spaced[colname] = df_spaced[col]**i
lassoreg = Lasso(alpha=alpha,normalize=True, max_iter=1e2)
lassoreg.fit(data[predictors],data['price'])
y_pred = lassoreg.predict(data[predictors])
#Check if a plot is to be made for the entered alpha
if alpha in models_to_plot:
cols=cols+1
sampled=data.sample(n=500, random_state=1234)
spaced=lassoreg.predict(df_spaced[predictors])
fig.add_trace(go.Scatter(x=sampled[col], y=sampled['price'], mode="markers", name="data"), row=1, col=cols)
fig.add_trace(go.Scatter(x=df_spaced[col], y=spaced, mode="lines", name="estimate"), row=1, col=cols)
fig.update_yaxes(range=[np.min(sampled['price']), np.max(sampled['price'])])
fig.update_xaxes(range=[np.min(sampled[col]), np.max(sampled[col])])
#Return the result in pre-defined format
rss = sum((y_pred-data['price'])**2)
ret = [rss]
ret.extend([lassoreg.intercept_])
ret.extend(lassoreg.coef_)
return ret
#Initialize the predictors to the first 15 powers of the feature
for col_feature in col_features:
global cols
cols=0
predictors=[col_feature]
predictors.extend([col_feature+'_{}'.format(i) for i in range(2,16)])
#Set the different values of alpha to be tested
alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20]
#Initialize the dataframe for storing coefficients.
col = ['rss','intercept'] + ['coef_'+col_feature+'_{}'.format(i) for i in range(1,16)]
ind = ['alpha_{}'.format(alpha_ridge[i]) for i in range(0,10)]
coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)
models_to_plot = {1e-15:231, 1e-10:232, 5:236}
fig = make_subplots(rows=1, cols=3, subplot_titles=("Alpha 1e-15", "Alpha 1e-10", "Alpha 5"))
for i in range(10):
coef_matrix_ridge.iloc[i,] = lasso_regression(df_copy, predictors, alpha_ridge[i],col_feature,fig, models_to_plot)
fig.update_layout(height=600, width=800, title_text=f"Lasso Regression for feature {col_feature}")
for i in range(1, 4):
fig['layout']['xaxis{}'.format(i)]['title']=col_feature
fig['layout']['yaxis']['title']="price"
fig.show()
display(coef_matrix_ridge)
def lasso_regression(data, predictors, alpha, models_to_plot={}):
#Fit the model
lassoreg = Lasso(alpha=alpha,normalize=True, max_iter=1e5)
lassoreg.fit(data[predictors],data['price'])
y_pred = lassoreg.predict(data[predictors])
#Check if a plot is to be made for the entered alpha
if alpha in models_to_plot:
plt.subplot(models_to_plot[alpha])
plt.tight_layout()
plt.plot(data[predictors],y_pred)
plt.plot(data[predictors],data['price'],'.')
plt.title('Plot for alpha: {}'.format(alpha))
#Return the result in pre-defined format
rss = sum((y_pred-data['price'])**2)
ret = [rss]
ret.extend([lassoreg.intercept_])
ret.extend(lassoreg.coef_)
return ret
As mentioned above, the most important variable seems to be carat. From the table, however, we can see that the feature whose coefficient is never driven to 0 is x, even though the correlation between price and carat is stronger than the correlation between price and x (0.92 versus 0.88, respectively).
#Initialize the predictors to all the features except carat
predictors=["cut", "table", "depth", "color", "clarity", "x", "y", "z"]
#Define the alpha values to test
alpha_lasso = [1e-15, 1e-10, 1e-8, 1e-5,1e-4, 1e-3,1e-2, 1, 5, 10]
#Initialize the dataframe to store coefficients
col = ['rss','intercept'] + ['coef_{}'.format(i) for i in predictors]
ind = ['alpha_{}'.format(alpha_lasso[i]) for i in range(0,10)]
coef_matrix_lasso = pd.DataFrame(index=ind, columns=col)
#Define the models to plot
models_to_plot = {}
#Iterate over the 10 alpha values:
for i in range(10):
coef_matrix_lasso.iloc[i,] = lasso_regression(df_mapped, predictors, alpha_lasso[i], models_to_plot)
display(coef_matrix_lasso)
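As a quick way to read this table, here is a minimal, hypothetical sketch that counts, for each alpha, how many coefficients Lasso keeps different from zero in coef_matrix_lasso:
# Sketch: number of non-zero Lasso coefficients per alpha (rss and intercept columns excluded)
coef_cols = [c for c in coef_matrix_lasso.columns if c.startswith('coef_')]
print((coef_matrix_lasso[coef_cols].astype('float64') != 0).sum(axis=1))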
In this section we give an overview of classification on our dataset.
The focus is on the categorical variables ('color', 'cut' and 'clarity').
We can start with a multiple plot created with Seaborn.
Here we can see, for every pair of features, how the 'color' classes are distributed.
plt.figure(figsize=(10,8), dpi= 80)
sns.pairplot(df, kind="scatter", hue="color", plot_kws=dict(s=80, edgecolor="white", linewidth=1.0))
plt.show()
In our project we compute the neighbours based on the Euclidean distance.
from math import sqrt

# Euclidean distance between two rows (the last element is the class label and is skipped)
def euclidean_distance(row1, row2):
distance = 0.0
for i in range(len(row1)-1):
distance += (row1[i] - row2[i])**2
return sqrt(distance)
# Locate the most similar neighbors
def get_neighbors(train, test_row, num_neighbors):
distances = list()
for train_row in train:
dist = euclidean_distance(test_row, train_row)
distances.append((train_row, dist))
distances.sort(key=lambda tup: tup[1])
neighbors = list()
for i in range(num_neighbors):
neighbors.append(distances[i][0])
return neighbors
# neighbors = get_neighbors(dataset, dataset[0], 3)
# for neighbor in neighbors:
# print(neighbor)
# Make a classification prediction with neighbors
def predict_classification(train, test_row, num_neighbors):
neighbors = get_neighbors(train, test_row, num_neighbors)
output_values = [row[-1] for row in neighbors]
prediction = max(set(output_values), key=output_values.count)
return prediction
# prediction = predict_classification(df, dataset[0], 3)
# print('Expected %d, Got %d.' % (df[0][-1], prediction))
def set_df_last_column(df, column_name):
df_col=df.drop(column_name, axis=1)
df_col[column_name]=df[column_name]
return df_col
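Below is a small, hypothetical usage example of the from-scratch KNN defined above, run on a 200-row sample of df_mapped with 'cut' moved to the last column (only a sketch; the actual classification below uses scikit-learn's KNeighborsClassifier):
# Sketch: predict the 'cut' of the first sampled row using the remaining rows as training data
sample_rows = set_df_last_column(df_mapped, "cut").sample(n=200, random_state=0).values
pred = predict_classification(sample_rows[1:], sample_rows[0], num_neighbors=3)
print('Expected %d, Got %d.' % (sample_rows[0][-1], pred))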
feature_clas_cols=["cut", "color", "clarity"]
for i in feature_clas_cols:
df_color=set_df_last_column(df_mapped, i)
X = df_color.iloc[:, :-1].values
y = df_color.iloc[:, len(df_mapped.columns)-1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
printmd(f"**Printing the confusion matrix for {i}**")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
error = []
# Calculating the error for K values between 1 and 19
for i in range(1, 20):
knn = KNeighborsClassifier(n_neighbors=i)
knn.fit(X_train, y_train)
pred_i = knn.predict(X_test)
error.append(np.mean(pred_i != y_test))
plt.figure(figsize=(12, 6))
plt.plot(range(1, 20), error, color='red', linestyle='dashed', marker='o',
markerfacecolor='blue', markersize=10)
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
plt.show()
# plt.savefig("knn_color.png", dpi=300, bbox_inches='tight')
We implemented a simple helper that, given a dataframe and a feature name, returns a new dataframe with the target column (specified as a parameter) moved to the last position.
This is done so that we can subsequently use the method getDatasetPatternsAndLabels().
As before, we split into training and validation sets.
We then take a sample of 1000 elements, using 'table' and 'x' as feature columns and 'color' as the target.
In this way we train 3 different classifiers (KNN, linear SVM and radial-basis-function SVM) and compare the final results.
We used k=5 for KNN and, for each classifier, a train/validation split with a 0.4 validation ratio.
def getDatasetPatternsAndLabels(df, featureCount):
patterns=[]
labels=df.iloc[:, -1:].values.reshape(1,-1)[0]
for i, row in df.iterrows():
currentRow=[]
for columnName in range(featureCount):
currentRow.append(row[df.columns[columnName]])
patterns.append(currentRow)
return pd.DataFrame(patterns), pd.DataFrame(labels)[0]
feature_count = 5
dfCut = set_df_last_column(df_mapped, "clarity")
dataset_patterns, dataset_labels = getDatasetPatternsAndLabels(dfCut, feature_count)
train_x, validation_x, train_y, validation_y = train_test_split(dataset_patterns, dataset_labels, test_size=0.25)
print('Shape training set:', train_x.shape)
print('Shape validation set:', validation_x.shape)
# separate array into input and output components
df_sampled=df_mapped.head(n=1000)
#split dataset in features and target variable
feature_cols = ['table', 'x']
X = df_sampled[feature_cols] # Features
y = df_sampled.color.astype('float64') # Target variable
scaler = StandardScaler().fit(X)
X = scaler.transform(X)
# summarize transformed data
np.set_printoptions(precision=3)
train_x, validation_x, train_y, validation_y = train_test_split(X, y, test_size=0.40)
print('Shape training set:', train_x.shape)
print('Shape validation set:', validation_x.shape)
# train_x=train_x.values
# validation_x=validation_x.values
train_y=train_y.values
validation_y=validation_y.values
# validation_x
# train_y
train_x
# Creating the classifiers
# Below is the procedure used to create the three classifiers we want to compare.
clf_1 = KNeighborsClassifier(n_neighbors=5)
clf_2 = SVC(kernel="linear", C=1)
clf_3 = SVC(kernel="rbf", C=1, gamma=0.1)
printmd("**KNN**")
# Training the classifier
clf_1.fit(train_x, train_y)
# Computing the accuracy
print("Training set accuracy with KNN: {}".format(clf_1.score(train_x, train_y)))
print("Validation set accuracy with KNN: {}".format(clf_1.score(validation_x, validation_y)))
printmd("**SVM linear**")
# Training the classifier
clf_2.fit(train_x, train_y)
# Computing the accuracy
print("Training set accuracy with linear-kernel SVM: {}".format(clf_2.score(train_x, train_y)))
print("Validation set accuracy with linear-kernel SVM: {}".format(clf_2.score(validation_x, validation_y)))
printmd("**SVM rbf**")
# Training the classifier
clf_3.fit(train_x, train_y)
# Computing the accuracy
print("Training set accuracy with rbf-kernel SVM: {}".format(clf_3.score(train_x, train_y)))
print("Validation set accuracy with rbf-kernel SVM: {}".format(clf_3.score(validation_x, validation_y)))
# 2D visualization
# For some reason this plot is not working
# ml_visualization.show_2D_results(clf, (train_x, train_y, 'Training'), (validation_x, validation_y, 'Validation'))
# ml_visualization.show_2D_results(clf_2, (train_x, train_y, 'Training'), (validation_x, validation_y, 'Validation'))
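Since ml_visualization.show_2D_results is not working here, below is a minimal, hypothetical matplotlib sketch of the decision regions of clf_3 (the rbf SVM), assuming the two scaled features ('table', 'x') and the splits defined above:
# Sketch: decision regions of the rbf SVM over the two scaled features
x_min, x_max = train_x[:, 0].min() - 1, train_x[:, 0].max() + 1
y_min, y_max = train_x[:, 1].min() - 1, train_x[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))
zz = clf_3.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, zz, alpha=0.3, cmap='viridis')
plt.scatter(train_x[:, 0], train_x[:, 1], c=train_y, s=15, cmap='viridis', edgecolor='k')
plt.xlabel('table (scaled)')
plt.ylabel('x (scaled)')
plt.title('Decision regions of the rbf SVM (sketch)')
plt.show()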
Now we would like to sum up with a triplet of plots representing the mean price trend in relation to clarity, cut and color. With these, a person could estimate whether a diamond with certain features is worth buying.
def display_price(df, age = (1,7), price = (100,100000), vehicle_type = "all", state = "all"):
# Display the average price of the diamonds for the column passed as 'vehicle_type', restricted to the category range 'age' and the price range 'price' (the 'state' parameter is unused).
df = df[(df[vehicle_type] <= age[1]) & (df[vehicle_type] >= age[0])]
df = df[(df["price"] >= price[0]) & (df["price"] <= price[1])]
price_age = pd.pivot_table(df, values = "price", index = vehicle_type, aggfunc= np.average)
price_age.columns = ["Average Price"]
fig = plt.figure(figsize=(12,6))
ax = fig.add_axes([0,0,1,1])
ax2 = fig.add_axes([0.6,0.47,.35,.35])
ax.plot(price_age["Average Price"], lw = 5)
ax2.set_title(f"Column type: {vehicle_type}\nNumber of diamonds: {df.shape[0]}", fontsize = 15)
ax2.axis('off')
ax.set_title(f"Average price by {vehicle_type} of the diamonds",fontsize=25)
ax.set_ylim(0,price_age["Average Price"].max()+1000)
ax.set_xlabel(vehicle_type, fontsize = 15)
ax.set_ylabel("Average price in $", fontsize = 15)
ax.tick_params(axis='both', which='major', labelsize=15)
# plt.savefig(f"medians{vehicle_type}.png",dpi=300, bbox_inches='tight')
display_price(df_mapped, vehicle_type="color")
display_price(df_mapped,age=(1, 5), vehicle_type="cut")
display_price(df_mapped,age=(3, 10), vehicle_type="clarity")