In this notebook we analyse the diamonds dataset, a collection of almost 54,000 diamonds.
To carry out interesting analyses and predictions on this dataset, we chose the column named "price" as the target variable. This lets us predict it with several different methods and compare the results obtained.
from IPython.display import HTML
HTML('''<script>
code_show=true;
function code_toggle() {
if (code_show){
$('div.input').hide();
} else {
$('div.input').show();
}
code_show = !code_show
}
$( document ).ready(code_toggle);
</script>
<form action="javascript:code_toggle()"><input type="submit" value="Click here to toggle on/off the raw code."></form>''')
### Keeping safe from any disaster :) ###
%autosave 30
Image source: Diamond Education Learning
All the basic information about the diamond cut is given above. The side view of a diamond shows the terminology and the meaning of its features. Thus table length, table width, diamond depth, depth%, and table% are very important features for determining the price of a diamond.
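As a quick, hypothetical sanity check (not part of the original analysis), the depth column should roughly match the usual definition depth% = 2·z / (x + y) · 100. Below is a minimal sketch, loading diamonds.csv on its own since the dataframe is only created later in the notebook:
# Sketch: verify that depth is approximately 2*z/(x+y)*100 (standard definition for this dataset)
import pandas as pd
df_check = pd.read_csv("diamonds.csv")
recomputed_depth = 2 * df_check["z"] / (df_check["x"] + df_check["y"]) * 100
print("Median absolute difference:", (df_check["depth"] - recomputed_depth).abs().median())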
For these analyses we used some of the libraries covered during the course, plus a few others used following their documentation. Below is the list:
ml_visualization
seaborn
### Imports ###
import numpy as np
import pandas as pd
import plotly.express as px
import plotly.graph_objects as go
from plotly.subplots import make_subplots
import plotly
import plotly.offline as py
import matplotlib.pyplot as plt
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split
from sklearn.dummy import DummyClassifier
from sklearn.svm import SVC
from sklearn.linear_model import LinearRegression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import Lasso
from sklearn.linear_model import Ridge
from IPython.display import Markdown, display
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
import seaborn as sns
import scipy.stats as st
import ml_utilities
import ml_visualization
### Reading .csv dataset ###
dfName = "diamonds.csv"
df = pd.read_csv(dfName)
### Describing the df characteristics ###
df.describe()
### How many non-null fields are there? ###
df.info()
df.head()
### Dropping unnecessary column. Additional index ###
df.drop('Unnamed: 0', axis=1, inplace=True)
df_mapped=df.copy()
### Remapping all necessary columns ###
df_mapped['color'] = df['color'].replace(['J','I','H','G','F','E','D'],[1,2,3,4,5,6,7])
df_mapped['cut'] = df['cut'].replace(['Fair','Good','Very Good','Premium','Ideal'],[1,2,3,4,5])
df_mapped['clarity'] = df['clarity'].replace(['I3','I2','I1','SI2','SI1','VS2','VS1','VVS2','VVS1','IF'],[1,2,3,4,5,6,7,8,9,10])
# Function to print text as Markdown
def printmd(string):
display(Markdown(string))
# # Scatter plot of the carat and price
# fig=make_subplots(rows=1, cols=1)
# fig.add_trace(go.Scatter(x=df_mapped["carat"], y=df_mapped["price"], mode="markers", name="Diamonds price basing on carats"))
# fig.update_layout(
# title="Diamonds price basing on carats",
# xaxis_title="Carat",
# yaxis_title="Price",
# legend_title="Legend Title",
# )
plt.plot(df_mapped["carat"], df_mapped["price"], 'o', color='blue');
In this plot we can see an alternative and colourful representation of how each feature influences the price.
It is useful to look at the legend on the right-hand side, which explains how to interpret each element of the plot.
# showing how the features could influence the price
plt.figure(figsize=(16,8))
sns.scatterplot(data=df, x="carat", y="price", hue="color",size="cut", style="clarity", sizes=(10,250))
plt.title("Influence of attributes over price")
plt.show()
In the following set of charts we consider how the categorical attributes influence the price variable. They give an alternative view of how the price varies for each categorical variable. We expected the price quantiles to increase as the quality of the categorical variables increases, and indeed, despite the numerous outliers, this pattern is easily identifiable in the boxplots.
categoricalAttributes=["clarity", "cut", "color"]
targetVariable="price"
for attribute in categoricalAttributes:
fig = px.box(data_frame=df_mapped,orientation="h",y=attribute,x="price",title=f"{targetVariable} related to {attribute}")
fig.show()
This set of plots shows the distribution of each column. Clarity, color and cut take values in a discrete space due to their categorical nature, while the remaining columns, including price, are distributed over a continuous range.
df_mapped.hist(bins=50, figsize=(20,15))
plt.suptitle("Features distributions")
As said before, our analyses focus on the price target variable, since it says a lot about the diamond. Let's do a pre-analysis to see, from the correlation values, which features are most closely correlated with price.
### calculation of the correlation coefficient between price and all other fields ###
corr_matrix = df_mapped.corr()
print(corr_matrix["price"].sort_values(ascending = False))
To keep the computation time reasonable, we use a sample of the dataset when drawing the plots of each linear regression.
rows=1
cols=0
#split dataset in features and target variable
feature_cols = ['carat', 'table', 'x', 'y', 'z', 'depth']
# feature_cols = ['depth']
valid_portion=0.30
fig = make_subplots(rows=2, cols=3, subplot_titles=("Linear Regression for Carat feature", "Linear Regression for Table feature","Linear Regression for X feature",
"Linear Regression for Y feature", "Linear Regression for Z feature", "Linear Regression for Depth feature"))
fig['layout']['xaxis1']['title']=feature_cols[0]
fig['layout']['xaxis2']['title']=feature_cols[1]
fig['layout']['xaxis3']['title']=feature_cols[2]
fig['layout']['xaxis4']['title']=feature_cols[3]
fig['layout']['xaxis5']['title']=feature_cols[4]
fig['layout']['xaxis6']['title']=feature_cols[5]
for i in range(1, 7):
fig['layout']['yaxis{}'.format(i)]['title']="price"
for i in feature_cols:
X = df_mapped[i] # Features
y = df_mapped.price.astype('float64') # Target variable
train_x, validation_x, train_y, validation_y = train_test_split(X, y, test_size=valid_portion, random_state=1)
train_x=train_x.values.reshape(-1, 1)
validation_x=validation_x.values.reshape(-1, 1)
train_y=train_y.values.reshape(-1, 1)
validation_y=validation_y.values.reshape(-1, 1)
# Training a LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(train_x, train_y)
# Obtaining the predictions
train_y_predicted = lin_reg.predict(train_x)
# Computing the RMSE
rmse = np.sqrt(mean_squared_error(train_y, train_y_predicted))
printmd(f"**Data Information on {i}**")
print('Train RMSE: ', rmse)
# Obtaining the predictions (validation) and computing the RMSE
validation_y_predicted = lin_reg.predict(validation_x)
rmse = np.sqrt(mean_squared_error(validation_y, validation_y_predicted))
print('Validation RMSE: ', rmse)
print('R2 score:', lin_reg.score(validation_x, validation_y))
w=lin_reg.coef_[0][0]
b=lin_reg.intercept_
y_hat = b +w*X
x_line = np.array([np.min(X), np.max(X)]) # Try changing the endpoints to extrapolate the line
y_line = b + w*x_line
if cols==3:
cols=0
rows=2
cols=cols+1
fig.add_trace(go.Scatter(x=pd.Series(X).sample(n=500, random_state=132), y=pd.Series(y).sample(n=500, random_state=132), mode="markers", name="data"), row=rows, col=cols)
fig.add_trace(go.Scatter(x=pd.Series(X).sample(n=500, random_state=132), y=pd.Series(y_hat).sample(n=500, random_state=132), mode="markers", name="estimate"),row=rows, col=cols)
fig.add_trace(go.Scatter(x=pd.Series(X).sample(n=500, random_state=132), y=pd.Series(y_hat).sample(n=500, random_state=132), mode="lines", name="regression line"),row=rows, col=cols)
# # Draw the vertical lines corresponding to the errors (residuals)
# for i in range(len(X)):
# data.append(go.Scatter(x=[X[i], X[i]], y=[y[i], y_hat[i]], mode="lines",
# showlegend=False, line=dict(color="gray", width=0.5)),)
fig.show()
# Correlation heatmap of all features, including the price column
fig=plt.figure(figsize=(15,8))
sns.heatmap(df_mapped.corr(), linewidths=3, annot=True)
plt.title("Correlation matrix", size=20)
#split dataset in features and target variable
feature_cols = ['clarity', 'carat', 'cut', 'color','depth','table','x', 'y', 'z']
X = df_mapped[feature_cols] # Features
y = df_mapped.price.astype('float64') # Target variable
# split X and y into training and testing sets
valid_portion = 0.5
train_x, validation_x, train_y, validation_y = train_test_split(X, y, test_size=valid_portion, random_state=1)
We can see that with RandomForest both the RMSE and the R2 look very good. But, as we will see later on, to check how reliable these numbers are we need to do a K-Fold cross-validation.
# Training a RandomForestRegressor
rndfor_reg = RandomForestRegressor()
rndfor_reg.fit(train_x, train_y)
# Obtaining the predictions
train_y_predicted = rndfor_reg.predict(train_x)
# Computing the RMSE
rmse = np.sqrt(mean_squared_error(train_y, train_y_predicted))
printmd("**Regressor Information**")
print('Train RMSE: ', rmse)
# Obtaining the predictions (validation) and computing the RMSE
validation_y_predicted = rndfor_reg.predict(validation_x)
rmse = np.sqrt(mean_squared_error(validation_y, validation_y_predicted))
print('Validation RMSE: ', rmse)
print('R2 score:', rndfor_reg.score(validation_x, validation_y))
printmd("**Information about feature importance according to the Regressor**")
for i, j in zip(feature_cols, rndfor_reg.feature_importances_):
print(i, j)
Compared to the RandomForest regressor, we can see that the linear one only makes things worse, both in terms of score and RMSE. This is to be expected.
# Training a LinearRegression
lin_reg = LinearRegression()
lin_reg.fit(train_x, train_y)
# Obtaining the predictions
train_y_predicted = lin_reg.predict(train_x)
# Computing the RMSE
rmse = np.sqrt(mean_squared_error(train_y, train_y_predicted))
print('Train RMSE: ', rmse)
# Obtaining the predictions (validation) and computing the RMSE
validation_y_predicted = lin_reg.predict(validation_x)
rmse = np.sqrt(mean_squared_error(validation_y, validation_y_predicted))
print('Validation RMSE: ', rmse)
print('R2 score:', lin_reg.score(validation_x, validation_y))
By doing K-Fold cross-validation we can see that the values we previously obtained with the RandomForestRegressor were overly optimistic. The cross-validated RMSE is about 1770.62.
# Hyperparameter optimization with cross-validation
reg = RandomForestRegressor()
param_grid = [{'n_estimators': [10, 50, 100], 'max_depth': [None, 2, 3], 'random_state': [1234]}]
# With shuffle=False the folds are temporally ordered
# (if the patterns have not been shuffled beforehand)
experiment_gscv = GridSearchCV(reg, param_grid, \
cv=KFold(n_splits=4), \
scoring='neg_mean_squared_error')
experiment_gscv.fit(X, y)
# Print results
printmd('**Parameter combinations:**')
print(experiment_gscv.cv_results_['params'])
printmd('**Mean RMSE per combination:**')
print( np.sqrt(-experiment_gscv.cv_results_['mean_test_score']))
printmd('**Best combination:**')
print(experiment_gscv.best_params_)
printmd("**Mean RMSE of the best combination:**")
print('Mean RMSE of the best combination: %.3f' % np.sqrt(-experiment_gscv.best_score_))
The basic idea is to remove the outliers from the validation set and see whether they actually worsen the R2 (although the R2 obtained so far is very good). But first...
Q1 = df_mapped.quantile(0.25)
Q3 = df_mapped.quantile(0.75)
IQR=Q3-Q1
df_clean=df_mapped[~((df_mapped<(Q1-1.5*IQR))|(df_mapped>(Q3+1.5*IQR))).any(axis=1)]
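As a small, hypothetical check (not in the original notebook), we can count how many rows the IQR rule above removes:
# Sketch: number of rows flagged as outliers by the IQR rule
removed_rows = len(df_mapped) - len(df_clean)
print(f"Rows removed by the IQR filter: {removed_rows} ({removed_rows / len(df_mapped):.1%} of the dataset)")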
cols = df_mapped.columns
all_z_scores=pd.DataFrame()
for col in cols:
col_zscore = col + '_zscore'
all_z_scores[col_zscore] = (df_mapped[col] - df_mapped[col].mean())/df_mapped[col].std(ddof=0)
all_outliers={}
arr=[]
for col in all_z_scores.columns:
all_outliers[col+"_outlier"]=np.sum(all_z_scores[col]>3)+np.sum(all_z_scores[col]<-3)
outliers=pd.DataFrame(all_outliers, index=[0])
outliers=outliers.sort_values(by=0, axis=1, ascending=True)
px.histogram(y=outliers.columns, x=outliers.loc[0], title="Number of outliers by every column in the dataframe",labels={ # replaces default labels by column name
"sum of x": "Number of outliers", "y": "Column"
})
feature_cols = ['clarity', 'carat', 'cut', 'color','depth','table','x', 'y', 'z']
X_outliers = df_clean[feature_cols] # Features
y_outliers = df_clean.price.astype('float64') # Target variable
# split X and y into training and testing sets
valid_portion = 0.5
# Split random to train, test set
train_x_outliers, validation_x_outliers, train_y_outliers, validation_y_outliers = train_test_split(X_outliers, y_outliers, test_size=valid_portion, random_state=1)
print('R2 score:', rndfor_reg.score(validation_x_outliers, validation_y_outliers)) # the R2 score improves only marginally
We realised that some variables have a correlation with the price that is not strictly linear.
In fact, we can see from these scatter plots that the variables x, y, z and depth do not follow a linear trend.
allGraphs=[]
fig, ax = plt.subplots(1, 4, figsize=(70, 20))
ax[0].scatter(x = df_mapped["depth"], y = df_mapped["price"])
ax[0].set_xlabel("Depth")
ax[0].set_ylabel("Price")
ax[0].set_title('Correlation between the depth and the price')
ax[1].scatter(x =df_mapped["x"], y=df_mapped["price"])
ax[1].set_xlabel("X")
ax[1].set_ylabel("Price")
ax[1].set_title('Correlation between the x and the price')
ax[2].scatter(x = df_mapped["y"], y = df_mapped["price"])
ax[2].set_xlabel("Y")
ax[2].set_ylabel("Price")
ax[2].set_title('Correlation between the y and the price')
ax[3].scatter(y =df_mapped["price"], x=df_mapped["z"])
ax[3].set_xlabel("Z")
ax[3].set_ylabel("Price")
ax[3].set_title('Correlation between the z and the price')
plt.show()
cols=0
def linear_regression(data, power, models_to_plot, col, fig):
#initialize predictors:
global cols
predictors=[col]
if power>=2:
predictors.extend([col+'_{}'.format(i) for i in range(2,power+1)])
spaced_value_feature=np.linspace(np.min(data[col]), np.max(data[col]), 500)
df_spaced=pd.DataFrame({col:spaced_value_feature})
for i in range(2,16): #power of 1 is already there
colname = col+'_{}'.format(i) #new var will be x_power
df_spaced[colname] = df_spaced[col]**i
data=data.sort_values(col)
#Fit the model
linreg = LinearRegression(normalize=True)
linreg.fit(data[predictors],data['price'])
y_pred = linreg.predict(data[predictors])
#Check if a plot is to be made for the entered power
if power in models_to_plot:
cols=cols+1
sampled=data.sample(n=500, random_state=1234)
spaced=linreg.predict(df_spaced[predictors])
fig.add_trace(go.Scatter(x=sampled[col], y=sampled['price'], mode="markers", name="data"), row=1, col=cols)
fig.add_trace(go.Scatter(x=df_spaced[col], y=spaced, mode="lines", name="estimate"), row=1, col=cols)
fig.update_yaxes(range=[np.min(sampled['price']), np.max(sampled['price'])])
fig.update_xaxes(range=[np.min(sampled[col]), np.max(sampled[col])])
#Return the result in pre-defined format
rss = sum((y_pred-data['price'])**2)
ret = [rss]
ret.extend([linreg.intercept_])
ret.extend(linreg.coef_)
return ret
df_copy=df_mapped.copy()
col_features = ['x', 'y', 'z']
for col_feature in col_features:
for i in range(2,16): #power of 1 is already there
colname = col_feature+'_{}'.format(i) #new var will be x_power
df_copy[colname] = df_copy[col_feature]**i
global cols
cols=0
#Initialize a dataframe to store the results:
col = ['rss','intercept'] + ['coef'+col_feature+'_{}'.format(i) for i in range(1,16)]
ind = ['model_pow_{}'.format(i) for i in range(1,16)]
coef_matrix_simple = pd.DataFrame(index=ind, columns=col)
#Define the powers for which a plot is required:
models_to_plot = {2:231,6:233,15:236}
fig = make_subplots(rows=1, cols=3, subplot_titles=("Power 2", "Power 6", "Power 15"))
#Iterate through all powers and assimilate results
for i in range(1,16):
coef_matrix_simple.iloc[i-1,0:i+2] =linear_regression(df_copy, power=i, models_to_plot=models_to_plot,col=col_feature, fig=fig)
fig.update_layout(height=600, width=800, title_text=f"Non Linear Regression for feature {col_feature}")
for i in range(1, 4):
fig['layout']['xaxis{}'.format(i)]['title']=col_feature
fig['layout']['yaxis']['title']="price"
fig.show()
display(coef_matrix_simple)
From the coefficient table we can see that the RSS is so big because of the large price values: an error on huge numbers makes the RSS a huge number too.
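To put these RSS values back on the price scale, one could divide by the number of rows and take the square root; a minimal, hypothetical sketch reusing df_copy and the coef_matrix_simple of the last fitted feature:
# Sketch: express the RSS of each polynomial model as an RMSE-like value in price units
n_rows = len(df_copy)
rmse_per_model = np.sqrt(coef_matrix_simple['rss'].astype('float64') / n_rows)
print(rmse_per_model)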
def ridge_regression(data, predictors, alpha,col,fig, models_to_plot={}):
global cols
spaced_value_feature=np.linspace(np.min(data[col]), np.max(data[col]), 500)
df_spaced=pd.DataFrame({col:spaced_value_feature})
for i in range(2,6): #power of 1 is already there
colname = col+'_{}'.format(i) #new var will be x_power
df_spaced[colname] = df_spaced[col]**i
ridgereg = Ridge(alpha=alpha,normalize=True)
ridgereg.fit(data[predictors],data['price'])
y_pred = ridgereg.predict(data[predictors])
#Check if a plot is to be made for the entered alpha
if alpha in models_to_plot:
cols=cols+1
sampled=data.sample(n=500, random_state=1234)
spaced=ridgereg.predict(df_spaced[predictors])
fig.add_trace(go.Scatter(x=sampled[col], y=sampled['price'], mode="markers", name="data"), row=1, col=cols)
fig.add_trace(go.Scatter(x=df_spaced[col], y=spaced, mode="lines", name="estimate"), row=1, col=cols)
fig.update_yaxes(range=[np.min(sampled['price']), np.max(sampled['price'])])
fig.update_xaxes(range=[np.min(sampled[col]), np.max(sampled[col])])
#Return the result in pre-defined format
rss = sum((y_pred-data['price'])**2)
ret = [rss]
ret.extend([ridgereg.intercept_])
ret.extend(ridgereg.coef_)
return ret
#Initialize the predictors to the first 5 powers of the feature
for col_feature in col_features:
global cols
cols=0
predictors=[col_feature]
predictors.extend([col_feature+'_{}'.format(i) for i in range(2,6)])
#Set the different values of alpha to be tested
alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20]
#Initialize the dataframe for storing coefficients.
col = ['rss','intercept'] + ['coef_'+col_feature+'_{}'.format(i) for i in range(1,6)]
ind = ['alpha_{}'.format(alpha_ridge[i]) for i in range(0,10)]
coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)
models_to_plot = {1e-15:231, 1e-10:232, 5:236}
fig = make_subplots(rows=1, cols=3, subplot_titles=("Alpha 1e-15", "Alpha 1e-10", "Alpha 5"))
for i in range(10):
coef_matrix_ridge.iloc[i,] = ridge_regression(df_copy, predictors, alpha_ridge[i],col_feature,fig, models_to_plot)
fig.update_layout(height=600, width=800, title_text=f"Ridge Regression for feature {col_feature}")
for i in range(1, 4):
fig['layout']['xaxis{}'.format(i)]['title']=col_feature
fig['layout']['yaxis']['title']="price"
fig.show()
display(coef_matrix_ridge)
def lasso_regression(data, predictors, alpha,col,fig, models_to_plot={}):
global cols
spaced_value_feature=np.linspace(np.min(data[col]), np.max(data[col]), 500)
df_spaced=pd.DataFrame({col:spaced_value_feature})
for i in range(2,16): #power of 1 is already there
colname = col+'_{}'.format(i) #new var will be x_power
df_spaced[colname] = df_spaced[col]**i
lassoreg = Lasso(alpha=alpha,normalize=True, max_iter=1e2)
lassoreg.fit(data[predictors],data['price'])
y_pred = lassoreg.predict(data[predictors])
#Check if a plot is to be made for the entered alpha
if alpha in models_to_plot:
cols=cols+1
sampled=data.sample(n=500, random_state=1234)
spaced=lassoreg.predict(df_spaced[predictors])
fig.add_trace(go.Scatter(x=sampled[col], y=sampled['price'], mode="markers", name="data"), row=1, col=cols)
fig.add_trace(go.Scatter(x=df_spaced[col], y=spaced, mode="lines", name="estimate"), row=1, col=cols)
fig.update_yaxes(range=[np.min(sampled['price']), np.max(sampled['price'])])
fig.update_xaxes(range=[np.min(sampled[col]), np.max(sampled[col])])
#Return the result in pre-defined format
rss = sum((y_pred-data['price'])**2)
ret = [rss]
ret.extend([lassoreg.intercept_])
ret.extend(lassoreg.coef_)
return ret
#Initialize the predictors to the first 15 powers of the feature
for col_feature in col_features:
global cols
cols=0
predictors=[col_feature]
predictors.extend([col_feature+'_{}'.format(i) for i in range(2,16)])
#Set the different values of alpha to be tested
alpha_ridge = [1e-15, 1e-10, 1e-8, 1e-4, 1e-3,1e-2, 1, 5, 10, 20]
#Initialize the dataframe for storing coefficients.
col = ['rss','intercept'] + ['coef_'+col_feature+'_{}'.format(i) for i in range(1,16)]
ind = ['alpha_{}'.format(alpha_ridge[i]) for i in range(0,10)]
coef_matrix_ridge = pd.DataFrame(index=ind, columns=col)
models_to_plot = {1e-15:231, 1e-10:232, 5:236}
fig = make_subplots(rows=1, cols=3, subplot_titles=("Alpha 1e-15", "Alpha 1e-10", "Alpha 5"))
for i in range(10):
coef_matrix_ridge.iloc[i,] = lasso_regression(df_copy, predictors, alpha_ridge[i],col_feature,fig, models_to_plot)
fig.update_layout(height=600, width=800, title_text=f"Lasso Regression for feature {col_feature}")
for i in range(1, 4):
fig['layout']['xaxis{}'.format(i)]['title']=col_feature
fig['layout']['yaxis']['title']="price"
fig.show()
display(coef_matrix_ridge)
def lasso_regression(data, predictors, alpha, models_to_plot={}):
#Fit the model
lassoreg = Lasso(alpha=alpha,normalize=True, max_iter=1e5)
lassoreg.fit(data[predictors],data['price'])
y_pred = lassoreg.predict(data[predictors])
#Check if a plot is to be made for the entered alpha
if alpha in models_to_plot:
plt.subplot(models_to_plot[alpha])
plt.tight_layout()
plt.plot(data[predictors],y_pred)
plt.plot(data[predictors],data['price'],'.')
plt.title('Plot for alpha: {}'.format(alpha))
#Return the result in pre-defined format
rss = sum((y_pred-data['price'])**2)
ret = [rss]
ret.extend([lassoreg.intercept_])
ret.extend(lassoreg.coef_)
return ret
As mentioned above, the most important variable seems to be carat. From the table, however, we can see that the feature whose coefficient is never driven to 0 is x, even though the correlation between price and carat is stronger than the correlation between price and x (0.92 versus 0.88, respectively).
#Initialize the predictors to all the features except carat
predictors=["cut", "table", "depth", "color", "clarity", "x", "y", "z"]
#Define the alpha values to test
alpha_lasso = [1e-15, 1e-10, 1e-8, 1e-5,1e-4, 1e-3,1e-2, 1, 5, 10]
#Initialize the dataframe to store coefficients
col = ['rss','intercept'] + ['coef_{}'.format(i) for i in predictors]
ind = ['alpha_{}'.format(alpha_lasso[i]) for i in range(0,10)]
coef_matrix_lasso = pd.DataFrame(index=ind, columns=col)
#Define the models to plot
models_to_plot = {}
#Iterate over the 10 alpha values:
for i in range(10):
coef_matrix_lasso.iloc[i,] = lasso_regression(df_mapped, predictors, alpha_lasso[i], models_to_plot)
display(coef_matrix_lasso)
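As a quick way to read this table, here is a minimal, hypothetical sketch that counts, for each alpha, how many coefficients Lasso keeps different from zero in coef_matrix_lasso:
# Sketch: number of non-zero Lasso coefficients per alpha (rss and intercept columns excluded)
coef_cols = [c for c in coef_matrix_lasso.columns if c.startswith('coef_')]
print((coef_matrix_lasso[coef_cols].astype('float64') != 0).sum(axis=1))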
In this section we give an overview of classification on our dataset.
The focus is on the categorical variables ('color', 'cut' and 'clarity').
We can start with a multiple plot created with Seaborn.
Here we can see, for every pair of features, how the 'color' classes are distributed.
plt.figure(figsize=(10,8), dpi= 80)
sns.pairplot(df, kind="scatter", hue="color", plot_kws=dict(s=80, edgecolor="white", linewidth=1.0))
plt.show()
In our project we compute the neighbours based on the Euclidean distance.
from math import sqrt

# Euclidean distance between two rows (the last element is the class label and is skipped)
def euclidean_distance(row1, row2):
distance = 0.0
for i in range(len(row1)-1):
distance += (row1[i] - row2[i])**2
return sqrt(distance)
# Locate the most similar neighbors
def get_neighbors(train, test_row, num_neighbors):
distances = list()
for train_row in train:
dist = euclidean_distance(test_row, train_row)
distances.append((train_row, dist))
distances.sort(key=lambda tup: tup[1])
neighbors = list()
for i in range(num_neighbors):
neighbors.append(distances[i][0])
return neighbors
# neighbors = get_neighbors(dataset, dataset[0], 3)
# for neighbor in neighbors:
# print(neighbor)
# Make a classification prediction with neighbors
def predict_classification(train, test_row, num_neighbors):
neighbors = get_neighbors(train, test_row, num_neighbors)
output_values = [row[-1] for row in neighbors]
prediction = max(set(output_values), key=output_values.count)
return prediction
# prediction = predict_classification(df, dataset[0], 3)
# print('Expected %d, Got %d.' % (df[0][-1], prediction))
def set_df_last_column(df, column_name):
df_col=df.drop(column_name, axis=1)
df_col[column_name]=df[column_name]
return df_col
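Below is a small, hypothetical usage example of the from-scratch KNN defined above, run on a 200-row sample of df_mapped with 'cut' moved to the last column (only a sketch; the actual classification below uses scikit-learn's KNeighborsClassifier):
# Sketch: predict the 'cut' of the first sampled row using the remaining rows as training data
sample_rows = set_df_last_column(df_mapped, "cut").sample(n=200, random_state=0).values
pred = predict_classification(sample_rows[1:], sample_rows[0], num_neighbors=3)
print('Expected %d, Got %d.' % (sample_rows[0][-1], pred))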
feature_clas_cols=["cut", "color", "clarity"]
for i in feature_clas_cols:
df_color=set_df_last_column(df_mapped, i)
X = df_color.iloc[:, :-1].values
y = df_color.iloc[:, len(df_mapped.columns)-1].values
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.20)
scaler = StandardScaler()
scaler.fit(X_train)
X_train = scaler.transform(X_train)
X_test = scaler.transform(X_test)
classifier = KNeighborsClassifier(n_neighbors=5)
classifier.fit(X_train, y_train)
y_pred = classifier.predict(X_test)
printmd(f"**Printing the confusion matrix for {i}**")
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
error = []
# Calculating the error for K values between 1 and 19
for i in range(1, 20):
knn = KNeighborsClassifier(n_neighbors=i)
knn.fit(X_train, y_train)
pred_i = knn.predict(X_test)
error.append(np.mean(pred_i != y_test))
plt.figure(figsize=(12, 6))
plt.plot(range(1, 20), error, color='red', linestyle='dashed', marker='o',
markerfacecolor='blue', markersize=10)
plt.title('Error Rate K Value')
plt.xlabel('K Value')
plt.ylabel('Mean Error')
plt.show()
# plt.savefig("knn_color.png", dpi=300, bbox_inches='tight')
We implemented a simple helper that, given a dataframe and a feature name, returns a new dataframe with the target column (specified as a parameter) moved to the last position.
This is done so that we can subsequently use the method getDatasetPatternsAndLabels().
As before, we split into training and validation sets.
We then take a sample of 1000 elements, using 'table' and 'x' as feature columns and 'color' as the target.
In this way we train 3 different classifiers (KNN, linear SVM and radial-basis-function SVM) and compare the final results.
We used k=5 for KNN and, for each classifier, a train/validation split with a 0.4 validation ratio.
def getDatasetPatternsAndLabels(df, featureCount):
patterns=[]
labels=df.iloc[:, -1:].values.reshape(1,-1)[0]
for i, row in df.iterrows():
currentRow=[]
for columnName in range(featureCount):
currentRow.append(row[df.columns[columnName]])
patterns.append(currentRow)
return pd.DataFrame(patterns), pd.DataFrame(labels)[0]
feature_count = 5
dfCut = set_df_last_column(df_mapped, "clarity")
dataset_patterns, dataset_labels = getDatasetPatternsAndLabels(dfCut, feature_count)
train_x, validation_x, train_y, validation_y = train_test_split(dataset_patterns, dataset_labels, test_size=0.25)
print('Shape training set:', train_x.shape)
print('Shape validation set:', validation_x.shape)
# separate array into input and output components
df_sampled=df_mapped.head(n=1000)
#split dataset in features and target variable
feature_cols = ['table', 'x']
X = df_sampled[feature_cols] # Features
y = df_sampled.color.astype('float64') # Target variable
scaler = StandardScaler().fit(X)
X = scaler.transform(X)
# summarize transformed data
np.set_printoptions(precision=3)
train_x, validation_x, train_y, validation_y = train_test_split(X, y, test_size=0.40)
print('Shape training set:', train_x.shape)
print('Shape validation set:', validation_x.shape)
# train_x=train_x.values
# validation_x=validation_x.values
train_y=train_y.values
validation_y=validation_y.values
# validation_x
# train_y
train_x
# Creating the classifiers
# Below is the procedure used to create the three classifiers we want to compare.
clf_1 = KNeighborsClassifier(n_neighbors=5)
clf_2 = SVC(kernel="linear", C=1)
clf_3 = SVC(kernel="rbf", C=1, gamma=0.1)
printmd("**KNN**")
# Training the classifier
clf_1.fit(train_x, train_y)
# Computing the accuracy
print("Training set accuracy with KNN: {}".format(clf_1.score(train_x, train_y)))
print("Validation set accuracy with KNN: {}".format(clf_1.score(validation_x, validation_y)))
printmd("**SVM linear**")
# Training the classifier
clf_2.fit(train_x, train_y)
# Computing the accuracy
print("Training set accuracy with linear-kernel SVM: {}".format(clf_2.score(train_x, train_y)))
print("Validation set accuracy with linear-kernel SVM: {}".format(clf_2.score(validation_x, validation_y)))
printmd("**SVM rbf**")
# Training the classifier
clf_3.fit(train_x, train_y)
# Computing the accuracy
print("Training set accuracy with rbf-kernel SVM: {}".format(clf_3.score(train_x, train_y)))
print("Validation set accuracy with rbf-kernel SVM: {}".format(clf_3.score(validation_x, validation_y)))
# 2D visualization
# For some reason this plot is not working
# ml_visualization.show_2D_results(clf, (train_x, train_y, 'Training'), (validation_x, validation_y, 'Validation'))
# ml_visualization.show_2D_results(clf_2, (train_x, train_y, 'Training'), (validation_x, validation_y, 'Validation'))
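Since ml_visualization.show_2D_results is not working here, below is a minimal, hypothetical matplotlib sketch of the decision regions of clf_3 (the rbf SVM), assuming the two scaled features ('table', 'x') and the splits defined above:
# Sketch: decision regions of the rbf SVM over the two scaled features
x_min, x_max = train_x[:, 0].min() - 1, train_x[:, 0].max() + 1
y_min, y_max = train_x[:, 1].min() - 1, train_x[:, 1].max() + 1
xx, yy = np.meshgrid(np.linspace(x_min, x_max, 200), np.linspace(y_min, y_max, 200))
zz = clf_3.predict(np.c_[xx.ravel(), yy.ravel()]).reshape(xx.shape)
plt.figure(figsize=(8, 6))
plt.contourf(xx, yy, zz, alpha=0.3, cmap='viridis')
plt.scatter(train_x[:, 0], train_x[:, 1], c=train_y, s=15, cmap='viridis', edgecolor='k')
plt.xlabel('table (scaled)')
plt.ylabel('x (scaled)')
plt.title('Decision regions of the rbf SVM (sketch)')
plt.show()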
Now we would like to sum up with a triplet of plots representing the mean price trend in relation to clarity, cut and color. With these, a person could estimate whether a diamond with certain features is worth buying.
def display_price(df, age = (1,7), price = (100,100000), vehicle_type = "all", state = "all"):
# Display the average price of the diamonds for the column passed as 'vehicle_type', restricted to the category range 'age' and the price range 'price' (the 'state' parameter is unused).
df = df[(df[vehicle_type] <= age[1]) & (df[vehicle_type] >= age[0])]
df = df[(df["price"] >= price[0]) & (df["price"] <= price[1])]
price_age = pd.pivot_table(df, values = "price", index = vehicle_type, aggfunc= np.average)
price_age.columns = ["Average Price"]
fig = plt.figure(figsize=(12,6))
ax = fig.add_axes([0,0,1,1])
ax2 = fig.add_axes([0.6,0.47,.35,.35])
ax.plot(price_age["Average Price"], lw = 5)
ax2.set_title(f"Column type: {vehicle_type}\nNumber of diamonds: {df.shape[0]}", fontsize = 15)
ax2.axis('off')
ax.set_title(f"Average price by {vehicle_type} of the diamonds",fontsize=25)
ax.set_ylim(0,price_age["Average Price"].max()+1000)
ax.set_xlabel(vehicle_type, fontsize = 15)
ax.set_ylabel("Average price in $", fontsize = 15)
ax.tick_params(axis='both', which='major', labelsize=15)
# plt.savefig(f"medians{vehicle_type}.png",dpi=300, bbox_inches='tight')
display_price(df_mapped, vehicle_type="color")
display_price(df_mapped,age=(1, 5), vehicle_type="cut")
display_price(df_mapped,age=(3, 10), vehicle_type="clarity")