%% Cell type:markdown id: tags:
<h1>MLDM Lab week 1: Introduction to Iris dataset and using a simple Decision Tree classifier from the sklearn library</h1>
%% Cell type:markdown id: tags:
<h3> <font color="blue"> Introduction </h3>
<p>In this lab session we are going to learn about loading and exploring a dataset in Python. We will also use the sklearn library to create a simple decision tree classifier.</p>
<p>The dataset we are going to use is <b>Iris flowers dataset</b>, which is well-known dataset for demonstrating ML and pattern recognition algorithms. The data set contains 3 classes of flowers with 50 instances for each</p>
<p>More details about the dataset can be found from <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">here</a></p>
%% Cell type:markdown id: tags:
<h3> <font color="blue"> Lab goals</font> </h3>
<p> 1. Introduction to the Iris dataset </p>
<p> 2. Learn how to load and explore a dataset and to understand its structure using statistical summaries and data visualization </p>
<p> 3. Use sklearn to learn and evaluate a simple Decision Tree classifier. </p>
%% Cell type:markdown id: tags:
<h2> <font color="blue"> About the Iris dataset</font> </h2>
<p> The dataset contains 150 observations of Iris flowers. There are four columns of measurements of the flowers in centimeters: 'sepal-length', 'sepal-width', 'petal-length', 'petal-width'. The fifth column is the class of the observed flower: Setosa, Versicolor or Virginica. </p> <i>Please refer to the image below for more information.</i>
<img src="https://miro.medium.com/max/1100/0*SHhnoaaIm36pc1bd" width=500></img>
%% Cell type:markdown id: tags:
<h3> <font color="blue"> Import libraries </font> </h3>
<p>First, let’s import all the modules we are going to use. We use pandas to load the dataset, and we will also use it to explore the data with both descriptive statistics and data visualization.</p>
<p><b>You may need to install libraries (e.g. pandas) if you get error messages. These can be installed using pip, as you did for Jupyter. Please inform the lab demonstrators if you need help with this. </b></p>
%% Cell type:code id: tags:
``` python
from pandas import read_csv
import pandas as pd
import numpy as np
```
%% Cell type:markdown id: tags:
<h3> <font color="blue"> Load dataset </font> </h3>
<p>We are using pandas to load and explore the data. Note that we are specifying the names of each column when loading the data. This will help later when we explore the data.</p>
<p><b> You can also download the <a href="https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv">iris.csv</a> file into your working directory and load it using the same method, changing the URL to the local file name (see the sketch after the next cell). </b></p>
%% Cell type:code id: tags:
``` python
url = "https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv"
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
```
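%% Cell type:markdown id: tags:
<p><i>A minimal sketch of the local-file alternative described above, assuming you have downloaded the file as <b>iris.csv</b> into your working directory:</i></p>
%% Cell type:code id: tags:
``` python
# Alternative: load iris.csv from the working directory instead of the URL.
# Assumes the file has been saved as 'iris.csv' next to this notebook;
# 'names' is the column list defined in the previous cell.
local_path = "iris.csv"
dataset_local = read_csv(local_path, names=names)
```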
%% Cell type:markdown id: tags:
<h2> <font color="blue"> Summarize the dataset</font> </h2>
<p>In this step we are going to take a look at the dataset in different ways:</p>
<p>1. Dimensions of the dataset.</p>
<p>2. Peek at the data itself.</p>
<p>3. Statistical summary of all attributes.</p>
<p>4. Breakdown of the data by the class variable.</p>
%% Cell type:markdown id: tags:
<h3> <font color="blue"> 1. Dimensions of the dataset </font> </h3>
<p>We can get a quick idea of how many instances (rows) and how many attributes (columns) the data contains with the shape property.</p>
%% Cell type:code id: tags:
``` python
print(dataset.shape)
```
%% Output
(150, 5)
%% Cell type:markdown id: tags:
<h3> <font color="blue"> 2. Peek at the data </font> </h3>
<p>We can get a quick idea about the contents of the data.</p>
%% Cell type:code id: tags:
``` python
# First 5 Data
print("\nFirst 5 rows of the dataset\n")
print(dataset.head(5))
# Last 5 Data
print("\nLast 5 rows of the dataset\n")
print(dataset.tail(5))
```
%% Output
First 5 rows of the dataset
sepal-length sepal-width petal-length petal-width class
0 5.1 3.5 1.4 0.2 Iris-setosa
1 4.9 3.0 1.4 0.2 Iris-setosa
2 4.7 3.2 1.3 0.2 Iris-setosa
3 4.6 3.1 1.5 0.2 Iris-setosa
4 5.0 3.6 1.4 0.2 Iris-setosa
Last 5 rows of the dataset
sepal-length sepal-width petal-length petal-width class
145 6.7 3.0 5.2 2.3 Iris-virginica
146 6.3 2.5 5.0 1.9 Iris-virginica
147 6.5 3.0 5.2 2.0 Iris-virginica
148 6.2 3.4 5.4 2.3 Iris-virginica
149 5.9 3.0 5.1 1.8 Iris-virginica
%% Cell type:markdown id: tags:
<h3> <font color="blue"> 3. . Statistical summary </font> </h3>
<p>Now we can take a look at a summary of each attribute.
This includes the count, mean, the min and max values as well as some percentiles.</p>
%% Cell type:code id: tags:
``` python
print(dataset.describe())
```
%% Output
sepal-length sepal-width petal-length petal-width
count 150.000000 150.000000 150.000000 150.000000
mean 5.843333 3.054000 3.758667 1.198667
std 0.828066 0.433594 1.764420 0.763161
min 4.300000 2.000000 1.000000 0.100000
25% 5.100000 2.800000 1.600000 0.300000
50% 5.800000 3.000000 4.350000 1.300000
75% 6.400000 3.300000 5.100000 1.800000
max 7.900000 4.400000 6.900000 2.500000
%% Cell type:markdown id: tags:
<h3> <font color="blue"> 4. Class distribution </font> </h3>
<p>Let’s now take a look at the number of instances (rows) that belong to each class. We can view this as an absolute count; a sketch of relative frequencies follows the output below.</p>
%% Cell type:code id: tags:
``` python
# class distribution
print("\nClass Distribution\n")
print(dataset.groupby('class').size())
```
%% Output
Class Distribution
class
Iris-setosa 50
Iris-versicolor 50
Iris-virginica 50
dtype: int64
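%% Cell type:markdown id: tags:
<p><i>As sketched below, we can also view the class distribution as relative frequencies rather than absolute counts, using pandas' value_counts:</i></p>
%% Cell type:code id: tags:
``` python
# Relative class frequencies: each class makes up one third of the data here.
print(dataset['class'].value_counts(normalize=True))
```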
%% Cell type:markdown id: tags:
<h2> <font color="blue"> Data visualization </font></h2>
<p>We now have a basic idea about the data. We need to extend that with some visualizations.</p>
<p>We are going to look at two types of plots:</p>
<p>1. Univariate plots to better understand each attribute.</p>
<p>2. Multivariate plots to better understand the relationships between attributes.</p>
%% Cell type:markdown id: tags:
<h3> <font color="blue"> 1. Univariate plots </font> </h3>
<p>We start with some univariate plots, that is, plots of each individual variable.
Given that the input variables are numeric, we can create box and whisker plots of each.</p>
<p><b>These plots give us a much clearer idea of the distribution of the input attributes</b></p>
<p><i>We have to import <b>pyplot</b> from <b>matplotlib</b>.</i></p>
%% Cell type:code id: tags:
``` python
from matplotlib import pyplot
dataset.plot(kind='box', subplots=True, layout=(2,2), sharex=False, sharey=False)
pyplot.show()
```
%% Output
%% Cell type:markdown id: tags:
<b>We can also create a histogram of each input variable to get an idea of the distribution.</b>
%% Cell type:code id: tags:
``` python
dataset.hist()
pyplot.show()
```
%% Output
%% Cell type:markdown id: tags:
<p>Note from the above histograms that two of the input variables appear to have approximately Gaussian distributions. This is useful to note, as we can use algorithms that exploit this assumption.</p>
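%% Cell type:markdown id: tags:
<p><i>A complementary sketch: kernel density estimates smooth the histograms and can make approximately Gaussian attributes easier to spot. Note that pandas' density plots require scipy to be installed.</i></p>
%% Cell type:code id: tags:
``` python
# Kernel density estimate of each numeric attribute (a smoothed histogram).
# Requires scipy; pandas drops the non-numeric 'class' column automatically.
dataset.plot(kind='density', subplots=True, layout=(2,2), sharex=False)
pyplot.show()
```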
%% Cell type:markdown id: tags:
<h3> <font color="blue"> 2. Multivariate plots </font> </h3>
<p>Now we look at some multivariate plots, that is, interactions between the variables.</p>
<p><b>First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.</b></p>
<p><i>We have to import <b>scatter_matrix</b> from <b>pandas</b>.</i></p>
%% Cell type:code id: tags:
``` python
from pandas.plotting import scatter_matrix
scatter_matrix(dataset)
pyplot.show()
```
%% Output
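%% Cell type:markdown id: tags:
<p><i>An optional extension of the plain scatter matrix, sketched below: colouring each point by its class often makes the three groups stand out. The colour assignment here is an arbitrary choice for illustration.</i></p>
%% Cell type:code id: tags:
``` python
# Colour each point by its class to highlight the group structure.
colors = {'Iris-setosa': 'red', 'Iris-versicolor': 'green', 'Iris-virginica': 'blue'}
scatter_matrix(dataset, c=dataset['class'].map(colors), diagonal='hist')
pyplot.show()
```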
%% Cell type:markdown id: tags:
<h2><font color="blue"> Learning and evaluating a Decision Tree classifier </font></h2>
<p>We will use the scikit-learn library to build a decision tree model. We will be using the iris dataset to build a decision tree classifier. The task is to predict the class of the iris plant based on the attributes.</p>
%% Cell type:markdown id: tags:
<h3> <font color="blue"> Extracting attributes & labels </font> </h3>
<p>We have already imported the iris data into the variable 'dataset'. We will now extract the attribute data (X) and the corresponding labels (y).</p>
%% Cell type:code id: tags:
``` python
# Extracting data attributes
X = dataset.values[:,0:4]
print (X)
# Extracting target/ class labels
y = dataset.values[:,4]
print (y)
```
%% Output
[[5.1 3.5 1.4 0.2]
[4.9 3.0 1.4 0.2]
[4.7 3.2 1.3 0.2]
[4.6 3.1 1.5 0.2]
[5.0 3.6 1.4 0.2]
[5.4 3.9 1.7 0.4]
[4.6 3.4 1.4 0.3]
[5.0 3.4 1.5 0.2]
[4.4 2.9 1.4 0.2]
[4.9 3.1 1.5 0.1]
[5.4 3.7 1.5 0.2]
[4.8 3.4 1.6 0.2]
[4.8 3.0 1.4 0.1]
[4.3 3.0 1.1 0.1]
[5.8 4.0 1.2 0.2]
[5.7 4.4 1.5 0.4]
[5.4 3.9 1.3 0.4]
[5.1 3.5 1.4 0.3]
[5.7 3.8 1.7 0.3]
[5.1 3.8 1.5 0.3]
[5.4 3.4 1.7 0.2]
[5.1 3.7 1.5 0.4]
[4.6 3.6 1.0 0.2]
[5.1 3.3 1.7 0.5]
[4.8 3.4 1.9 0.2]
[5.0 3.0 1.6 0.2]
[5.0 3.4 1.6 0.4]
[5.2 3.5 1.5 0.2]
[5.2 3.4 1.4 0.2]
[4.7 3.2 1.6 0.2]
[4.8 3.1 1.6 0.2]
[5.4 3.4 1.5 0.4]
[5.2 4.1 1.5 0.1]
[5.5 4.2 1.4 0.2]
[4.9 3.1 1.5 0.1]
[5.0 3.2 1.2 0.2]
[5.5 3.5 1.3 0.2]
[4.9 3.1 1.5 0.1]
[4.4 3.0 1.3 0.2]
[5.1 3.4 1.5 0.2]
[5.0 3.5 1.3 0.3]
[4.5 2.3 1.3 0.3]
[4.4 3.2 1.3 0.2]
[5.0 3.5 1.6 0.6]
[5.1 3.8 1.9 0.4]
[4.8 3.0 1.4 0.3]
[5.1 3.8 1.6 0.2]
[4.6 3.2 1.4 0.2]
[5.3 3.7 1.5 0.2]
[5.0 3.3 1.4 0.2]
[7.0 3.2 4.7 1.4]
[6.4 3.2 4.5 1.5]
[6.9 3.1 4.9 1.5]
[5.5 2.3 4.0 1.3]
[6.5 2.8 4.6 1.5]
[5.7 2.8 4.5 1.3]
[6.3 3.3 4.7 1.6]
[4.9 2.4 3.3 1.0]
[6.6 2.9 4.6 1.3]
[5.2 2.7 3.9 1.4]
[5.0 2.0 3.5 1.0]
[5.9 3.0 4.2 1.5]
[6.0 2.2 4.0 1.0]
[6.1 2.9 4.7 1.4]
[5.6 2.9 3.6 1.3]
[6.7 3.1 4.4 1.4]
[5.6 3.0 4.5 1.5]
[5.8 2.7 4.1 1.0]
[6.2 2.2 4.5 1.5]
[5.6 2.5 3.9 1.1]
[5.9 3.2 4.8 1.8]
[6.1 2.8 4.0 1.3]
[6.3 2.5 4.9 1.5]
[6.1 2.8 4.7 1.2]
[6.4 2.9 4.3 1.3]
[6.6 3.0 4.4 1.4]
[6.8 2.8 4.8 1.4]
[6.7 3.0 5.0 1.7]
[6.0 2.9 4.5 1.5]
[5.7 2.6 3.5 1.0]
[5.5 2.4 3.8 1.1]
[5.5 2.4 3.7 1.0]
[5.8 2.7 3.9 1.2]
[6.0 2.7 5.1 1.6]
[5.4 3.0 4.5 1.5]
[6.0 3.4 4.5 1.6]
[6.7 3.1 4.7 1.5]
[6.3 2.3 4.4 1.3]
[5.6 3.0 4.1 1.3]
[5.5 2.5 4.0 1.3]
[5.5 2.6 4.4 1.2]
[6.1 3.0 4.6 1.4]
[5.8 2.6 4.0 1.2]
[5.0 2.3 3.3 1.0]
[5.6 2.7 4.2 1.3]
[5.7 3.0 4.2 1.2]
[5.7 2.9 4.2 1.3]
[6.2 2.9 4.3 1.3]
[5.1 2.5 3.0 1.1]
[5.7 2.8 4.1 1.3]
[6.3 3.3 6.0 2.5]
[5.8 2.7 5.1 1.9]
[7.1 3.0 5.9 2.1]
[6.3 2.9 5.6 1.8]
[6.5 3.0 5.8 2.2]
[7.6 3.0 6.6 2.1]
[4.9 2.5 4.5 1.7]
[7.3 2.9 6.3 1.8]
[6.7 2.5 5.8 1.8]
[7.2 3.6 6.1 2.5]
[6.5 3.2 5.1 2.0]
[6.4 2.7 5.3 1.9]
[6.8 3.0 5.5 2.1]
[5.7 2.5 5.0 2.0]
[5.8 2.8 5.1 2.4]
[6.4 3.2 5.3 2.3]
[6.5 3.0 5.5 1.8]
[7.7 3.8 6.7 2.2]
[7.7 2.6 6.9 2.3]
[6.0 2.2 5.0 1.5]
[6.9 3.2 5.7 2.3]
[5.6 2.8 4.9 2.0]
[7.7 2.8 6.7 2.0]
[6.3 2.7 4.9 1.8]
[6.7 3.3 5.7 2.1]
[7.2 3.2 6.0 1.8]
[6.2 2.8 4.8 1.8]
[6.1 3.0 4.9 1.8]
[6.4 2.8 5.6 2.1]
[7.2 3.0 5.8 1.6]
[7.4 2.8 6.1 1.9]
[7.9 3.8 6.4 2.0]
[6.4 2.8 5.6 2.2]
[6.3 2.8 5.1 1.5]
[6.1 2.6 5.6 1.4]
[7.7 3.0 6.1 2.3]
[6.3 3.4 5.6 2.4]
[6.4 3.1 5.5 1.8]
[6.0 3.0 4.8 1.8]
[6.9 3.1 5.4 2.1]
[6.7 3.1 5.6 2.4]
[6.9 3.1 5.1 2.3]
[5.8 2.7 5.1 1.9]
[6.8 3.2 5.9 2.3]
[6.7 3.3 5.7 2.5]
[6.7 3.0 5.2 2.3]
[6.3 2.5 5.0 1.9]
[6.5 3.0 5.2 2.0]
[6.2 3.4 5.4 2.3]
[5.9 3.0 5.1 1.8]]
['Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa' 'Iris-setosa'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor' 'Iris-versicolor'
'Iris-versicolor' 'Iris-versicolor' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica'
'Iris-virginica' 'Iris-virginica' 'Iris-virginica' 'Iris-virginica']
%% Cell type:markdown id: tags:
<h3> <font color="blue"> Creating train and test datasets</font> </h3>
<p>We need a way to check that the model we create is good.</p>
<p>We split the loaded dataset into two partitions: <b>75%</b> of the data, which we will use to train, evaluate and select among our models, and <b>25%</b> that we will hold back as a test dataset.</p>
<p>We use the <b>'train_test_split'</b> function from <b>sklearn.model_selection</b>.</p>
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state = 47, test_size = 0.25)
```
%% Cell type:markdown id: tags:
You now have training data in <b>X_train</b> and <b>y_train</b> for building models, and <b>X_test</b> and <b>y_test</b> held back for evaluation later. A stratified variant of the split is sketched below.
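%% Cell type:markdown id: tags:
<p><i>An optional refinement, sketched under the same settings: passing <b>stratify=y</b> keeps the class proportions identical in both partitions, which matters more on small or imbalanced datasets.</i></p>
%% Cell type:code id: tags:
``` python
# Stratified variant of the same split: both partitions keep the 1/3-1/3-1/3 class balance.
X_train_s, X_test_s, y_train_s, y_test_s = train_test_split(
    X, y, random_state=47, test_size=0.25, stratify=y)
```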
%% Cell type:markdown id: tags:
<h3> <font color="blue"> Building a model </font> </h3>
<p>We first import the <b>DecisionTreeClassifier</b> class from the <b>sklearn</b> library. Next, we set the 'criterion' to 'entropy', which makes the tree use information gain as its attribute-splitting measure.</p>
%% Cell type:code id: tags:
``` python
from sklearn.tree import DecisionTreeClassifier
DTree = DecisionTreeClassifier(criterion = 'entropy')
```
%% Cell type:markdown id: tags:
<h3> <font color="blue"> Training a model </font> </h3>
<p>Next, we will fit the classifier on the train attributes and labels.</p>
%% Cell type:code id: tags:
``` python
DTree.fit(X_train, y_train)
```
%% Output
DecisionTreeClassifier(criterion='entropy')
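%% Cell type:markdown id: tags:
<p><i>If you want to inspect what the fitted tree actually learned, here is a small sketch using sklearn's plot_tree (available in scikit-learn 0.21 and later):</i></p>
%% Cell type:code id: tags:
``` python
# Visualise the fitted tree: each node shows its split, entropy and class counts.
from sklearn import tree
pyplot.figure(figsize=(12, 8))
tree.plot_tree(DTree, feature_names=names[0:4], class_names=DTree.classes_, filled=True)
pyplot.show()
```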
%% Cell type:markdown id: tags:
<h3> <font color="blue"> Make predictions </font> </h3>
<p>Now, we will use the trained classifier to predict the labels of the test set.</p>
%% Cell type:code id: tags:
``` python
y_pred = DTree.predict(X_test)
```
%% Cell type:markdown id: tags:
<h3> <font color="blue"> Evaluate a model </font> </h3>
<p>We can evaluate the model's predictions by comparing them to the expected results, then compute the overall classification accuracy, a confusion matrix and a classification report.</p>
%% Cell type:markdown id: tags:
##### 1. Accuracy
<p> First, we have to import the 'accuracy_score' function from sklearn.metrics</p>
%% Cell type:code id: tags:
``` python
from sklearn.metrics import accuracy_score
print('Accuracy Score on train data:', accuracy_score(y_true=y_train, y_pred=DTree.predict(X_train)))
```
%% Output
Accuracy Score on train data: 1.0
%% Cell type:markdown id: tags:
##### 2. Confusion Matrix
<p> Similarly, we can import and use the 'confusion_matrix' function</p>
%% Cell type:code id: tags:
``` python
from sklearn.metrics import confusion_matrix
print('Confusion Matrix of train data: ', confusion_matrix(y_true=y_train, y_pred=DTree.predict(X_train)))
```
%% Output
Confusion Matrix of train data: [[35 0 0]
[ 0 42 0]
[ 0 0 35]]
%% Cell type:markdown id: tags:
##### 3. Classification Report
<p>The classification report displays the precision, recall, F1, and support scores for the model.</p>
<p> This is implemented in the 'classification_report' function</p>
%% Cell type:code id: tags:
``` python
from sklearn.metrics import classification_report
print(classification_report(y_true=y_train, y_pred=DTree.predict(X_train)))
```
%% Output
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 35
Iris-versicolor 1.00 1.00 1.00 42
Iris-virginica 1.00 1.00 1.00 35
accuracy 1.00 112
macro avg 1.00 1.00 1.00 112
weighted avg 1.00 1.00 1.00 112
%% Cell type:markdown id: tags:
<h3><font color="red">Exercise 1 </h3>
<p>As you probably noticed we only evaluated the model on the training data. Modify the code above to evaluate the model on the test dataset and generate accuracy value, confusion matrix and classification report. Use the code cell below for this purpose</p>
%% Cell type:code id: tags:
``` python
# Answer to Exercise 1
# Evaluate the model on the test dataset
print('Accuracy Score on test data:', accuracy_score(y_true=y_test, y_pred=DTree.predict(X_test)))
print('Confusion Matrix of test data: ', confusion_matrix(y_true=y_test, y_pred=DTree.predict(X_test)))
print(classification_report(y_true=y_test, y_pred=DTree.predict(X_test)))
```
%% Output
Accuracy Score on test data: 0.9736842105263158
Confusion Matrix of test data: [[15 0 0]
[ 0 8 0]
[ 0 1 14]]
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 15
Iris-versicolor 0.89 1.00 0.94 8
Iris-virginica 1.00 0.93 0.97 15
accuracy 0.97 38
macro avg 0.96 0.98 0.97 38
weighted avg 0.98 0.97 0.97 38
%% Cell type:markdown id: tags:
<h3><font color="red">Exercise 2 </h3>
<p>Compare the results from evaluating the model on the training and test data and explain your observations. Use the markdown cell below to provide your answer to Excercise 2</p>
%% Cell type:markdown id: tags:
<h3>Answer to Exercise 2 </h3>
<p>Use this space to provide your answer to Exercise 2</p>
<p>The accuracy of 1.0 on the training data shows the model fits the training set perfectly, while the 0.97 accuracy on the test data shows it still predicts unseen examples well. However, the perfect training score hints at overfitting: an unconstrained decision tree is complex enough to trace the training data exactly, so its training accuracy overstates how well it will generalize.</p>
%% Cell type:markdown id: tags:
<h3><font color="red">Exercise 3 </h3>
<p>Try to improve the accuracy of the model on the test dataset by tuning the parameters of the decision tree learning algorithm. One of those parameters is <b>'min_samples_split'</b>, which is the minimum number of samples required to split an internal node. Its default value is equal to 2 because we cannot split on a node containing only one example/ sample. More details about 'min_samples_split' and other parameters can be found from <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html">here</a></p>
<p>Use the code cell below to write your code for Excercises 3</p>
%% Cell type:code id: tags:
``` python
# Answer to Exercise 3
# tune the parameters of the decision tree to increase its accuracy on the test data
import warnings
warnings.filterwarnings('ignore')
from sklearn.model_selection import RandomizedSearchCV
DTree = DecisionTreeClassifier()
hyper_p = {
    'max_depth': [2, 3, 4, 5, 10, 15],
    'min_samples_split': [2, 6, 10, 14, 15, 20, 25, 30],
    'min_samples_leaf': [2, 3, 5, 7, 9, 11],
    'criterion': ['gini', 'entropy'],
    'max_features': ['sqrt', 'log2']  # 'auto' was deprecated and later removed in sklearn
}
DTclass = RandomizedSearchCV(DTree, param_distributions=hyper_p, n_iter=100, cv=5, n_jobs=-1)
DTclass.fit(X_train, y_train)
print(f"Best parameters: {DTclass.best_params_}")
print('Accuracy Score on test data:', accuracy_score(y_true=y_test, y_pred=DTclass.predict(X_test)))
print('Confusion Matrix of test data: ', confusion_matrix(y_true=y_test, y_pred=DTclass.predict(X_test)))
print(classification_report(y_true=y_test, y_pred=DTclass.predict(X_test)))
```
%% Output
Best parameters: {'min_samples_split': 14, 'min_samples_leaf': 2, 'max_features': 'log2', 'max_depth': 5, 'criterion': 'entropy'}
Accuracy Score on test data: 0.9473684210526315
Confusion Matrix of test data: [[15 0 0]
[ 0 8 0]
[ 0 2 13]]
precision recall f1-score support
Iris-setosa 1.00 1.00 1.00 15
Iris-versicolor 0.80 1.00 0.89 8
Iris-virginica 1.00 0.87 0.93 15
accuracy 0.95 38
macro avg 0.93 0.96 0.94 38
weighted avg 0.96 0.95 0.95 38
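%% Cell type:markdown id: tags:
<p><i>As a simpler alternative to the randomized search above, the sketch below varies only <b>min_samples_split</b> (the parameter named in the exercise) and reports the test accuracy for each value; the value list and random_state are arbitrary choices.</i></p>
%% Cell type:code id: tags:
``` python
# Sweep min_samples_split on its own and compare test accuracy.
for mss in [2, 5, 10, 20, 40]:
    clf = DecisionTreeClassifier(criterion='entropy', min_samples_split=mss, random_state=47)
    clf.fit(X_train, y_train)
    print(f"min_samples_split={mss}: test accuracy = "
          f"{accuracy_score(y_test, clf.predict(X_test)):.3f}")
```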
%% Cell type:markdown id: tags:
<h3><font color="red">Save your notebook after completing the excercises and submit it to SurreyLearn (Assignments -> Lab Exercises - Week 1) as a python notebook file in ipynb formt. </h3>