"<p>In this lab session we are going to learn about loading and exploring a dataset in Python. We will also use the sklearn library to create a simple decision tree classifier.</p>\n",
"<p>The dataset we are going to use is <b>Iris flowers dataset</b>, which is well-known dataset for demonstrating ML and pattern recognition algorithms. The data set contains 3 classes of flowers with 50 instances for each</p>\n",
"<p>More details about the dataset can be found from <a href=\"https://en.wikipedia.org/wiki/Iris_flower_data_set\">here</a></p>"
"<p> 2. Learn how to load and explore a dataset and to understand its structure using statistical summaries and data visualization </p>\n",
"<p> 3. Use sklearn to learn and evaluate a simple Decision Tree classifier. </p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h2> <font color=\"blue\"> About the Iris dataset</font> </h2>\n",
"<p> The dataset contains 150 observations of Iris flowers. There are four columns of measurements of the flowers in centimeters: 'sepal-length', 'sepal-width', 'petal-length', 'petal-width'. The fifth column is the class of the observed flower: Setosa, Versicolor and Virginica. </p> <i>Please refer the image below for more information.</i>\n",
"<p>First, let’s import all the modules we are going to use to load the dataset. We are using pandas to load the data. We will also use pandas next to explore the data both with descriptive statistics and data visualization.</p>\n",
"<p><b>You may need to install librariles (e.g. pandas) if you get error messages. These could be installed using pip as you did for Jupyter. Please inform the lab demonstrators if you need help with this. </b></p>"
"<p>We are using pandas to load the data and explore the data. Note that we are specifying the names of each column when loading the data. This will help later when we explore the data.</p>\n",
"<p><b> You can also download <a href=\"https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv\">iris.csv</a> file into your working directory and load it using the same method, changing URL to the local file name. </b></p>"
"<p>Note that from the above histograms; two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.</p>"
"<p>We start with some multivariate plots, that is, interactions between the variables.\n",
"<p><b>First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.</b></p>\n",
"<p><i>We have to import the package <b>scatter_matrix</b> from the <b>pandas</b>.</i></p>"
"<h2><font color=\"blue\"> Learning and evaluating a Decision Tree classifier </font></h2>\n",
"\n",
"<p>We will use the scikit-learn library to build a decision tree model. We will be using the iris dataset to build a decision tree classifier. The task is to predict the class of the iris plant based on the attributes.</p>"
"<h3> <font color=\"blue\"> Creating train and test datasets</font> </h3>\n",
"<p>We need to know that the model we created is good.</p>\n",
"\n",
"<p>We split the loaded dataset into two partitions, <b>75%</b> of which we will use to train, evaluate and select among our models, and <b>25%</b> that we will hold back as a test dataset.</p>\n",
"\n",
"<p>We have to use <b>'train_test_split'</b> function from the package <b>sklearn</b></p>"
"You now have training data in the <b>X_train</b> and <b>Y_train</b> for preparing models and a <b>X_validation</b> and <b>Y_validation</b> sets that we can use later."
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3> <font color=\"blue\"> Building a model </font> </h3>\n",
"<p>We want to import the <b>DecisionTreeClassifier</b> function from the <b>sklearn</b> library. Next, we will set the 'criterion' to 'entropy', which sets the measure for splitting the attribute to information gain.</p>"
"</style><div id=\"sk-container-id-2\" class=\"sk-top-container\"><div class=\"sk-text-repr-fallback\"><pre>DecisionTreeClassifier(criterion='entropy')</pre><b>In a Jupyter environment, please rerun this cell to show the HTML representation or trust the notebook. <br />On GitHub, the HTML representation is unable to render, please try loading this page with nbviewer.org.</b></div><div class=\"sk-container\" hidden><div class=\"sk-item\"><div class=\"sk-estimator fitted sk-toggleable\"><input class=\"sk-toggleable__control sk-hidden--visually\" id=\"sk-estimator-id-2\" type=\"checkbox\" checked><label for=\"sk-estimator-id-2\" class=\"sk-toggleable__label fitted sk-toggleable__label-arrow fitted\"> DecisionTreeClassifier<a class=\"sk-estimator-doc-link fitted\" rel=\"noreferrer\" target=\"_blank\" href=\"https://scikit-learn.org/1.5/modules/generated/sklearn.tree.DecisionTreeClassifier.html\">?<span>Documentation for DecisionTreeClassifier</span></a><span class=\"sk-estimator-doc-link fitted\">i<span>Fitted</span></span></label><div class=\"sk-toggleable__content fitted\"><pre>DecisionTreeClassifier(criterion='entropy')</pre></div> </div></div></div></div>"
],
"text/plain": [
"DecisionTreeClassifier(criterion='entropy')"
]
},
"execution_count": 41,
"metadata": {},
"output_type": "execute_result"
}
],
"source": [
"DTree.fit(X_train, y_train)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3> <font color=\"blue\"> Make predictions </font> </h3>\n",
"<p>Now, we will use the trained classifier/ model to predict the labels of the test set.</p>"
]
},
{
"cell_type": "code",
"execution_count": 42,
"metadata": {},
"outputs": [],
"source": [
"y_pred = DTree.predict(X_test)"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3> <font color=\"blue\"> Evaluate a model </font> </h3>\n",
"<p>We can evaluate the predictions of the model by comparing them to the expected results, then calculate the overall classification accuracy, as well as a confusion matrix or a classification report.</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 1. Accuracy\n",
"\n",
"<p> First, we have to import the package for 'accuracy_score'</p>"
]
},
{
"cell_type": "code",
"execution_count": 43,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy Score on train data: 1.0\n"
]
}
],
"source": [
"from sklearn.metrics import accuracy_score\n",
"print('Accuracy Score on train data:', accuracy_score(y_true=y_train, y_pred=DTree.predict(X_train)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 2. Confusion Matrix\n",
"\n",
"<p> Similarly, we can import and use the package for 'confusion_matrix'</p>\n"
]
},
{
"cell_type": "code",
"execution_count": 44,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Confusion Matrox of train data: [[35 0 0]\n",
" [ 0 42 0]\n",
" [ 0 0 35]]\n"
]
}
],
"source": [
"from sklearn.metrics import confusion_matrix\n",
"print('Confusion Matrox of train data: ', confusion_matrix(y_true=y_train, y_pred=DTree.predict(X_train)))"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"##### 3. Classification Report\n",
"<p>The classification report displays the precision, recall, F1, and support scores for the model.</p>\n",
"<p> This is implemented in the package 'classification_report'</p>"
" <p>As you probably noticed we only evaluated the model on the training data. Modify the code above to evaluate the model on the test dataset and generate accuracy value, confusion matrix and classification report. Use the code cell below for this purpose</p>"
]
},
{
"cell_type": "code",
"execution_count": 46,
"metadata": {},
"outputs": [
{
"name": "stdout",
"output_type": "stream",
"text": [
"Accuracy Score on test data: 0.9736842105263158\n",
"Confusion Matrix of test data: [[15 0 0]\n",
" [ 0 8 0]\n",
" [ 0 1 14]]\n",
" precision recall f1-score support\n",
"\n",
" Iris-setosa 1.00 1.00 1.00 15\n",
"Iris-versicolor 0.89 1.00 0.94 8\n",
" Iris-virginica 1.00 0.93 0.97 15\n",
"\n",
" accuracy 0.97 38\n",
" macro avg 0.96 0.98 0.97 38\n",
" weighted avg 0.98 0.97 0.97 38\n",
"\n"
]
}
],
"source": [
"# Answer to Exercise 1\n",
"# Evaluate the model on the test dataset\n",
"\n",
"print('Accuracy Score on test data:', accuracy_score(y_true=y_test, y_pred=DTree.predict(X_test)))\n",
"print('Confusion Matrix of test data: ', confusion_matrix(y_true=y_test, y_pred=DTree.predict(X_test)))\n",
" <p>Compare the results from evaluating the model on the training and test data and explain your observations. Use the markdown cell below to provide your answer to Excercise 2</p>"
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3>Answer to Exercise 2 </h3>\n",
" <p>Use this space to provide your answer to Exercise 2\n",
"</p>\n",
"\n",
"Accuracy of 1.0 in training data indicates the perfect fit and generalization and the 0.97 on test data can be a measure of excellent prediction. However it points out every possibilty of overfitting as the model is too complex and is tracing the train data exactly as such. \n",
" \n",
" "
]
},
{
"cell_type": "markdown",
"metadata": {},
"source": [
"<h3><font color=\"red\">Exercise 3 </h3>\n",
" <p>Try to improve the accuracy of the model on the test dataset by tuning the parameters of the decision tree learning algorithm. One of those parameters is <b>'min_samples_split'</b>, which is the minimum number of samples required to split an internal node. Its default value is equal to 2 because we cannot split on a node containing only one example/ sample. More details about 'min_samples_split' and other parameters can be found from <a href=\"https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html\">here</a></p>\n",
" \n",
"<p>Use the code cell below to write your code for Excercises 3</p>"
"<h3><font color=\"red\">Save your notebook after completing the excercises and submit it to SurreyLearn (Assignments -> Lab Exercises - Week 1) as a python notebook file in ipynb formt. </h3>"
]
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.10.12"
}
},
"nbformat": 4,
"nbformat_minor": 4
}
%% Cell type:markdown id: tags:
<h1>MLDM Lab week 1: Introduction to Iris dataset and using a simple Decision Tree classifier from the sklearn library</h1>
%% Cell type:markdown id: tags:
<h3><fontcolor="blue"> Introduction </h3>
<p>In this lab session we are going to learn about loading and exploring a dataset in Python. We will also use the sklearn library to create a simple decision tree classifier.</p>
<p>The dataset we are going to use is the <b>Iris flowers dataset</b>, which is a well-known dataset for demonstrating ML and pattern recognition algorithms. The dataset contains 3 classes of flowers with 50 instances of each.</p>
<p>More details about the dataset can be found <a href="https://en.wikipedia.org/wiki/Iris_flower_data_set">here</a>.</p>
%% Cell type:markdown id: tags:
<h3><fontcolor="blue"> Lab goals</font></h3>
<p> 1. Introduction to the Iris dataset </p>
<p> 2. Learn how to load and explore a dataset and to understand its structure using statistical summaries and data visualization </p>
<p> 3. Use sklearn to learn and evaluate a simple Decision Tree classifier. </p>
%% Cell type:markdown id: tags:
<h2><fontcolor="blue"> About the Iris dataset</font></h2>
<p> The dataset contains 150 observations of Iris flowers. There are four columns of measurements of the flowers in centimeters: 'sepal-length', 'sepal-width', 'petal-length', 'petal-width'. The fifth column is the class of the observed flower: Setosa, Versicolor and Virginica. </p><i>Please refer to the image below for more information.</i>
<p>First, let’s import all the modules we are going to use to load the dataset. We are using pandas to load the data. We will also use pandas next to explore the data both with descriptive statistics and data visualization.</p>
<p><b>You may need to install libraries (e.g. pandas) if you get error messages. These can be installed using pip, as you did for Jupyter. Please inform the lab demonstrators if you need help with this. </b></p>
%% Cell type:code id: tags:
``` python
from pandas import read_csv
import pandas as pd
import numpy as np
```
%% Cell type:markdown id: tags:
<h3><fontcolor="blue"> Load dataset </font></h3>
<p>We are using pandas to load and explore the data. Note that we are specifying the names of each column when loading the data. This will help later when we explore the data.</p>
<p><b> You can also download the <a href="https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv">iris.csv</a> file into your working directory and load it using the same method, changing the URL to the local file name. </b></p>
<b>We can also create a histogram of each input variable to get an idea of the distribution.</b>
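<p><i>Before plotting, here is a minimal sketch of the loading step described above, so that the DataFrame <b>dataset</b> used below exists (the URL is the one linked earlier; pyplot is imported here because it is used to display the plots):</i></p>
%% Cell type:code id: tags:
``` python
from matplotlib import pyplot  # used to display the plots below

url = 'https://raw.githubusercontent.com/jbrownlee/Datasets/master/iris.csv'
names = ['sepal-length', 'sepal-width', 'petal-length', 'petal-width', 'class']
dataset = read_csv(url, names=names)
print(dataset.shape)  # expected: (150, 5)
```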
%% Cell type:code id: tags:
``` python
dataset.hist()
pyplot.show()
```
%% Output
%% Cell type:markdown id: tags:
<p>Note from the above histograms that two of the input variables have an approximately Gaussian distribution. This is useful to know, as we can later use algorithms that exploit this assumption.</p>
<p>Next, we look at some multivariate plots, that is, interactions between the variables.</p>
<p><b>First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.</b></p>
<p><i>We have to import the <b>scatter_matrix</b> function from <b>pandas</b>.</i></p>
%% Cell type:code id: tags:
``` python
from pandas.plotting import scatter_matrix
scatter_matrix(dataset)
pyplot.show()
```
%% Output
%% Cell type:markdown id: tags:
<h2><fontcolor="blue"> Learning and evaluating a Decision Tree classifier </font></h2>
<p>We will use the scikit-learn library to build a decision tree model. We will be using the iris dataset to build a decision tree classifier. The task is to predict the class of the iris plant based on the attributes.</p>
<h3><fontcolor="blue"> Creating train and test datasets</font></h3>
<p>We need a way to estimate how well the model we create will perform on data it has not seen.</p>
<p>We split the loaded dataset into two partitions, <b>75%</b> of which we will use to train, evaluate and select among our models, and <b>25%</b> that we will hold back as a test dataset.</p>
<p>We have to use the <b>'train_test_split'</b> function from the <b>sklearn</b> library.</p>
You now have training data in <b>X_train</b> and <b>y_train</b> for building models, and held-back <b>X_test</b> and <b>y_test</b> sets that we can use later to evaluate them.
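<p><i>A minimal sketch of how this split can be produced, assuming the four measurement columns are held in <b>X</b> and the class column in <b>y</b>; <b>random_state</b> is set only to make the split reproducible:</i></p>
%% Cell type:code id: tags:
``` python
from sklearn.model_selection import train_test_split

X = dataset.values[:, 0:4]   # the four measurements
y = dataset.values[:, 4]     # the class labels
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=1)
```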
%% Cell type:markdown id: tags:
<h3><fontcolor="blue"> Building a model </font></h3>
<p>We first import the <b>DecisionTreeClassifier</b> class from the <b>sklearn</b> library. Next, we set the 'criterion' parameter to 'entropy', so that information gain is used as the measure for choosing attribute splits.</p>
%% Cell type:code id: tags:
``` python
from sklearn.tree import DecisionTreeClassifier
DTree = DecisionTreeClassifier(criterion='entropy')
```
%% Cell type:markdown id: tags:
<h3><fontcolor="blue"> Training a model </font></h3>
<p>Next, we will fit the classifier on the train attributes and labels.</p>
%% Cell type:code id: tags:
``` python
DTree.fit(X_train, y_train)
```
%% Output
DecisionTreeClassifier(criterion='entropy')
%% Cell type:markdown id: tags:
<h3><fontcolor="blue"> Make predictions </font></h3>
<p>Now, we will use the trained classifier to predict the labels of the test set.</p>
%% Cell type:code id: tags:
``` python
y_pred = DTree.predict(X_test)
```
%% Cell type:markdown id: tags:
<h3><fontcolor="blue"> Evaluate a model </font></h3>
<p>We can evaluate the predictions of the model by comparing them to the expected results, then calculate the overall classification accuracy, as well as a confusion matrix or a classification report.</p>
%% Cell type:markdown id: tags:
##### 1. Accuracy
<p> First, we have to import the 'accuracy_score' function from <b>sklearn.metrics</b>.</p>
%% Cell type:code id: tags:
``` python
from sklearn.metrics import accuracy_score
print('Accuracy Score on train data:', accuracy_score(y_true=y_train, y_pred=DTree.predict(X_train)))
```
%% Output
Accuracy Score on train data: 1.0
%% Cell type:markdown id: tags:
##### 2. Confusion Matrix
<p> Similarly, we can import and use the 'confusion_matrix' function from <b>sklearn.metrics</b>.</p>
%% Cell type:code id: tags:
``` python
from sklearn.metrics import confusion_matrix
print('Confusion Matrix of train data: ', confusion_matrix(y_true=y_train, y_pred=DTree.predict(X_train)))
```
%% Output
Confusion Matrix of train data: [[35 0 0]
[ 0 42 0]
[ 0 0 35]]
%% Cell type:markdown id: tags:
##### 3. Classification Report
<p>The classification report displays the precision, recall, F1, and support scores for the model.</p>
<p> This is implemented in the 'classification_report' function in <b>sklearn.metrics</b>.</p>
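<p><i>A minimal sketch of this step on the training data; the same pattern applies to the test data in the exercise below:</i></p>
%% Cell type:code id: tags:
``` python
from sklearn.metrics import classification_report

print(classification_report(y_true=y_train, y_pred=DTree.predict(X_train)))
```
%% Cell type:markdown id: tags: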
<p>As you probably noticed, we have so far only evaluated the model on the training data. Modify the code above to evaluate the model on the test dataset and generate the accuracy value, confusion matrix, and classification report. Use the code cell below for this purpose.</p>
%% Cell type:code id: tags:
``` python
# Answer to Exercise 1
# Evaluate the model on the test dataset
print('Accuracy Score on test data:', accuracy_score(y_true=y_test, y_pred=DTree.predict(X_test)))
print('Confusion Matrix of test data: ', confusion_matrix(y_true=y_test, y_pred=DTree.predict(X_test)))

from sklearn.metrics import classification_report
print(classification_report(y_true=y_test, y_pred=DTree.predict(X_test)))
```
%% Cell type:markdown id: tags:
<p>Compare the results from evaluating the model on the training and test data and explain your observations. Use the markdown cell below to provide your answer to Exercise 2</p>
%% Cell type:markdown id: tags:
<h3>Answer to Exercise 2 </h3>
<p>Use this space to provide your answer to Exercise 2
</p>
An accuracy of 1.0 on the training data shows that the decision tree fits the training set perfectly, which is expected because an unpruned tree can keep splitting until every training example is classified correctly. The accuracy of about 0.97 on the test data shows that the model still generalizes well: only one test example is misclassified (a Virginica predicted as Versicolor in the confusion matrix). The perfect training score combined with the small drop on the test set suggests mild overfitting: the tree is complex enough to memorize the training data exactly.
%% Cell type:markdown id: tags:
<h3><fontcolor="red">Exercise 3 </h3>
<p>Try to improve the accuracy of the model on the test dataset by tuning the parameters of the decision tree learning algorithm. One of those parameters is <b>'min_samples_split'</b>, which is the minimum number of samples required to split an internal node. Its default value is 2, because we cannot split a node containing only one example/sample. More details about 'min_samples_split' and other parameters can be found <a href="https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html">here</a>.</p>
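<p><i>A possible starting point, as a sketch only (the value min_samples_split=5 is just an example; try several values and compare the resulting test accuracy):</i></p>
%% Cell type:code id: tags:
``` python
DTree2 = DecisionTreeClassifier(criterion='entropy', min_samples_split=5)
DTree2.fit(X_train, y_train)
print('Accuracy Score on test data:', accuracy_score(y_true=y_test, y_pred=DTree2.predict(X_test)))
```
%% Cell type:markdown id: tags: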
<p>Use the code cell below to write your code for Exercise 3.</p>
%% Cell type:code id: tags:
``` python
# Answer to Exercise 3
# tune the parameters of the decision tree to increase its accuracy on the test data
<h3><fontcolor="red">Save your notebook after completing the excercises and submit it to SurreyLearn (Assignments -> Lab Exercises - Week 1) as a python notebook file in ipynb formt. </h3>