# Python: Machine learning - Scikit-learn Exercises, Practice, Solution

## Python Machine learning Iris flower data set [35 exercises with solution]

Scikit-learn is a free machine learning library for the Python programming language. It features various classification, regression and clustering algorithms including support vector machines, random forests, gradient boosting, k-means and DBSCAN, and is designed to interoperate with the Python numerical and scientific libraries NumPy and SciPy.

The best way to learn is to practice by working through exercises. This section is intended for readers (beginner to intermediate) who are familiar with Python and Scikit-learn. We hope these exercises help you improve your machine learning skills with Scikit-learn. The following sections are currently available, and more exercises are being added. Happy coding!

**Iris flower data set**

From Wikipedia: The Iris flower data set or Fisher's Iris data set is a multivariate data set introduced by the British statistician and biologist Ronald Fisher in his 1936 paper "The use of multiple measurements in taxonomic problems" as an example of linear discriminant analysis. It is sometimes called Anderson's Iris data set because Edgar Anderson collected the data to quantify the morphologic variation of Iris flowers of three related species. Two of the three species were collected in the Gaspé Peninsula "all from the same pasture, and picked on the same day and measured at the same time by the same person with the same apparatus".

The data set consists of 50 samples from each of three species of Iris (Iris setosa, Iris virginica and Iris versicolor). Four features were measured from each sample: the length and the width of the sepals and petals, in centimeters. Based on the combination of these four features, Fisher developed a linear discriminant model to distinguish the species from each other.
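The structure described above can be checked directly. The following is a minimal sketch, assuming the copy of the data bundled with Scikit-learn (`load_iris`) rather than a CSV file; the variable names are illustrative only.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: use the copy of the data set that ships with scikit-learn.
iris = load_iris()
X, y = iris.data, iris.target

# 150 samples (rows) and 4 measured features (columns).
n_samples, n_features = X.shape

# Count how many samples belong to each species.
samples_per_species = dict(
    zip(iris.target_names.tolist(), np.bincount(y).tolist())
)

print(n_samples, n_features)
print(iris.feature_names)
print(samples_per_species)
```

In this bundled copy the species are labelled `setosa`, `versicolor` and `virginica`, without the `Iris-` prefix used in some CSV versions of the data.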

## Basic - Iris flower data set [8 exercises with solution]

**1.** Write a Python program to load the iris data from a given CSV file into a DataFrame and print the shape of the data, the type of the data, and the first 3 rows.

Click me to see the sample solution

**2.** Write a Python program using Scikit-learn to print the keys, the number of rows and columns, the feature names, and the description of the Iris data.

Click me to see the sample solution

**3.** Write a Python program to get the number of observations, missing values, and NaN values.

Click me to see the sample solution

**4.** Write a Python program to create a 2-D array with ones on the diagonal and zeros elsewhere. Then convert the NumPy array to a SciPy sparse matrix in CSR format.

From Wikipedia:

In numerical analysis and scientific computing, a sparse matrix or sparse array is a matrix in which most of the elements are zero. By contrast, if most of the elements are nonzero, then the matrix is considered dense. The number of zero-valued elements divided by the total number of elements (e.g., m x n for an m x n matrix) is called the sparsity of the matrix (which is equal to 1 minus the density of the matrix). Using those definitions, a matrix will be sparse when its sparsity is greater than 0.5.

Click me to see the sample solution
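As a rough illustration of the exercise and of the sparsity definition quoted above, the sketch below builds a 4 x 4 identity matrix (the size is an arbitrary choice), converts it to CSR format, and computes its sparsity as 1 minus its density.

```python
import numpy as np
from scipy import sparse

# 2-D array with ones on the diagonal and zeros elsewhere (size 4 is arbitrary).
dense = np.eye(4)

# Convert the dense NumPy array to a SciPy sparse matrix in CSR format.
csr = sparse.csr_matrix(dense)

# Density = stored non-zero elements / total elements; sparsity = 1 - density.
density = csr.nnz / (csr.shape[0] * csr.shape[1])
sparsity = 1.0 - density

print(csr)
print("sparsity:", sparsity)
```

Here 12 of the 16 elements are zero, so the sparsity is 0.75 and the matrix counts as sparse under the definition above.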

**5.** Write a Python program to view basic statistical details such as percentiles, mean, and standard deviation of the iris data.

Click me to see the sample solution
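One possible approach, sketched here under the assumption that the data is loaded from Scikit-learn as a pandas DataFrame (`as_frame=True`) instead of from iris.csv, is to call `describe()` on the feature columns.

```python
from sklearn.datasets import load_iris

# Assumption: load the bundled data as a pandas DataFrame.
# The frame holds the four feature columns plus a numeric 'target' column.
iris_df = load_iris(as_frame=True).frame

# Summarize only the measurements, not the class label.
features = iris_df.drop(columns="target")
summary = features.describe()  # count, mean, std, min, quartiles, max

print(summary)
```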

**6.** Write a Python program to get the observations of each species (setosa, versicolor, virginica) from the iris data.

Click me to see the sample solution

**7.** Write a Python program to drop the Id column from a given DataFrame and print the modified part. Use iris.csv to create the DataFrame.

Click me to see the sample solution

**8.** Write a Python program to access the first four cells of a given DataFrame using the index and column labels. Use iris.csv to create the DataFrame.

Click me to see the sample solution

## Visualization - Iris flower data set [16 exercises with solution]

**1.** Write a Python program to create a plot showing general statistics of the Iris data.

Click me to see the sample solution

**2.** Write a Python program to create a bar plot showing the frequency of the three species in the Iris data.

Click me to see the sample solution

**3.** Write a Python program to create a pie plot showing the frequency of the three species in the Iris data.

Click me to see the sample solution

**4.** Write a Python program to create a graph showing the relationship between sepal length and sepal width.

Click me to see the sample solution

**5.** Write a Python program to create a graph showing the relationship between petal length and petal width.

Click me to see the sample solution

**6.** Write a Python program to create a graph showing how SepalLength, SepalWidth, PetalLength and PetalWidth are distributed.

Click me to see the sample solution

**7.** Write a Python program to create a jointplot to describe individual distributions on the same plot between Sepal length and Sepal width.

Note: jointplot draws a plot of two variables with bivariate and univariate graphs.

Click me to see the sample solution

**8.** Write a Python program to create a jointplot using "hexbin" to describe individual distributions on the same plot between Sepal length and Sepal width.

Note:

The bivariate analogue of a histogram is known as a "hexbin" plot, because it shows the counts of observations that fall within hexagonal bins. This plot works best with relatively large datasets. It's available through the matplotlib plt.hexbin function and as a style in jointplot(). It looks best with a white background.

Click me to see the sample solution

**9.** Write a Python program to create a jointplot using "kde" to describe individual distributions on the same plot between Sepal length and Sepal width.

Note:

The kernel density estimation (kde) procedure visualizes a bivariate distribution. In seaborn, this kind of plot is shown with a contour plot and is available as a style in jointplot().

Click me to see the sample solution

**10.** Write a Python program to create a jointplot and add regression and kernel density fits using "reg" to describe individual distributions on the same plot between Sepal length and Sepal width.

Click me to see the sample solution

**11.** Write a Python program to draw a scatterplot, then add a joint density estimate to describe individual distributions on the same plot between Sepal length and Sepal width.

Click me to see the sample solution

**12.** Write a Python program to create a jointplot using "kde" to describe individual distributions on the same plot between Sepal length and Sepal width, and use the '+' sign as the marker.

Note:

The kernel density estimation (kde) procedure visualizes a bivariate distribution. In seaborn, this kind of plot is shown with a contour plot and is available as a style in jointplot().

Click me to see the sample solution

**13.** Write a Python program to create a pairplot of the iris data set and check which flower species seems to be the most separable.

Click me to see the sample solution

**14.** Write a Python program to find the correlation between the variables of the iris data. Also create a heatmap using Seaborn to present their relations.

Click me to see the sample solution
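The numerical part of this exercise can be sketched as follows. This is only an illustration: it assumes the data bundled with Scikit-learn and uses NumPy's Pearson correlation, and it leaves out the plotting step. The resulting matrix could then be handed to a plotting function such as seaborn's `heatmap`.

```python
import numpy as np
from sklearn.datasets import load_iris

# Assumption: use the bundled data set; columns are the four measurements.
iris = load_iris()

# Pearson correlation between columns (rowvar=False treats columns as variables).
corr = np.corrcoef(iris.data, rowvar=False)

for name, row in zip(iris.feature_names, corr):
    print(f"{name:20s}", np.round(row, 2))
```

Petal length and petal width are the most strongly correlated pair of features in this data set.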

**15.** Write a Python program to create a box plot (or box-and-whisker plot) of the iris dataset that shows the distribution of quantitative data in a way that facilitates comparisons between variables or across levels of a categorical variable. Use seaborn.

Click me to see the sample solution

**16.** From Wikipedia:

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables (entities each of which takes on various numerical values) into a set of values of linearly uncorrelated variables called principal components. This transformation is defined in such a way that the first principal component has the largest possible variance (that is, accounts for as much of the variability in the data as possible), and each succeeding component in turn has the highest variance possible under the constraint that it is orthogonal to the preceding components. The resulting vectors (each being a linear combination of the variables and containing n observations) are an uncorrelated orthogonal basis set. PCA is sensitive to the relative scaling of the original variables.

Write a Python program to perform a principal component analysis (PCA) of the iris dataset.

Click me to see the sample solution
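A minimal sketch of such an analysis is shown below. It assumes the bundled Scikit-learn data and a projection onto two components (an arbitrary choice that is convenient for plotting). Because PCA is sensitive to the relative scaling of the variables, the features are standardized first; that step is a design choice, not a requirement of the exercise.

```python
from sklearn.datasets import load_iris
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

X = load_iris().data

# Standardize each feature to zero mean and unit variance before PCA.
X_scaled = StandardScaler().fit_transform(X)

# Project the four features onto the first two principal components.
pca = PCA(n_components=2)
X_2d = pca.fit_transform(X_scaled)

print(X_2d.shape)
print(pca.explained_variance_ratio_)  # share of variance per component
```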

## K-Nearest Neighbors Algorithm in Iris flower data set [8 exercises with solution]

From Wikipedia,

In pattern recognition, the k-nearest neighbors algorithm (k-NN) is a non-parametric method used for classification and regression.[1] In both cases, the input consists of the k closest training examples in the feature space. The output depends on whether k-NN is used for classification or regression:

- In k-NN classification, the output is a class membership. An object is classified by a plurality vote of its neighbors, with the object being assigned to the class most common among its k nearest neighbors (k is a positive integer, typically small). If k = 1, then the object is simply assigned to the class of that single nearest neighbor.
- In k-NN regression, the output is the property value for the object. This value is the average of the values of k nearest neighbors.

Example of k-NN classification. The test sample (green dot) should be classified either to blue squares or to red triangles. If k = 3 (solid line circle) it is assigned to the red triangles because there are 2 triangles and only 1 square inside the inner circle. If k = 5 (dashed line circle) it is assigned to the blue squares (3 squares vs. 2 triangles inside the outer circle).
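The plurality vote described above can be sketched in a few lines of plain Python. The coordinates below are made up for illustration. They are chosen so that the three nearest neighbors of the query point are two triangles and one square, and the five nearest are three squares and two triangles, mirroring the example.

```python
from collections import Counter
from math import dist


def knn_classify(query, training_points, k):
    """Return the most common label among the k training points nearest to query."""
    nearest = sorted(training_points, key=lambda item: dist(query, item[0]))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical 2-D training data: (point, label) pairs.
training_points = [
    ((0.5, 0.0), "triangle"),
    ((0.0, 0.6), "triangle"),
    ((0.7, 0.0), "square"),
    ((1.0, 0.0), "square"),
    ((0.0, 1.1), "square"),
    ((3.0, 3.0), "triangle"),
    ((3.0, 0.0), "square"),
]
query = (0.0, 0.0)

print(knn_classify(query, training_points, k=3))  # 2 triangles vs 1 square
print(knn_classify(query, training_points, k=5))  # 3 squares vs 2 triangles
```

This sketch uses Euclidean distance and does not define a tie-breaking rule; Scikit-learn's `KNeighborsClassifier` is the practical choice for the exercises that follow.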

**1.** Write a Python program to split the iris dataset into its attributes (X) and labels (y). The X variable contains the first four columns (i.e. attributes) and y contains the labels of the dataset.

Click me to see the sample solution

**2.** Write a Python program using Scikit-learn to split the iris dataset into 70% train data and 30% test data. Out of the total of 150 records, the training set will contain 105 records and the test set will contain 45. Print both datasets.

Click me to see the sample solution

**3.** Write a Python program using Scikit-learn to convert the Species column of the iris DataFrame into a numerical column. To encode the data, map each value to a number, e.g. Iris-setosa:0, Iris-versicolor:1, and Iris-virginica:2. Now split the iris dataset into 80% train data and 20% test data. Out of the total of 150 records, the training set will contain 120 records and the test set will contain 30. Print both datasets.

Click me to see the sample solution
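One way to approach the encoding and the 80/20 split is sketched below. Since no CSV file is available here, the string labels are rebuilt from the bundled Scikit-learn data with an `Iris-` prefix; that construction, the use of `LabelEncoder`, and `random_state=0` are assumptions made for the example.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

iris = load_iris()

# Assumption: recreate CSV-style string labels such as 'Iris-setosa'.
species = [f"Iris-{iris.target_names[i]}" for i in iris.target]

# LabelEncoder assigns integers to the sorted class names.
encoder = LabelEncoder()
y = encoder.fit_transform(species)

# 80% train / 20% test; random_state only makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    iris.data, y, test_size=0.2, random_state=0
)

print(dict(zip(encoder.classes_, encoder.transform(encoder.classes_))))
print(X_train.shape, X_test.shape)
```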

**4.** Write a Python program using Scikit-learn to split the iris dataset into 70% train data and 30% test data. Out of total 150 records, the training set will contain 105 records and the test set contains 45 of those records. Predict the response for test dataset (SepalLengthCm, SepalWidthCm, PetalLengthCm, PetalWidthCm) using the K Nearest Neighbor Algorithm. Use 5 as number of neighbors.

Click me to see the sample solution
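A possible sketch of this workflow follows. It assumes the bundled Scikit-learn data in place of the CSV columns named above, and it adds `random_state=0` and a stratified split for repeatability; neither is required by the exercise.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 70% train / 30% test: 105 and 45 of the 150 records.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=0, stratify=y
)

# k-NN classifier with 5 neighbors, fitted on the training set only.
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train, y_train)

# Predict the species (encoded as 0, 1, 2) for the test set.
y_pred = knn.predict(X_test)

print(len(X_train), len(X_test))
print(y_pred)
```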

**5.** Write a Python program using Scikit-learn to split the iris dataset into 80% train data and 20% test data. Out of total 150 records, the training set will contain 120 records and the test set contains 30 of those records. Train or fit the data into the model and calculate the accuracy of the model using the K Nearest Neighbor Algorithm.

Click me to see the sample solution

**6.** Write a Python program using Scikit-learn to split the iris dataset into 80% train data and 20% test data. Out of the total of 150 records, the training set will contain 120 records and the test set will contain 30. Fit the model on the training data and, using the K Nearest Neighbor Algorithm, calculate the performance for different values of k.

Click me to see the sample solution
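Comparing values of k could look roughly like this. The range of k (1 to 15), the accuracy metric, and the fixed `random_state` are choices made for the sketch, not part of the exercise statement.

```python
from sklearn.datasets import load_iris
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# 80% train / 20% test: 120 and 30 of the 150 records.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit one classifier per k and record its accuracy on the test set.
accuracy_by_k = {}
for k in range(1, 16):
    model = KNeighborsClassifier(n_neighbors=k).fit(X_train, y_train)
    accuracy_by_k[k] = accuracy_score(y_test, model.predict(X_test))

for k, acc in accuracy_by_k.items():
    print(f"k={k:2d}  accuracy={acc:.3f}")
```

The keys and values of `accuracy_by_k` can be passed to a plotting library to draw k against accuracy.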

**7.** Write a Python program using Scikit-learn to split the iris dataset into 80% train data and 20% test data. Out of the total of 150 records, the training set will contain 120 records and the test set will contain 30. Fit the model on the training data using the K Nearest Neighbor Algorithm and create a plot to present the performance for different values of k.

Click me to see the sample solution

**8.** Write a Python program using Scikit-learn to split the iris dataset into 80% train data and 20% test data. Out of the total of 150 records, the training set will contain 120 records and the test set will contain 30. Fit the model on the training data using the K Nearest Neighbor Algorithm and create a plot of k values vs accuracy.

Click me to see the sample solution

## Logistic Regression in Scikit-learn [3 exercises with solution]

**1.** Write a Python program to view some basic statistical details like percentile, mean, std etc. of the species of 'Iris-setosa', 'Iris-versicolor' and 'Iris-virginica'.

Click me to see the sample solution

**2.** Write a Python program to create a scatter plot using sepal length and petal width to separate the Species classes.

Click me to see the sample solution

**3.** In statistical modeling, regression analysis is a set of statistical processes for estimating the relationships among variables. It includes many techniques for modeling and analyzing several variables, when the focus is on the relationship between a dependent variable and one or more independent variables (or 'predictors').

Write a Python program to get the accuracy of the Logistic Regression.

Click me to see the sample solution
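A minimal sketch of measuring accuracy is given below. It assumes the bundled Scikit-learn data, an 80/20 stratified split, and `max_iter=1000` so that the solver converges on the unscaled features; all three are illustrative choices.

```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)

# Hold out 20% of the records for evaluation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0, stratify=y
)

# Fit a logistic regression classifier on the training set.
clf = LogisticRegression(max_iter=1000)
clf.fit(X_train, y_train)

# score() returns the mean accuracy on the given test data and labels.
accuracy = clf.score(X_test, y_test)
print(f"accuracy: {accuracy:.3f}")
```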
