Overview

Skill Level: Intermediate

Qiskit itself comes into play only in Part 2; this first part approaches the problem classically.

We will see how a classical clustering problem can be mapped into the quantum domain while preserving its topological meaning. This is a two-part recipe we (Ennio Picarelli and myself) built. Let's start with the classical part.

Ingredients

You need the following:

  • a Python environment, either local (for example, Anaconda) or, better, a Watson Studio account. We assume you know the concept of a Jupyter notebook; the steps of this recipe are notebook cells
  • in Part 2 you will use Qiskit, i.e. the interface libraries to IBM Q on the cloud, so you will need access to IBM Q and to Qiskit (but you will see this in Part 2)
  • we assume you know the general concept of K-means clustering

The second part of this recipe is K-means clusterization algorithm with Quantum Circuit - Part 2 by Ennio Picarelli.

Data are available at: Link to Data in Google Drive

Step-by-step

  1. Let's have a look at the K-means algorithm

    After the linear regression covered in the previous notebook, we now deal with K-means. K-means is an unsupervised machine learning algorithm for identifying homogeneous groups in a data set. It is unsupervised because the training phase of the model does not involve labelled data: the groups are recognized autonomously. The fields of use are numerous, ranging from the recognition of similar behaviours in consumers (market segmentation) for designing custom marketing campaigns, to the grouping of documents that deal with the same topic (clusters), to the reduction of the size of image files by reducing the number of basic colours, and so on.

    In this work we will show some applications of K-means to the classification of data, starting from the classical libraries available in Python and arriving at a quantum circuit that promises a significant theoretical processing speed-up.

    In short, the algorithm identifies an optimal number of groups into which to divide the data, iteratively locates the centres (centroids) of these groups in the n-dimensional feature space, and at the same time assigns each data point to one of the groups. The algorithm is iterative: it starts from an initial random assignment of the centroids, groups the data by proximity to the centroids, recomputes each centroid as the mean point of the data belonging to its group, and starts again. When the positions of the centroids and the assignment of the data to the groups no longer change, the algorithm is assumed to have converged.

    The objective of K-means optimization is to minimize the quadratic error of the dispersion of the points with respect to the centroids, which is equivalent to minimizing the squared Euclidean distance of each point from its own centroid. In the following we will see some mathematical details of the algorithm.

    The right choice of the number K of centroids can be supported by a few tools; the most common is the Elbow method.

    In this method the cost J (the within-cluster sum of squared distances, often called WCSS) is calculated for various values of K, and the elbow point of the curve is chosen as the K value for the calculation (in the example curve the value is 2). A minimal sketch of the iteration and of J follows.
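
    Below is a minimal NumPy sketch, for illustration only, of one K-means run and of the cost J; it is not part of the original recipe (sklearn does the real work later), and X_demo, k and the number of iterations are arbitrary assumptions for the example.

    import numpy as np

    # Toy data and settings for the sketch (arbitrary assumptions)
    rng = np.random.default_rng(42)
    X_demo = rng.normal(size=(100, 2))   # 100 points, 2 features
    k = 2
    # random initial centroids chosen among the data points
    centroids = X_demo[rng.choice(len(X_demo), size=k, replace=False)]

    for _ in range(10):
        # assignment step: each point goes to its nearest centroid (Euclidean distance)
        distances = np.linalg.norm(X_demo[:, None, :] - centroids[None, :, :], axis=2)
        labels = distances.argmin(axis=1)
        # update step: each centroid becomes the mean of the points assigned to it
        centroids = np.array([X_demo[labels == j].mean(axis=0) if np.any(labels == j)
                              else centroids[j] for j in range(k)])

    # cost J (WCSS): sum of squared distances of each point from its own centroid
    J = ((X_demo - centroids[labels]) ** 2).sum()
    print('J =', J)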

  2. Example of classic clustering and quantum dataset setup

    Clustering is performed on sample data from a public Kaggle dataset related to market segmentation. Clustering is first performed on ordinary Cartesian coordinates: the two features are annual income and spending score. The Elbow method is used to determine the optimal number of clusters. The calculation is then repeated after transforming the dataset:

    Theta = arctan(annual income / spending score) * 10000

    (multiplied by 10,000 to improve accuracy; for instance, a customer with annual income 80 k$ and spending score 40 gets Theta = arctan(80/40) * 10000 ≈ 11071)

    Clustering is then performed with respect to the dimensions Customer ID and Theta.

    It is observed that both the number of clusters and the cluster assignments maintain their topological validity. Finally, 20 points are selected for the quantum calculation (only 20, to keep the processing time reasonable).

    #importing the libraries
    import sys
    import types
    import numpy as np # linear algebra
    import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
    import matplotlib.pyplot as plt # data visualization
    import seaborn as sns # Python library for visualization
    from botocore.client import Config
    import ibm_boto3

    # helper used by the Watson Studio generated snippet to make the "body" stream file-like
    def __iter__(self): return 0

    #reading the dataset
    #Insert HERE the code to read your dataset. With Watson Studio a specific precoded snippet is inserted
    #here, producing the "body" stream. You need to use the Mall_Customers.csv file from the Google Drive
    #reference in the Ingredients section.

    df_data_2 = pd.read_csv(body)
    dataset = df_data_2

    dataset.head(10) # prints the first 10 rows of the dataset
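
    If you are not running on Watson Studio, a minimal local alternative is simply to read the CSV directly; this is a sketch, and the file name/location below is an assumption (the file comes from the Google Drive link in the Ingredients section).

    # Local alternative to the Watson Studio snippet (assumes Mall_Customers.csv is in the working directory)
    df_data_2 = pd.read_csv('Mall_Customers.csv')
    dataset = df_data_2
    dataset.head(10)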

  3. Determination of the correlation between Spending Score and other parameters

    # Mapping the Gender column to numerical values
    dataset['Gender'] = dataset['Gender'].astype('category')
    dataset['Gender_codes'] = dataset['Gender'].cat.codes
    # Determining the pairplot matrix for correlation verification
    import seaborn as sns
    sns.set(style="ticks", color_codes=True)
    g = sns.pairplot(dataset)
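
    If, besides the pairplot, you also want the numeric correlations against Spending Score, here is a small optional sketch (it assumes the column names shown in the next step's output):

    # Numeric correlation of Spending Score with the other numeric features (optional sketch)
    numeric_cols = dataset.select_dtypes(include='number')
    print(numeric_cols.corr()['Spending Score_1_to_100'].sort_values())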

  4. Data verification and creation of the Theta column

    The most relevant features for clustering the data are Spending Score and Annual Income, and therefore we decide to derive a new feature related to the ratio between these two parameters. In particular, the derived feature will be the arctangent of the ratio.

    #total rows and columns in the dataset
    dataset.shape
    dataset.info() # there are no missing values, as all the columns have exactly 200 entries

    <class 'pandas.core.frame.DataFrame'>
    RangeIndex: 200 entries, 0 to 199
    Data columns (total 6 columns):
    CustomerID 200 non-null int64
    Gender 200 non-null category
    Age 200 non-null int64
    Annual Income_k$ 200 non-null int64
    Spending Score_1_to_100 200 non-null int64
    Gender_codes 200 non-null int8
    dtypes: category(1), int64(4), int8(1)
    memory usage: 6.8 KB

    #adding theta angle for only 2 features (Annual income and Spending Score)
    dataset['Teta']=np.arctan(dataset['Annual Income_k$'].values.astype(int)/
    dataset['Spending Score_1_to_100'].values.astype(float))*10000.
    dataset.head(10)
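
    As a side note, the ratio inside np.arctan would emit a divide-by-zero warning if any Spending Score were 0; that does not happen in this dataset, but a more defensive variant of the same transformation is sketched below (not used in the rest of the recipe).

    # Defensive variant: np.arctan2 handles a zero denominator gracefully (sketch only)
    dataset['Teta'] = np.arctan2(dataset['Annual Income_k$'].values.astype(float),
                                 dataset['Spending Score_1_to_100'].values.astype(float)) * 10000.
    dataset.head(10)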

  5. Data display of the model to be clustered

    ### Feature selection for the model
    #Considering only 2 features (Annual income and Spending Score) and no Label available
    X= dataset.iloc[:, [3,4]].values.astype(int)
    #Visualizes data

    plt.figure(figsize=(20,10))
    plt.scatter(X[:, 0], X[:, 1], s = 50, c = 'black', label = 'Sample')
    plt.title('Clusters of customers')
    plt.xlabel('Annual Income (k$)')
    plt.ylabel('Spending Score (1-100)')
    plt.legend()
    plt.grid()
    plt.show()

    IMG7

  6. Creating the model with sklearn

    First of all, we must find the right value of K.

    #Building the model
    #KMeans algorithm to decide the optimum cluster number (KMeans++), using the Elbow method
    #to figure out K for KMeans: I will use the Elbow method on the KMeans++ calculation
    from sklearn.cluster import KMeans
    wcss=[]
    #here we test up to 20 clusters
    #you can judge the number of clusters from the elbow of the resulting curve
    ###Static code to get max no of clusters

    for i in range(1,21):
        kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42, max_iter=10000, tol=0.000001)
        kmeans.fit(X)
        wcss.append(kmeans.inertia_)

    #inertia_ is the within-cluster sum of squared distances (WCSS) used to evaluate the clustering
    #Visualizing the ELBOW method to get the optimal value of K 
    plt.figure(figsize=(20,10))
    plt.plot(range(1,21), wcss)
    plt.title('The Elbow Method')
    plt.xlabel('no of clusters')
    plt.ylabel('wcss')
    plt.show()

    IMG8

  7. The elbow of the curve corresponds to 5 clusters, so we will build the model for K=5

    #If you zoom out on this curve you will see that the last elbow comes at k=5
    #no matter what range we select, e.g. (1,21); with a wider range we would see the same behaviour,
    #but it is a little more difficult to visualize the elbow
    #that is why we usually prefer the range (1,21)
    ##Finally we get k=5

    #Model Build
    kmeansmodel = KMeans(n_clusters= 5, init='k-means++', random_state=42,max_iter=100000,tol=0.00000001, verbose=0)
    y_kmeans= kmeansmodel.fit_predict(X)
    kmeans= kmeansmodel.fit(X)
    #showing the cluster to which the point belongs
    print('y_kmeans=',y_kmeans)
    #Centroid position
    centers_float = kmeans.cluster_centers_
    centers=np.rint(centers_float).astype(int)
    print('centers=',centers)

    y_kmeans= [2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2 3 2
    3 2 3 2 3 2 0 2 3 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
    0 0 0 0 0 0 0 0 0 0 0 0 4 1 4 0 4 1 4 1 4 0 4 1 4 1 4 1 4 1 4 0 4 1 4 1 4
    1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1 4 1
    4 1 4 1 4 1 4 1 4 1 4 1 4 1 4]
    centers= [[55 50] [88 17] [26 21] [26 79] [87 82]]

    #Visualizing all the clusters 

    plt.figure(figsize=(20,10))
    plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 50, c = 'red', label = 'Cluster 1')
    plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 50, c = 'blue', label = 'Cluster 2')
    plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 50, c = 'green', label = 'Cluster 3')
    plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 50, c = 'cyan', label = 'Cluster 4')
    plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 50, c = 'magenta', label = 'Cluster 5')
    plt.scatter(centers[:, 0], centers[:, 1], c='yellow', s=200, alpha=0.5,label = 'Centroids');
    plt.title('Clusters of customers')
    plt.xlabel('Annual Income (k$)')
    plt.ylabel('Spending Score (1-100)')
    plt.legend()
    plt.grid()
    plt.show()

    IMG9
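
    As a quick usage note (a sketch, not part of the original recipe): once kmeansmodel is fitted it can assign a cluster to a new customer directly; the income/score values below are arbitrary.

    # Assign a new point (annual income = 60 k$, spending score = 50) to one of the 5 clusters
    new_customer = np.array([[60, 50]])
    print('cluster of new customer =', kmeansmodel.predict(new_customer))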

  8. Model transformation by switching to Theta + customer ID

    Now we will show that the clusterization does not change if we substitute the previous features with Theta + Customer ID (a way to quantify this agreement is sketched at the end of this step).

    ### Feature selection for the model
    #Considering only 2 features (ID + Theta) and no Label available
    X= dataset.iloc[:, [0,6]].values.astype(int)
    #Visualizing data

    plt.figure(figsize=(20,10))
    plt.scatter(X[:, 0], X[:, 1], s = 50, c = 'black', label = 'Sample')
    plt.title('Clusters of customers (Teta Mode)')
    plt.xlabel('Customer')
    plt.ylabel('Teta')
    plt.legend()
    plt.grid()
    plt.show()

    IMG10

    #We can see that the elbow value for k is 5. We have the same behaviour even with a wider k range
    #Model Build
    kmeansmodel = KMeans(n_clusters= 5, init='k-means++', random_state=42,max_iter=200000,tol=0.00000001, verbose=0)
    y_kmeans= kmeansmodel.fit_predict(X)
    kmeans= kmeansmodel.fit(X)
    #shows to which cluster the point belongs
    print(y_kmeans)
    #centroids position
    centers_float = kmeans.cluster_centers_
    centers=np.rint(centers_float).astype(int)
    print(centers)

    [2 2 3 2 2 2 3 2 1 2 4 2 4 2 4 2 0 2 0 2 0 2 1 2 3 2 0 2 0 2 1 2 1 2 3 2 3
    2 4 2 0 2 0 0 4 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4 0 0
    4 0 0 4 0 4 0 0 4 4 0 4 0 0 0 4 0 4 4 4 4 4 4 4 4 4 4 4 0 4 4 4 4 4 4 4 4
    4 4 4 4 4 4 4 4 4 4 4 4 0 3 0 3 0 1 0 1 0 3 0 1 0 1 0 1 0 1 0 3 0 1 0 3 0
    3 0 1 0 1 0 1 0 1 0 1 0 3 0 1 0 3 4 1 0 3 4 1 4 1 0 1 0 1 4 1 0 3 4 1 4 3
    0 1 4 1 4 1 4 1 4 1 4 1 4 1 4]
    [[ 93 7355] [ 147 14276] [ 20 3126] [ 112 11828] [ 114 9195]]

    #Visualizing all the clusters 

    plt.figure(figsize=(20,10))
    plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 50, c = 'red', label = 'Cluster 1')
    plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 50, c = 'blue', label = 'Cluster 2')
    plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 50, c = 'green', label = 'Cluster 3')
    plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 50, c = 'cyan', label = 'Cluster 4')
    plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 50, c = 'magenta', label = 'Cluster 5')
    plt.scatter(centers[:, 0], centers[:, 1], c='yellow', s=200, alpha=0.5,label = 'Centroids');
    plt.title('Clusters of customers Teta Mode')
    plt.xlabel('Customer')
    plt.ylabel('Teta')
    plt.legend()
    plt.grid()
    plt.show()

    IMG11
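
    To quantify the claim that the two feature spaces lead to the same grouping, one option (a sketch, not part of the original recipe) is to compare the two label vectors with the adjusted Rand index. The names y_kmeans_xy and y_kmeans_theta below are assumptions, standing for the labels from step 7 and from this step respectively (in the notebook as written y_kmeans is overwritten, so you would keep a copy of each).

    from sklearn.metrics import adjusted_rand_score

    # y_kmeans_xy: labels from the (Annual Income, Spending Score) model of step 7
    # y_kmeans_theta: labels from the (Customer ID, Teta) model of this step
    # A score of 1.0 means identical clusterings (up to relabelling of the cluster ids)
    print('agreement =', adjusted_rand_score(y_kmeans_xy, y_kmeans_theta))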

  9. Reduced data loading for IBM Q simulation

    In order to perform, in the next part of the recipe, the calculation with a quantum circuit as well, we need to reduce the number of data points.


    #Insert HERE the code to read your dataset. With Watson Studio a specific precoded snippet is inserted
    #here, producing the "body" stream. You need to use the DataForQComparison.csv file from the Google Drive
    #reference in the Ingredients section.

    df_data_3 = pd.read_csv(body)
    dataset=df_data_3
    #add teta angle for only 2 features (Annual income and Spending Score)
    dataset['Teta']=np.arctan(dataset['Annual Income_k$'].values.astype(int)/dataset['Spending Score_1_to_100'].values.astype(int))*10000
    dataset['ID']=dataset.reset_index().index
    dataset.head(10)

    IMG12


  10. Clustering on reduced data and standard features

    ### Feature selection for the model
    #Considering only 2 features (Annual income and Spending Score) and no Label available
    X= dataset.iloc[:, [0,1]].values.astype(int)
    #Visualizing data

    plt.figure(figsize=(20,10))
    plt.scatter(X[:, 0], X[:, 1], s = 50, c = 'black', label = 'Sample')
    plt.title('Clusters of customers')
    plt.xlabel('Annual Income (k$)')
    plt.ylabel('Spending Score (1-100)')
    plt.legend()
    plt.grid()
    plt.show()

    IMG13

  11. We go directly to the clustering calculation on Theta and Customer ID

    ### Feature selection for the model
    #Considering only 2 features (ID + Theta) and no Label available
    X= dataset.iloc[:, [3,2]].values.astype(float)
    #Visualizing data

    plt.figure(figsize=(20,10))
    plt.scatter(X[:, 0], X[:, 1], s = 50, c = 'black', label = 'Sample')
    plt.title('Clusters of customers (Teta Mode)')
    plt.xlabel('Customer')
    plt.ylabel('Teta')
    plt.legend()
    plt.grid()
    plt.show()

    IMG14

    #Building the model
    #KMeans algorithm to decide the optimum cluster number (KMeans++), using the Elbow method
    #to figure out K for KMeans: I will use the Elbow method on the KMeans++ calculation
    from sklearn.cluster import KMeans
    wcss=[]
    #here we test up to 20 clusters
    #you can judge the number of clusters from the elbow of the resulting curve
    ###Static code to get max no of clusters

    for i in range(1,21):
        kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42, max_iter=10000, tol=0.000001)
        kmeans.fit(X)
        wcss.append(kmeans.inertia_)

    #inertia_ is the within-cluster sum of squared distances (WCSS) used to evaluate the clustering
    #Visualizing the ELBOW method to get the optimal value of K 
    plt.figure(figsize=(20,10))
    plt.plot(range(1,21), wcss)
    plt.title('The Elbow Method')
    plt.xlabel('no of clusters')
    plt.ylabel('wcss')
    plt.show()

    IMG15

    #If you zoom out on this curve you will see that the last elbow comes at k=5
    #no matter what range we select, e.g. (1,21); with a wider range we would see the same behaviour,
    #but it is a little more difficult to visualize the elbow
    #that is why we usually prefer the range (1,21)
    #Finally we get k=5
    #
    #Model Build
    kmeansmodel = KMeans(n_clusters= 5, init='k-means++', random_state=42,max_iter=100000,tol=0.00000001, verbose=0)
    y_kmeans= kmeansmodel.fit_predict(X)
    kmeans= kmeansmodel.fit(X)
    #shows to which cluster point belongs
    print(y_kmeans)

    [1 1 1 3 3 0 0 2 4 4 4 0 4 4 4 0 2 2 2 2]

    #centroid position
    centers_float = kmeans.cluster_centers_
    centers=np.rint(centers_float).astype(int)
    print(centers)

    [[ 9 9932] [ 1 2731] [ 15 13673] [ 4 4881] [ 11 7679]]

    #Visualizing all the clusters 

    plt.figure(figsize=(20,10))
    plt.scatter(X[y_kmeans == 0, 0], X[y_kmeans == 0, 1], s = 50, c = 'red', label = 'Cluster 1')
    plt.scatter(X[y_kmeans == 1, 0], X[y_kmeans == 1, 1], s = 50, c = 'blue', label = 'Cluster 2')
    plt.scatter(X[y_kmeans == 2, 0], X[y_kmeans == 2, 1], s = 50, c = 'green', label = 'Cluster 3')
    plt.scatter(X[y_kmeans == 3, 0], X[y_kmeans == 3, 1], s = 50, c = 'cyan', label = 'Cluster 4')
    plt.scatter(X[y_kmeans == 4, 0], X[y_kmeans == 4, 1], s = 50, c = 'magenta', label = 'Cluster 5')
    plt.scatter(centers[:, 0], centers[:, 1], c='yellow', s=200, alpha=0.5,label = 'Centroids');
    plt.title('Clusters of customers Teta Mode')
    plt.xlabel('Customer')
    plt.ylabel('Teta')
    plt.legend()
    plt.grid()
    plt.show()

    IMG16
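
    To carry the reduced dataset, its classical labels and the centroids over to Part 2, one simple option (a sketch; the file names are arbitrary assumptions) is to save them from the notebook:

    # Persist the 20-point dataset with its classical cluster labels, plus the centroids
    dataset_out = dataset.copy()
    dataset_out['classical_cluster'] = y_kmeans
    dataset_out.to_csv('reduced_dataset_with_clusters.csv', index=False)
    np.savetxt('classical_centroids.csv', centers, delimiter=',', fmt='%d')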

  12. Conclusion

    We have seen how to change the variable (feature) domain to adapt it to a more "quantum aware" context, i.e. a polar domain instead of a Cartesian one, while maintaining the general topology of the clusterization.

    In Part 2 we will explore how to apply a quantum circuit to the K-means problem.
