Data and prediction through classification

A key value you can extract from raw data is the ability to build a model that you can use for prediction or classification. Applying this model, you can translate a new observation into a predicted outcome or class, which has clear practical value. This tutorial builds on the concepts introduced in Part 1, “Data, structure, and the data science pipeline,” and explores two methods for classification (a form of prediction): one that uses supervised learning and one that uses unsupervised learning.

Classification is a common application of machine-learning algorithms. You can implement prediction and classification by using supervised learning, where you use prior observations to build a model that predicts an outcome based on unseen observations. You can also implement it by using unsupervised learning, where you cluster observations into sets to predict the behavior of new observations based on the cluster to which they are closest.

Machine learning has been applied successfully to many prediction and classification problems, including flight delays, credit scores, and stock prices. Here, I explore two interesting algorithms: probabilistic neural networks and density-based clustering (DBSCAN).

Probabilistic neural networks

Probabilistic neural networks (PNNs) were created in 1966 and are similar to the back-propagation neural network I discussed in “A neural networks deep dive.” Both are multilayer, feed-forward neural networks, but rather than relying on the back propagation of errors, PNNs use a different training method. You can use PNNs for pattern recognition, clustering, classification, and prediction.

PNN architecture

The PNN is a supervised learning approach, so it relies on a training set with labels (a class identified for each observation). The network is made up of four layers, as shown below.

Figure 1. PNN architecture
Line diagram of a typical PNN architecture

The input layer represents the input vector, with one dimension per feature of the problem. The pattern layer, also called the hidden layer, consists of one neuron for each observation in the training data set. Each pattern neuron calculates the Euclidean distance between the input feature vector and the training sample that the neuron represents. The summation layer consists of one neuron for each class that the data set represents. As the name implies, each neuron in the summation layer sums the outputs of the pattern neurons for the class it represents (and averages the result over the number of observations for that class). Finally, the output layer, called the decision layer, implements a winner-takes-all approach: It identifies the summation-layer neuron with the largest value, and the class that neuron represents is the predicted class for the input vector.

PNN algorithm

The algorithm for training and using a PNN is simple, as you’ll soon see in a bit of code. No separate training phase is required because computing a prediction simply iterates over the entire data set to calculate the neurons in the summation layer. Contrast this with traditional back-propagation, where the weights are adjusted for each training observation. As a result, a back-propagation network can be faster at prediction time but might take longer to train (as a function of the size of the data set).

The process begins with an example feature vector (representing the observation whose class you want to predict) and the data set of labeled observations. For each neuron in the summation layer (that is, for each class), you iterate the data set and calculate the sum of squared differences between the example feature vector and each data set observation of that class, then apply a smoothing factor. When every summation neuron has been calculated from its class’s observations, the largest value is chosen in a winner-takes-all fashion. That neuron’s class is the predicted class for the example.
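
Written as a formula (this is my rendering of the calculation in the code below, where $\sigma$ corresponds to the SIGMA smoothing constant), the summation neuron for class $c$ computes

$$f_c(\mathbf{x}) = \frac{1}{n_c} \sum_{i=1}^{n_c} \exp\!\left(-\frac{\lVert \mathbf{x} - \mathbf{x}_{c,i} \rVert^2}{2\sigma^2}\right)$$

where $\mathbf{x}$ is the example feature vector, $n_c$ is the number of training observations labeled with class $c$, and $\mathbf{x}_{c,i}$ is the $i$th of those observations. The decision layer then simply picks the class with the largest $f_c(\mathbf{x})$.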

PNN implementation

You can find the full implementation of this PNN on GitHub. Listing 1 provides the classifier function, which accepts an example observation and returns the class to which it should belong (the prediction). As discussed earlier, this code works from the summation layer: For each class, it calculates the summation neuron over the data set observations belonging to that class, using the example observation. The summation neurons are then passed to a function called winner_takes_all, which simply determines which neuron has the largest value (the winner).

Listing 01. PNN classifier code in C

int pnn_classifier( dataset_t example )
{
   int class, observation, feature, class_observations;
   double product;
   double summation[CLASSES];

   // Iterate the neurons in the summation layer.
   for ( class = 1 ; class <= CLASSES ; class++ )
   {
      summation[class-1] = 0.0;
      class_observations = 0;

      // Iterate the dataset observations (pattern layer).
      for ( observation = 0 ; observation < OBSERVATIONS ; observation++ )
      {
         product = 0.0;

         // Train the class output node based upon samples for this class.
         if ( dataset[observation].class == class )
         {
            class_observations++;

            // Square product of the example vector and the input feature vector
            for ( feature = 0 ; feature < FEATURES ; feature++ ) {
              product += SQR( ( example.features[feature] -
                                dataset[observation].features[feature] ) );
            }

            // Apply a smoothing factor and sum into the class neuron.
            summation[class-1] += exp( -( product / ( 2.0 * SQR( SIGMA ) ) ) );
         }

      }

      summation[class-1] /= (double)class_observations;
   }

   return winner_takes_all( summation );
}
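
The winner_takes_all function isn’t shown here (you can find it in the GitHub repository), but the idea is simply an argmax over the summation array. The following is a minimal sketch, assuming the same CLASSES constant and 1-based class numbering as above; the repository’s version may differ in detail.

// Return the class (1..CLASSES) whose summation neuron has the largest value.
int winner_takes_all( double *summation )
{
   int best_class = 1;

   for ( int class = 2 ; class <= CLASSES ; class++ )
   {
      if ( summation[ class-1 ] > summation[ best_class-1 ] )
      {
         best_class = class;
      }
   }

   return best_class;
}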
    

Now you can see the PNN in action when I provide it with some data for prediction. In this case, I return to my zoo data set, which describes a set of animals by their features (hair, feathers, lays eggs, airborne, and so on). I’ll construct a few sample observations to see what class the algorithm predicts (mammal, bug, amphibian, and so on). Listing 2 provides the metadata for the data set (the feature vector description and the class description).

Listing 02. Zoo data set metadata

// Features are:
//    [ 0] hair
//    [ 1] feathers
//    [ 2] eggs
//    [ 3] milk
//    [ 4] airborne
//    [ 5] aquatic
//    [ 6] predator
//    [ 7] toothed
//    [ 8] backbone
//    [ 9] breathes
//    [10] venomous
//    [11] fins
//    [12] legs
//    [13] tail
//    [14] domestic
//    [15] catsize

// Classes are [1] Mammal, [2] Bird, [3] Reptile, [4] Fish
//             [5] Amphibian, [6] Bug, [7] Invertebrate.
    

Listing 3 shows sample code that builds example feature vectors, along with output illustrating the class predictions from the PNN classifier. This output shows that a flea with teeth is still a bug; a predator with hair is a mammal; and an aquatic, egg-laying, six-legged creature is an invertebrate.

Listing 03. Example feature vectors and prediction

dataset_t example1 = {"flea with teeth",
                      {0,0,1,0,0,0,0,1,0,1,0,0,6,0,0,0}, 0};
printf("Prediction %d for %s\n", pnn_classifier( example1 ), example1.name );

dataset_t example2 = {"predator with hair",
                      {0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0}, 0};
printf("Prediction %d for %s\n", pnn_classifier( example2 ), example2.name );

dataset_t example3 = {"six legged aquatic egg layer",
                      {0,0,1,0,0,1,0,0,0,0,0,0,6,0,0,0}, 0};
printf("Prediction %d for %s\n", pnn_classifier( example3 ), example3.name );

...

Classification 6 for flea with teeth
Classification 1 for predator with hair
Classification 7 for six legged aquatic egg layer
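
For reference, the examples above assume an observation structure shaped roughly like the following. This is my reconstruction from the listings; the exact definition lives in the GitHub repository and may differ.

#define FEATURES 16   // number of features per observation (see Listing 2)

// Reconstructed observation type: a name, a feature vector, and a class label.
typedef struct dataset_t
{
   char  *name;                  // descriptive name of the observation
   double features[ FEATURES ];  // feature vector, ordered as in Listing 2
   int    class;                 // class label, 1..7 (0 when not yet known)
} dataset_t;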
    

Density-based clustering

DBSCAN (density-based spatial clustering of applications with noise) was created in 1996 and is a straightforward approach to clustering that can automatically determine the number of natural clusters. The approach is robust against outlier samples (which DBSCAN calls noise). You can learn more about clustering algorithms in “Unsupervised learning for data classification.” DBSCAN, which was originally defined for use with databases, is well known for identifying irregularly sized and shaped clusters, and it works on data that _k_-means is unable to cluster satisfactorily.

DBSCAN is unsupervised and attempts to provide structure to data by using the data’s hidden features. It uses two parameters: a minimum number of points (MINPTS) and a neighborhood distance (epsilon), as shown below.
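
In a C implementation, these parameters can be simple constants. The values shown here are assumptions taken from the zoo data set example later in this tutorial; any real use requires tuning them for the data at hand.

#define MINPTS  4     // minimum neighbors required for a core point
#define EPSILON 1.7   // neighborhood radius used by the distance test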

DBSCAN algorithm

DBSCAN operates on points in a multidimensional space. Each point is represented by a feature vector, and given two vectors, you can use a metric to calculate the distance between them (such as Euclidean distance).
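
For example, a Euclidean distance over the feature vectors might look like the following sketch (it reuses the dataset_t shape and FEATURES constant assumed earlier; the repository’s helper may differ).

#include <math.h>

// Euclidean distance between the feature vectors of two observations.
double distance( dataset_t *a, dataset_t *b )
{
   double sum = 0.0;

   for ( int feature = 0 ; feature < FEATURES ; feature++ )
   {
      double diff = a->features[ feature ] - b->features[ feature ];
      sum += diff * diff;
   }

   return sqrt( sum );
}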

The algorithm begins with a point, and you identify the number of neighbors within a given distance (called epsilon) of that point. If the number of neighbors meets or exceeds a threshold (called minimum points, or MINPTS), you call this point a core point and associate it with a new cluster; otherwise, the point is labeled as noise. For each neighboring point (called a density-reachable point), you repeat this operation, adding points to the cluster as long as they continue to meet the criteria (at least MINPTS points within epsilon distance). If you reach a point that doesn’t meet the density criteria itself, that point becomes part of the cluster but isn’t used to incorporate new points.

Any points that don’t meet the criteria (MINPTS within epsilon distance) are defined as noise, although they can eventually be moved into a cluster if they’re found to be density-reachable from another point in a cluster.

Look at this process in the context of Figure 2. I begin in the left image (titled Start), where I pick a point and check its neighborhood. With MINPTS == 2, my point satisfies this criterion and is added to the cluster. The two neighboring (density-reachable) points are added for testing, and the process continues. As long as each point has at least two neighbors within epsilon distance, it is added to the cluster and called a core point. Points that are density-reachable but don’t meet the density criteria themselves are non-core points (for example, the blue point). The algorithm finishes at the right side (titled End). The green points are all part of the cluster (core points), along with the blue point (which is a non-core point but is density-reachable). The red point at the upper right is an outlier and is considered noise.

Figure 2. DBSCAN process from beginning to end
Diagram with colored points to show the DBSCAN process

Next, I take you into the core of a DBSCAN implementation.

DBSCAN implementation

You can find this implementation of DBSCAN on GitHub. Here, I discuss the two key functions that implement the core of the algorithm. The remaining functions handle distance calculation and neighbor memory management; a sketch of that bookkeeping follows.
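
The following is my own sketch of the neighbor bookkeeping, assuming a neighbors_t that stores one flag per observation plus a count (the repository’s definitions may differ); OBSERVATIONS is the number of observations in the data set.

#include <stdlib.h>

// One flag per observation plus a running count of the neighbors found.
typedef struct neighbors_t
{
   int neighbor[ OBSERVATIONS ];   // 1 if the observation is within EPSILON
   int neighbor_count;             // number of neighbors found
} neighbors_t;

// Find all observations within EPSILON of the given observation.
neighbors_t *find_neighbors( int observation )
{
   neighbors_t *neighbors = calloc( 1, sizeof( neighbors_t ) );

   for ( int i = 0 ; i < OBSERVATIONS ; i++ )
   {
      if ( i == observation ) continue;

      if ( distance( &dataset[ observation ], &dataset[ i ] ) <= EPSILON )
      {
         neighbors->neighbor[ i ] = 1;
         neighbors->neighbor_count++;
      }
   }

   return neighbors;
}

// Merge newly found neighbors into the seed set.
void fold_neighbors( neighbors_t *seed_set, neighbors_t *neighbors )
{
   for ( int i = 0 ; i < OBSERVATIONS ; i++ )
   {
      if ( neighbors->neighbor[ i ] ) seed_set->neighbor[ i ] = 1;
   }
}

// Release a neighbors_t allocated by find_neighbors.
void free_neighbors( neighbors_t *neighbors )
{
   free( neighbors );
}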

The dbscan function (see Listing 4) implements the outer loop of the clustering algorithm. It iterates through the data set and determines whether each observation satisfies the density function (at least MINPTS observations within epsilon distance). If not, the observation is labeled as noise; otherwise, a new cluster is created and its neighbors are processed.

Listing 04. The main dbscan function

int dbscan( void )
{
   int cluster = 0;

   for ( int i = 0 ; i < OBSERVATIONS ; i++ )
   {
      if ( dataset[ i ].label != UNDEFINED ) continue;

      neighbors_t *neighbors = find_neighbors( i );

      if ( neighbors->neighbor_count < MINPTS )
      {
         dataset[ i ].label = NOISE;
         free_neighbors( neighbors );
         continue;
      }

      // Create a new cluster.
      dataset[ i ].label = ++cluster;

      process_neighbors( i, neighbors );

      free_neighbors( neighbors );
   }

   return cluster;
}
    

The process_neighbors function (see Listing 5) builds a cluster around a point that satisfies the density constraint. It iterates the list of neighbors (called the seed set) to see whether they fit within the current cluster. Observations previously labeled as noise are pulled into the cluster, and observations that already belong to a cluster are ignored. Each remaining observation is then checked for neighbors of its own; if it satisfies the density criteria, its neighbors are folded into the seed set for inspection within this cluster. The process continues until no new observations are added to the seed set and every observation in the seed set has been checked. Upon return, the dbscan function picks the next unvisited observation and continues the process (potentially creating new clusters).

Listing 05. The process_neighbors function that follows density-reachable points

void process_neighbors( int initial_point, neighbors_t *seed_set )
{
   // Process every member in the seed set.
   for ( int i = 0 ; i < OBSERVATIONS ; i++ )
   {
      // Is this a neighbor?
      if ( seed_set->neighbor[ i ] )
      {
         seed_set->neighbor[ i ] = 0;

         if ( dataset[ i ].label == NOISE )
         {
            dataset[ i ].label = dataset[ initial_point ].label;
         }
         else if ( dataset[ i ].label != UNDEFINED )
         {
            continue;
         }

         dataset[ i ].label = dataset[ initial_point ].label;

         neighbors_t *neighbors = find_neighbors( i );

         if ( neighbors->neighbor_count >= MINPTS )
         {
            // Fold new neighbors into the seed_set.
            fold_neighbors( seed_set, neighbors );
            i = -1;   // Restart the scan so newly folded neighbors are checked.
         }
         free_neighbors( neighbors );
      }
   }

   return;
}
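
Putting it together, a small driver might run the clusterer and report the assignments. The following is a sketch that assumes the DBSCAN data set also carries a name for each observation (as the PNN data set did) and that NOISE is the label used for outliers.

#include <stdio.h>

int main( void )
{
   // Cluster the data set and report how each observation was labeled.
   int clusters = dbscan( );

   printf( "Found %d clusters\n", clusters );

   for ( int i = 0 ; i < OBSERVATIONS ; i++ )
   {
      if ( dataset[ i ].label == NOISE )
      {
         printf( "%s -> noise\n", dataset[ i ].name );
      }
      else
      {
         printf( "%s -> cluster %d\n", dataset[ i ].name, dataset[ i ].label );
      }
   }

   return 0;
}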
    

In fewer than 50 lines of core C code, I have a simple clustering solution with significant advantages over more traditional approaches. Consider DBSCAN applied to the prior zoo data set: DBSCAN must be tuned to pick the right MINPTS and epsilon values. In this example, I chose an epsilon of 1.7 and a MINPTS of 4. With these parameters, DBSCAN creates six clusters, with 14 observations labeled as noise and an overall accuracy of 74 percent. It perfectly clustered the bird and fish categories but split the mammals into one large and one small cluster.

Using the prior test cases, it clustered both the “flea with teeth” and the “six-legged aquatic egg layer” into the bug category and placed the “predator with hair” into noise. Tuning the epsilon distance and MINPTS parameters might improve the accuracy on this data set.

Going further

Prediction and classification are two important aspects of machine learning that have a multitude of applications. From predicting the behavior of consumers (based on their classification into a class of similar consumers) to predicting the risk of an insurance policy (based on features describing the policy and applicant), prediction through classification is a real-world example of machine learning in action.