[nolan@nprescott.com] $  cat weblog archive feed

K-Means Clustering in Klong

2019-09-01

As I continue to spiral out of control with my exploration of array programming I've recreated some rudimentary "unsupervised machine learning" from scratch.

The Goal

Because I am not an expert in either K-Means Clustering or Klong, I thought to start with at least one known quantity, the data set. I'll be using Fisher's iris data because it is so widely known and studied. I thought (correctly!) that I might need to fall back to checking my work before moving on to better or more interesting instances of clustering.

The Data

Funnily enough, the data is not a good candidate for K-Means clustering, as described here:

The use of this data set in cluster analysis however is not common, since the data set only contains two clusters with rather obvious separation. One of the clusters contains Iris setosa, while the other cluster contains both Iris virginica and Iris versicolor and is not separable without the species information Fisher used. This makes the data set a good example to explain the difference between supervised and unsupervised techniques in data mining:

I will be using it due to the number of reference implemenations using it.

About the data itself, the numeric values are described as:

Data Preparation

While there is technically a CSV reader in Klong's "standard library", I couldn't work out the necessary incantation to read anything but string values. Instead I wound up with a hack-y solution below to pre-prepare the data in a compatible format:

tr ',' ' ' <iris.csv \
    | cut -d' ' -f1-4 \
    | sed -e '1d' -e 's/\(.*\)/\[\1\]/' -e '1i[' -e '$a]' > iris.dat

I think this is the first real instance in which I've experienced the limitations of using Klong rather than a more widely-used language like J or perhaps APL. Similarly, even as small as it is, the iris data has proven a little taxing on compute time.

Implementation

First, and most boring, is to read the entire pre-prepared data set into an array I'll call data. Klong specifies a "from-channel" .fc using a given "in-channel" .ic, which I've parameterized to take a filename and then just "read" .r:

readfile::{.fc(.ic(x));.r()}
data::readfile("iris.dat")

In order to initialize a cluster somewhere, I need a verb to get a random value from the data. To accomplish this, I'll define randint, which just takes the floor _ of some x multiplied by the random number verb .rn. To pick a value from the data, I can take a random integer in the range (or "tally" #) of data and "pick" (@) the row at that value. I'll do this a given number of times ('&) to parameterize the number of centroids.

randint::{_x*.rn()}
init::{[data];data::x; {data@x}'{x;randint(#data)}'&y}

Using it is as straight-forward as:

        data init 3
[[5.7 2.5 5.0 2.0]
 [6.2 3.4 5.4 2.3]
 [6.0 3.0 4.8 1.8]]

Fundamentally K-Means works by placing some random points for initial cluster positions; from there it takes the distance from each other point and adjusts the "center" of the cluster to try and minimize that distance. So first things first, I need a way to measure distance and the most common method is Euclidean distance:

distance::{(+/((x-y)^2))^0.5}

If you remember back to geometry, a "3-4-5" triangle might have a length of 3, a height of 4 and an hypotenuse of 5 — meaning the distance from point (0,0) to (3,4) is 5. To be sure, we can verify our distance verb:

        distance@[[0 0] [3 4]]
5.0

In order to get a distance value for each centroid (in the case of the Iris data set there are 3) I need to take the distance for each (') x in data using the transposed (+) centroids:

distances::{[a];a::y; distance(;+a)'x}

The transpose is necessary to match the shape of the matrices being operated on. The effect is something like this:

        centroids::data init 3
 4.6 3.4 1.4 0.3
 6.4 2.7 5.3 1.9
 7.7 2.8 6.7 2.0

        +centroids
 4.6 6.4 7.7
 3.4 2.7 2.8
 1.4 5.3 6.7
 0.3 1.9 2.0

So that in applying the distance for each row in the data, the result is:

        distance(;[[4.6 6.4 7.7] [3.4 2.7 2.8] [1.4 5.3 6.7] [0.3 1.9 2.0]]) ' data
 0.519615242270663185  4.51995575199580908  6.21128006130781385
  0.50990195135927848  4.52106182218292609   6.2617888817813077
 0.264575131106459059  4.69361268108053415   6.4467045845144789
 0.331662479035539985  4.55302097513288423  6.32297398381489465
    0.458257569495584   4.5683695122001679  6.26578007912821655
  0.99498743710661995  4.20119030752000093  5.83609458456594925
                    0  4.63680924774785187   6.3992187023104625
 0.424264068711928514  4.44747119158741713   6.1587336360651286
 ...
 (150 rows)

With a distance to each centroid established it is time to establish which cluster a row belongs to. This is done by taking the minimum-over (&/) each row in the distance result and checking for equality to "find" which was matched (as the minimum):

        cluster::{x=&/'x}

        cluster(data distances centroids)
 1 0 0
 1 0 0
 1 0 0
 1 0 0
 1 0 0
 ...

In the above example, all of the first 5 rows belong to the first cluster with the centroids provided.

To collect the cluster identifier from the cluster arrays show above I define a verb group, which "finds" (?) the first (*) 1 in each cluster:

        group::{{*x?1}'cluster(x distances y)}

        data group centroids
[0 0 0 0 0 0 0 0 0 0 0 0 ...

This alone isn't very interesting, especially in the above case, where all of the leading elements belong to the first cluster. It becomes more interesting when applied with an aggregation for each cluster value. Because centroids is itself an array, the rank (#) of it can be enumerated (!):

        !#centroids
[0 1 2]

Which, when combined with with a "find" (?), results in an aggregation of per-centroid grouping:

        {(data group centroids)?x}'!#centroids
[[0 1 2 3 4 5 6 7 8 9 10 ...
 [50 51 52 53 54 55 56 58 ...
 [105 107 109 117 118 122 ...

Of course, the grouping is only an index position into the original data array, so it makes sense to combine it with a "pick" (@) to get the actual data per-group:

        data@{(data group centroids)?x}'!#centroids
[[[5.1 3.5 1.4 0.2] [4.9 3.0 1.4 0.2] [4.7 3.2 1.3 0.2] ...
 [[7.0 3.2 4.7 1.4] [6.4 3.2 4.5 1.5] [6.9 3.1 4.9 1.5] ...
 [[7.6 3.0 6.6 2.1] [7.3 2.9 6.3 1.8] [7.2 3.6 6.1 2.5] ...

The final piece to explain is probably the easiest, K-Means is a minimization problem, and in order to have a single "thing" to minimize, we need to take the mean of each grouping in order to finish a single step in the process. Mean is defined as the sum of an array divided by the number of elements in the array:

        mean::{(+/x)%#x}

        mean([0 1 2 3 4 5 6 7 8 9 10])
5.0

I've combined the last few pieces into a single verb means to simplify the interface, specifically around applying the mean to each centroid group, and to make the final step as easy as possible:

        means::{[data centroids center]; data::x; centroids::y;
                 center::data@{(data group centroids)?x}'!#centroids;
                 mean'center}
          
        means(data;centroids)
+---------------------+---------------------+---------------------+----------------------+
| 5.00566037735849057 | 3.36037735849056604 | 1.56226415094339622 | 0.288679245283018868 |
| 6.15862068965517241 | 2.85632183908045977 | 4.79425287356321839 |  1.65057471264367816 |
|                7.54 |                3.15 |                6.39 |                 2.09 |
+---------------------+---------------------+---------------------+----------------------+

K-Means works by iteratively re-evaluating the distance and adjusting the centroids for the next iteration. The calculation is "done" when the adjustments are sufficiently small, which is in effect a fixed-point calculation. It turns out Klong has a fixed-point operation, termed convergence (:~). Using it requires a function (verb) of one argument (a monad), which is the value to be re-supplied after evaluating, to check for fixedness. In the case of means, the two arguments are "data", which does not change each time it is evaluated, and "centroids" which do. In order to converge the means, I can partially apply the "data" argument and converge over subsequent centroids:

means(data;):~data init k

In the above example, k would be the desired number of centroids for a set of data. For the iris data set I get the following:

        means(data;):~data init 3
+---------------------+---------------------+---------------------+---------------------+
|                6.85 | 3.07368421052631579 | 5.74210526315789474 | 2.07105263157894736 |
| 5.90161290322580645 | 2.74838709677419355 | 4.39354838709677419 | 1.43387096774193548 |
|               5.006 |               3.418 |               1.464 |               0.244 |
+---------------------+---------------------+---------------------+---------------------+

Viewing Cluster Centers

Of course, the above a still a bit abstract and so I couldn't resist visualizing the identified clusters:

First, from a two-dimensional projection, where the black circles are the raw data and the red dots are the identified centers:

And again with a three-dimensional projection to better visualize at least some of the extra dimensions:

Logical Next Steps

I've already been asked "Wouldn't it be better if it could find the right number of clusters itself?", which is exactly what the elbow method achieves. There is also the "curse of dimensionality" to consider, which begs for an implementation of principal component analysis or similar.

The Full Program

  readfile::{.fc(.ic(x));.r()}
  data::readfile("iris.dat")

  randint::{_x*.rn()}
  init::{[data];data::x; {data@x}'{x;randint(#data)}'&y}
  distance::{(+/((x-y)^2))^0.5}
  distances::{[a];a::y; distance(;+a)'x}
  cluster::{x=&/'x}
  group::{{*x?1}'cluster(x distances y)}
  mean::{(+/x)%#x}
  :"means(DATA;):~DATA init K"
  means::{[data centroids center]; data::x; centroids::y;
          center::data@{(data group centroids)?x}'!#centroids;
          mean'center}