As I continue to spiral out of control with my exploration of array programming, I've recreated some rudimentary "unsupervised machine learning" from scratch.
Because I am not an expert in either K-Means clustering or Klong, I thought I'd start with at least one known quantity: the data set. I'll be using Fisher's iris data because it is so widely known and studied. I thought (correctly!) that I might need to fall back to checking my work before moving on to better or more interesting instances of clustering.
Funnily enough, the data is not a good candidate for K-Means clustering, as described here:
The use of this data set in cluster analysis however is not common, since the data set only contains two clusters with rather obvious separation. One of the clusters contains Iris setosa, while the other cluster contains both Iris virginica and Iris versicolor and is not separable without the species information Fisher used. This makes the data set a good example to explain the difference between supervised and unsupervised techniques in data mining:
I will be using it anyway, due to the number of reference implementations available to check against.
As for the data itself, it is described as:
- Number of Instances: 150 (50 in each of three classes)
- Number of Attributes: 4 numeric, predictive attributes and the class
- Attribute Information:
  - sepal length in cm
  - sepal width in cm
  - petal length in cm
  - petal width in cm
  - class:
    - Iris Setosa
    - Iris Versicolour
    - Iris Virginica
While there is technically a CSV reader in Klong's "standard library", I couldn't work out the necessary incantation to read anything but string values. Instead I wound up with the hacky shell pipeline below to massage the data into a format Klong can read directly:
tr ',' ' ' <iris.csv \
| cut -d' ' -f1-4 \
| sed -e '1i[' -e '1d' -e 's/\(.*\)/\[\1\]/' -e '$a]' > iris.dat
So a CSV row like 1,2,3 becomes [1 2 3], and the whole file is wrapped in one outer pair of brackets.
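The net effect is that iris.dat holds one big Klong list of four-element rows, something like:
[
[5.1 3.5 1.4 0.2]
[4.9 3.0 1.4 0.2]
...
]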
I think this is the first real instance in which I've experienced the limitations of using Klong rather than a more widely-used language like J or perhaps APL. Relatedly, even as small as it is, the iris data has proven a little taxing on compute time.
First, and most boring, is to read the entire pre-prepared data set into an array I'll call data. Klong specifies a "from-channel" .fc using a given "in-channel" .ic, which I've parameterized to take a filename and then just "read" .r:
readfile::{.fc(.ic(x));.r()}
data::readfile("iris.dat")
In order to initialize a cluster somewhere, I need a verb to get a random row from the data. To accomplish this, I'll define randint, which just takes the floor (_) of some x multiplied by the random number verb .rn. To pick a row from the data, I can take a random integer in the range (or "tally" #) of data and "pick" (@) the row at that value. I'll do this a given number of times, by mapping (') over the y zeros produced by &y, to parameterize the number of centroids.
randint::{_x*.rn()}
init::{[data];data::x; {data@x}'{x;randint(#data)}'&y}
Using it is as straightforward as:
data init 3
[[5.7 2.5 5.0 2.0]
[6.2 3.4 5.4 2.3]
[6.0 3.0 4.8 1.8]]
Fundamentally, K-Means works by placing some random points as initial cluster positions; from there it measures the distance from every other point and adjusts the "center" of each cluster to try and minimize those distances. So first things first, I need a way to measure distance, and the most common method is Euclidean distance:
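For two points p and q with n coordinates each, that distance is

$$d(p, q) = \sqrt{\sum_{i=1}^{n} (p_i - q_i)^2}$$

which maps directly onto Klong: sum (+/) the squared differences and raise the result to the power 0.5: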
distance::{(+/((x-y)^2))^0.5}
If you remember back to geometry, a "3-4-5" triangle might have a length of 3, a height of 4 and a hypotenuse of 5 — meaning the distance from point (0,0) to (3,4) is 5. To be sure, we can verify our distance verb:
distance@[[0 0] [3 4]]
5.0
In order to get a distance value for each centroid (in the case of the Iris data set there are 3) I need to take the distance for each (') x in data using the transposed (+) centroids:
distances::{[a];a::y; distance(;+a)'x}
The transpose is necessary so that the subtraction inside distance lines up each of the four attributes of a data row with the corresponding attribute of every centroid; summing the squared differences then yields one distance per centroid. The effect is something like this:
centroids::data init 3
4.6 3.4 1.4 0.3
6.4 2.7 5.3 1.9
7.7 2.8 6.7 2.0
+centroids
4.6 6.4 7.7
3.4 2.7 2.8
1.4 5.3 6.7
0.3 1.9 2.0
So, applying the distance for each row in the data, the result is:
distance(;[[4.6 6.4 7.7] [3.4 2.7 2.8] [1.4 5.3 6.7] [0.3 1.9 2.0]]) ' data
0.519615242270663185 4.51995575199580908 6.21128006130781385
0.50990195135927848 4.52106182218292609 6.2617888817813077
0.264575131106459059 4.69361268108053415 6.4467045845144789
0.331662479035539985 4.55302097513288423 6.32297398381489465
0.458257569495584 4.5683695122001679 6.26578007912821655
0.99498743710661995 4.20119030752000093 5.83609458456594925
0 4.63680924774785187 6.3992187023104625
0.424264068711928514 4.44747119158741713 6.1587336360651286
...
(150 rows)
With a distance to each centroid established, it is time to work out which cluster a row belongs to. This is done by taking the minimum-over (&/) of each row in the distance result and checking for equality to "find" which centroid was matched (as the minimum):
cluster::{x=&/'x}
cluster(data distances centroids)
1 0 0
1 0 0
1 0 0
1 0 0
1 0 0
...
In the above example, all of the first 5 rows belong to the first cluster with the centroids provided.
To collect the cluster identifier from the cluster arrays shown above, I define a verb group, which "finds" (?) the first (*) 1 in each cluster:
group::{{*x?1}'cluster(x distances y)}
data group centroids
[0 0 0 0 0 0 0 0 0 0 0 0 ...
This alone isn't very interesting, especially in the above case, where all of the leading elements belong to the first cluster. It becomes more interesting when applied with an aggregation for each cluster value. Because centroids is itself an array, its tally (#) can be enumerated (!):
!#centroids
[0 1 2]
Which, when combined with a "find" (?), results in an aggregation of per-centroid grouping:
{(data group centroids)?x}'!#centroids
[[0 1 2 3 4 5 6 7 8 9 10 ...
[50 51 52 53 54 55 56 58 ...
[105 107 109 117 118 122 ...
Of course, the grouping is only a set of index positions into the original data array, so it makes sense to combine it with a "pick" (@) to get the actual data per group:
data@{(data group centroids)?x}'!#centroids
[[[5.1 3.5 1.4 0.2] [4.9 3.0 1.4 0.2] [4.7 3.2 1.3 0.2] ...
[[7.0 3.2 4.7 1.4] [6.4 3.2 4.5 1.5] [6.9 3.1 4.9 1.5] ...
[[7.6 3.0 6.6 2.1] [7.3 2.9 6.3 1.8] [7.2 3.6 6.1 2.5] ...
The final piece to explain is probably the easiest: K-Means is a minimization problem, and to finish a single step in the process we need to take the mean of each grouping; that mean becomes the cluster's new center. The mean is defined as the sum of an array divided by the number of elements in the array:
mean::{(+/x)%#x}
mean([0 1 2 3 4 5 6 7 8 9 10])
5.0
I've combined the last few pieces into a single verb means to simplify the interface, specifically around applying the mean to each centroid group, and to make the final step as easy as possible:
means::{[data centroids center]; data::x; centroids::y;
center::data@{(data group centroids)?x}'!#centroids;
mean'center}
means(data;centroids)
+---------------------+---------------------+---------------------+----------------------+
| 5.00566037735849057 | 3.36037735849056604 | 1.56226415094339622 | 0.288679245283018868 |
| 6.15862068965517241 | 2.85632183908045977 | 4.79425287356321839 | 1.65057471264367816 |
| 7.54 | 3.15 | 6.39 | 2.09 |
+---------------------+---------------------+---------------------+----------------------+
K-Means works by iteratively re-evaluating the distances and adjusting the centroids for the next iteration. The calculation is "done" when the adjustments are sufficiently small, which is in effect a fixed-point calculation. It turns out Klong has a fixed-point operation, termed convergence (:~). Using it requires a function (verb) of one argument (a monad); the result of each evaluation is fed back in as the next argument until the value stops changing. In the case of means, there are two arguments: "data", which does not change between evaluations, and "centroids", which does. In order to converge the means, I can partially apply the "data" argument and converge over subsequent centroids:
means(data;):~data init k
In the above example, k would be the desired number of centroids for a set of data. For the iris data set I get the following:
means(data;):~data init 3
+---------------------+---------------------+---------------------+---------------------+
| 6.85 | 3.07368421052631579 | 5.74210526315789474 | 2.07105263157894736 |
| 5.90161290322580645 | 2.74838709677419355 | 4.39354838709677419 | 1.43387096774193548 |
| 5.006 | 3.418 | 1.464 | 0.244 |
+---------------------+---------------------+---------------------+---------------------+
Of course, the above is still a bit abstract, and so I couldn't resist visualizing the identified clusters:
First, from a two-dimensional projection, where the black circles are the raw data and the red dots are the identified centers:
And again with a three-dimensional projection to better visualize at least some of the extra dimensions:
I've already been asked "Wouldn't it be better if it could find the right number of clusters itself?", which is roughly what the elbow method is for. There is also the "curse of dimensionality" to consider, which calls for an implementation of principal component analysis or something similar.
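As a rough sketch of where that could go (an untested aside of my own, reusing the verbs above): the elbow method plots the within-cluster sum of squared distances against the number of clusters and looks for the point where adding clusters stops paying off. That sum is just the squared minimum distance from each row to its nearest centroid, summed over the data:
:"Untested sketch: within-cluster sum of squares for a given set of centroids"
wss::{+/({(&/x)^2}'(x distances y))}
Evaluating data wss (means(data;):~data init k) for increasing k should then trace out the curve whose "elbow" suggests a reasonable k.
For reference, here is the complete set of definitions in one place: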
readfile::{.fc(.ic(x));.r()}
data::readfile("iris.dat")
randint::{_x*.rn()}
init::{[data];data::x; {data@x}'{x;randint(#data)}'&y}
distance::{(+/((x-y)^2))^0.5}
distances::{[a];a::y; distance(;+a)'x}
cluster::{x=&/'x}
group::{{*x?1}'cluster(x distances y)}
mean::{(+/x)%#x}
:"means(DATA;):~DATA init K"
means::{[data centroids center]; data::x; centroids::y;
center::data@{(data group centroids)?x}'!#centroids;
mean'center}