[nolan@nprescott.com] $>  cat blog archive feed

SQL Prototypes

2018-09-18

I had an interesting problem to work on this afternoon that lead a few features of Postgres I haven't previously used. It was an interesting exercise in some quick prototyping in pure SQL.

The Problem

I'm working on a way to visualize groups of key-value pairs that are both dynamic in the number of groups possible, as well as the possible keys and values. It is a bit of an interesting exercise in developing queries for patterns rather than specific information.

Table Structures

Generally speaking, I'm thinking of two tables, a names table:

id name
1 alpha
2 bravo
3 charlie
4 delta
create table names (
       id serial primary key,
       name text not null
);

and a table of key-value pairs, containing a reference to the names table:

key value name_id
a foo 1
a foo 2
b bar 2
c qux 3
c qux 4
create table keyValues (
       key text not null,
       value text not null,
       name_id integer references names(id)
);

The Goal

What I'd like is to collect discrete "groups" of keys and values, so in the above table the groups would be:

group one group two group three
alpha bravo charlie, delta

So while a: foo is referenced by alpha and bravo, bravo has an additional key-value (b: bar) that makes it a separate group to itself. The only group with more than one element is the third, where charlie and delta share a set of key-value pairs (in this case the single pair c: qux).

The Approach

Because there aren't any real constraints in the data on what is or isn't a valid key or value, I'd like to combine the two pieces of information into one composite piece.

  select name_id, (key || ': ' || value) kv_string
    from example
order by key, value

Here I'm just concatenating the two fields together to create:

name_id kv_string
1 a: foo
2 a: foo
2 b: bar
3 c: qux
4 c: qux

That becomes useful when used in conjunction with the function string_agg which is an aggregation into a string for multiple rows (where concatenating was useful to "aggregate" multiple columns). Below I'm using a placeholder "temp_table" that will be clarified below, but assume it has the information in the table above:

  select name_id, string_agg(kv_string, ', ') key_values
    from temp_table
group by name_id
name_id key_values
3 c: qux
4 c: qux
2 a: foo, b: bar
1 a: foo

Each name_id now has a single column of all it's key-value pairs. All that remains is grouping those names with a shared key_value field. For this, I found array_agg, which is like string_agg, but it aggregates into an array (which Postgres can handle in a variety of ways).

  select array_agg(id) groups 
    from temp_table_2
group by key_values
groups
{1}
{2}
{3,4}

Putting It Together

All of those references to "temp_tables" might be annoying if we had to actually create those tables ourselves, but luckily, Postgres supports common table expressions (which I've mentioned before). The separate queries can be combined like this:

with concatenated_key_values as (
  select id, (key || ' ' || value) kv_string
    from example
    order by key, value
),  combined_columns as (
  select id, string_agg(kv_string, ', ') key_values
    from concatenated_key_values
    group by id
) select array_agg(id) groups 
    from combined_columns
group by string_agg;

A Touch of Polish

As a final touch, to bring the end result in-line with what I outlined as the goal, it'd be nice to enumerate the groups and use the name, rather than the name_id.

with concatenated_key_values as (
  select id, (key || ' ' || value) kv_string
    from example
order by key, value
), combined_rows as (
  select id, string_agg(kv_string, ', ') key_values
    from concatenated_key_values
group by id
) select row_number() over () group_number, 
         array_agg(name) group_members
    from combined_rows
    join names using(id)
group by key_values;
group_number group_members
1 {alpha}
2 {bravo}
3 {charlie,delta}

Neat Things

This ended up being an interesting exercise is some of PostgreSQL's aggregate functions, it took a while for things to click with me that I was doing aggregations in two "directions", first across columns and then across rows. Having the ability to omit an order by clause from the row_number function was an interesting twist, but because I don't actually care which number is assigned to which "group", it does make some sense. Because I am developing the tables at the same time as the queries, and before the application it is a pleasant surprise how easy it is to refine these sorts of prototypes directly in Postgres.

[nolan@nprescott.com] $> █