indexpost archiveatom feed syndication feed icon

A Minor SQL Epiphany

2021-07-15

I previously delved into using my own loadable module to circumvent the lack of user defined functions in SQLite. A more typical, relational solution occurred to me today and I thought I should write it down before I forget.

The Problem

I started with the Haversine formula, populated here with sample data:

select acos(
    sin(radians(40.441326)) * -- latitude 1
    sin(radians(40.440877)) + -- latitude 2
    cos(radians(40.441326)) * -- latitude 1
    cos(radians(40.440877)) * -- latitude 2
    cos(radians(-80.00308) -  -- longitude 2
        radians(-80.004679))  -- longitude 1
    ) * 6371000;

I went in search of a user-defined function in order to package up the slightly complicated query that I did not want to use directly (and repeatedly) in the analysis I was interested in.

Previous Solution

I wrote a function sdist for "spherical distance" (admittedly not a very good name) inside of a loadable module (shared C libary) that took as parameters the two latitude and longitude pairs.

New Solution

I have no excuse for why it didn't occur to me the first time I wrote all this but there is an obvious relational database construct for this scenario: a view.

create view distances (from_id, to_id, distance) as

select s1.id, s2.id, acos(
    sin(radians(s1.latitude)) *
    sin(radians(s2.latitude)) +
    cos(radians(s1.latitude)) *
    cos(radians(s2.latitude)) *
    cos(radians(s2.longitude) -
        radians(s1.longitude))
) * 6371000
from stations s1, stations s2;

I think this is a much cleaner solution. Not only does it maintain the entire solution within plain SQL and the database directly; it still provides the ability to precompute a look-up table for each possible pairing:

create table precomputed_distance as
select * from distances;

Nice and Simple

Doing this got me thinking more about a book I read, A Curious Moon. In it the author describes the kind of "good hygiene" to be practiced with owning and operating a database. I figured I could better document the entire process that led up to the actual analysis by producing a single SQL (well, SQLite) document that manages the ingestion and munging of data:

.import --csv 2019Q1_stations.csv raw_stations_2019Q1
.import --csv 2019Q1_rentals.csv raw_rentals_2019Q1

create table stations (
  id integer,
  name string,
  racks integer,
  latitude float,
  longitude float,
  primary key (id)
);

insert into stations select * from raw_stations_2019Q1;

-- there is one record where a latitude is incorrectly negative, let's
-- fix it:
update stations
   set latitude = abs(latitude)
 where latitude < 0;

create table rides (
  id integer,
  starttime datetime,
  endtime datetime,
  bikeid integer,
  duration integer,
  from_id integer,
  to_id integer,
  usertype string,
  foreign key(from_id) references station(id),
  foreign key(to_id) references station(id),
  primary key(id)
);

insert into rides
select "Trip id",
       "Starttime",
       "Stoptime",
       "Bikeid",
       "Tripduration",
       "From station id",
       "To station id",
       "Usertype"
  from raw_rentals_2019Q1;

create view distances (from_id, to_id, distance) as

select s1.id, s2.id, acos(
    sin(radians(s1.latitude)) *
    sin(radians(s2.latitude)) +
    cos(radians(s1.latitude)) *
    cos(radians(s2.latitude)) *
    cos(radians(s2.longitude) -
        radians(s1.longitude))
) * 6371000
from stations s1, stations s2;