A Minor SQL Epiphany2021-07-15
I previously delved into using my own loadable module to circumvent the lack of user defined functions in SQLite. A more typical, relational solution occurred to me today and I thought I should write it down before I forget.
I started with the Haversine formula, populated here with sample data:
select acos( sin(radians(40.441326)) * -- latitude 1 sin(radians(40.440877)) + -- latitude 2 cos(radians(40.441326)) * -- latitude 1 cos(radians(40.440877)) * -- latitude 2 cos(radians(-80.00308) - -- longitude 2 radians(-80.004679)) -- longitude 1 ) * 6371000;
I went in search of a user-defined function in order to package up the slightly complicated query that I did not want to use directly (and repeatedly) in the analysis I was interested in.
I wrote a function
sdist for "spherical distance"
(admittedly not a very good name) inside of a loadable module
(shared C libary) that took as parameters the two latitude and
I have no excuse for why it didn't occur to me the first time I wrote all this but there is an obvious relational database construct for this scenario: a view.
create view distances (from_id, to_id, distance) as select s1.id, s2.id, acos( sin(radians(s1.latitude)) * sin(radians(s2.latitude)) + cos(radians(s1.latitude)) * cos(radians(s2.latitude)) * cos(radians(s2.longitude) - radians(s1.longitude)) ) * 6371000 from stations s1, stations s2;
I think this is a much cleaner solution. Not only does it maintain the entire solution within plain SQL and the database directly; it still provides the ability to precompute a look-up table for each possible pairing:
create table precomputed_distance as select * from distances;
Nice and Simple
Doing this got me thinking more about a book I read, A Curious Moon. In it the author describes the kind of "good hygiene" to be practiced with owning and operating a database. I figured I could better document the entire process that led up to the actual analysis by producing a single SQL (well, SQLite) document that manages the ingestion and munging of data:
.import --csv 2019Q1_stations.csv raw_stations_2019Q1 .import --csv 2019Q1_rentals.csv raw_rentals_2019Q1 create table stations ( id integer, name string, racks integer, latitude float, longitude float, primary key (id) ); insert into stations select * from raw_stations_2019Q1; -- there is one record where a latitude is incorrectly negative, let's -- fix it: update stations set latitude = abs(latitude) where latitude < 0; create table rides ( id integer, starttime datetime, endtime datetime, bikeid integer, duration integer, from_id integer, to_id integer, usertype string, foreign key(from_id) references station(id), foreign key(to_id) references station(id), primary key(id) ); insert into rides select "Trip id", "Starttime", "Stoptime", "Bikeid", "Tripduration", "From station id", "To station id", "Usertype" from raw_rentals_2019Q1; create view distances (from_id, to_id, distance) as select s1.id, s2.id, acos( sin(radians(s1.latitude)) * sin(radians(s2.latitude)) + cos(radians(s1.latitude)) * cos(radians(s2.latitude)) * cos(radians(s2.longitude) - radians(s1.longitude)) ) * 6371000 from stations s1, stations s2;