I previously delved into using my own loadable module to circumvent the lack of user defined functions in SQLite. A more typical, relational solution occurred to me today and I thought I should write it down before I forget.
I started with the Haversine formula, populated here with sample data:
select acos(
sin(radians(40.441326)) * -- latitude 1
sin(radians(40.440877)) + -- latitude 2
cos(radians(40.441326)) * -- latitude 1
cos(radians(40.440877)) * -- latitude 2
cos(radians(-80.00308) - -- longitude 2
radians(-80.004679)) -- longitude 1
) * 6371000;
I went in search of a user-defined function in order to package up the slightly complicated query that I did not want to use directly (and repeatedly) in the analysis I was interested in.
I wrote a function sdist
for "spherical distance"
(admittedly not a very good name) inside of a loadable module
(shared C libary) that took as parameters the two latitude and
longitude pairs.
I have no excuse for why it didn't occur to me the first time I wrote all this but there is an obvious relational database construct for this scenario: a view.
create view distances (from_id, to_id, distance) as
select s1.id, s2.id, acos(
sin(radians(s1.latitude)) *
sin(radians(s2.latitude)) +
cos(radians(s1.latitude)) *
cos(radians(s2.latitude)) *
cos(radians(s2.longitude) -
radians(s1.longitude))
) * 6371000
from stations s1, stations s2;
I think this is a much cleaner solution. Not only does it maintain the entire solution within plain SQL and the database directly; it still provides the ability to precompute a look-up table for each possible pairing:
create table precomputed_distance as
select * from distances;
Doing this got me thinking more about a book I read, A Curious Moon. In it the author describes the kind of "good hygiene" to be practiced with owning and operating a database. I figured I could better document the entire process that led up to the actual analysis by producing a single SQL (well, SQLite) document that manages the ingestion and munging of data:
.import --csv 2019Q1_stations.csv raw_stations_2019Q1
.import --csv 2019Q1_rentals.csv raw_rentals_2019Q1
create table stations (
id integer,
name string,
racks integer,
latitude float,
longitude float,
primary key (id)
);
insert into stations select * from raw_stations_2019Q1;
-- there is one record where a latitude is incorrectly negative, let's
-- fix it:
update stations
set latitude = abs(latitude)
where latitude < 0;
create table rides (
id integer,
starttime datetime,
endtime datetime,
bikeid integer,
duration integer,
from_id integer,
to_id integer,
usertype string,
foreign key(from_id) references station(id),
foreign key(to_id) references station(id),
primary key(id)
);
insert into rides
select "Trip id",
"Starttime",
"Stoptime",
"Bikeid",
"Tripduration",
"From station id",
"To station id",
"Usertype"
from raw_rentals_2019Q1;
create view distances (from_id, to_id, distance) as
select s1.id, s2.id, acos(
sin(radians(s1.latitude)) *
sin(radians(s2.latitude)) +
cos(radians(s1.latitude)) *
cos(radians(s2.latitude)) *
cos(radians(s2.longitude) -
radians(s1.longitude))
) * 6371000
from stations s1, stations s2;