Being veritably snowed in today I thought I might circle back through the publicly available bike-share data for the city of Pittsburgh and see if there was anything of interest to be gleaned.
When I wrote previously about gathering the data I got side-tracked with developing my own SQLite module and calculating distances. Here I try to put that information to use. The question outlined here are just things that seemed interesting and I was capable of querying for.
select count(distinct(bikeid)) bikes from rides;
bikes |
---|
488 |
select count(distinct(id)) stations from stations;
stations |
---|
108 |
There isn't too much to this, I think it would be more interesting if I knew how subscribers were counted. There is a tie-in with the city metro cards that allows for free rentals under 15 minutes.
select usertype, count(*) [count] from rides group by usertype;
usertype | count |
---|---|
Customer | 2751 |
Subscriber | 4542 |
select usertype, count(*) [count] from rides where duration < (15*60) group by usertype
usertype | count |
---|---|
Customer | 1041 |
Subscriber | 3348 |
Based on these differences, where subscribers account for 62% of all rides but take 76% of these short trips I guess that the metro card holders are counted among subscribers rather than customers.
select round(avg(numrides)) [average rides] from ( select bikeid, count(*) numrides from rides group by bikeid);
average rides |
---|
15.0 |
I'm pretty sure I've worked out this query correctly. By taking the difference in time between rentals per-bike I count "idle time". Populating a temporary table just makes it easier to calculate the median.
create temporary table idletime ( bikeid, seconds ); insert into idletime select bikeid, strftime('%s', starttime) - strftime('%s', lag(endtime) over (partition by bikeid order by endtime)) from rides; select (avg(seconds)/60)/60 hours from (select seconds from idletime order by seconds limit 2 - (select count(*) from idletime) % 2 -- odd 1, even 2 offset (select (count(*) - 1) / 2 from idletime))
hours |
---|
30.1833333333333 |
This was easy but there is too much data for the raw results to be very insightful.
with [per bike distance] as ( select bikeid, starttime, pd.distance from rides r join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id ) select bikeid, round(sum(distance)) total from [per bike distance] group by bikeid order by total desc limit 10;
bikeid | total |
---|---|
70611 | 74711.0 |
70511 | 73499.0 |
70672 | 68515.0 |
70631 | 67937.0 |
70525 | 66937.0 |
70649 | 65583.0 |
70532 | 65553.0 |
70538 | 64952.0 |
70608 | 64374.0 |
70583 | 57860.0 |
create temporary table summary ( bikeid, distance ); with [per bike distance] as ( select bikeid, starttime, pd.distance from rides r join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id ) insert into summary select bikeid, sum(distance) from [per bike distance] group by bikeid; select round(min(distance)) minimum, round(max(distance)) maximum, round(avg(distance)) average from summary; select round(avg(distance)) median from (select distance from summary order by distance limit 2 - (select count(*) from summary) % 2 -- odd 1, even 2 offset (select (count(*) - 1) / 2 from summary))
minimum | maximum | average | median |
---|---|---|---|
714.0 | 74711.0 | 22193.0 | 19921.0 |
I refer to any trip that starts and ends at the same station as a circuit. This is probably best seen as a proxy for sightseeing locations.
select name, count(*) circuits from rides join stations on from_id=stations.id where from_id = to_id group by from_id order by circuits desc limit 10;
name | circuits |
---|---|
33rd St and Penn Ave | 103 |
S 27th St & Sidney St. (Southside Works) | 101 |
Liberty Ave & Stanwix St | 78 |
North Shore Trail & Ft Duquesne Bridge | 67 |
S 18th St & Sidney St | 60 |
Isabella St & Federal St (PNC Park) | 36 |
First Ave & B St (T Station) | 35 |
Third Ave & Wood St | 32 |
S 12th St & E Carson St | 30 |
Forbes Ave & Market Square | 26 |
It would be useful to know of extenuating circumstances here — were these installed during the quarter? Were they closed? It is impossible to say based on this dataset alone.
select stations.name, count(*) [count] from stations join rides on rides.from_id=stations.id or rides.to_id=stations.id group by stations.id order by [count] asc limit 10;
name | count |
---|---|
Hamilton Ave & N Lang Ave | 2 |
Bennett St & Eccrue Way | 4 |
E Liberty Blvd & Larimer Ave | 4 |
Hamilton Ave & N Dallas Ave | 6 |
Fifth Ave & Miltenberger | 8 |
Frankstown Ave & E Liberty Blvd | 10 |
Hamilton Ave & Zodiac Way | 11 |
Centre Ave & Kirkpatrick St | 12 |
Forbes Ave & Murray Ave | 12 |
Liberty Ave & Fisk St | 18 |
I haven't decided what the best format for this would be. I think it could make sense as a kind of sparkline plot or general multiplot but there are so many stations that it would still be very visually dense.
select from_id [departing station], strftime('%w', starttime) day, count(*) [count] from rides group by [departing station], day order by [departing station], day limit 14;
departing station | day | count |
---|---|---|
1000 | 0 | 51 |
1000 | 1 | 16 |
1000 | 2 | 28 |
1000 | 3 | 26 |
1000 | 4 | 28 |
1000 | 5 | 25 |
1000 | 6 | 44 |
1001 | 0 | 22 |
1001 | 1 | 33 |
1001 | 2 | 30 |
1001 | 3 | 44 |
1001 | 4 | 50 |
1001 | 5 | 31 |
1001 | 6 | 27 |
I'm not totally satisfied with this query. I had hoped that it might be possible to identify when the source bike rack was at capacity but there doesn't seem to be enough information here to make such a determination.
select usertype, s1.name [from], s2.name [to], (duration/60) minutes, distance from rides r join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id join stations s1 on r.from_id=s1.id join stations s2 on r.to_id=s2.id where r.from_id <> r.to_id and distance < 300 and duration > 600 limit 10;
usertype | from | to | minutes | distance |
---|---|---|---|---|
Customer | Liberty Ave & 6th St | Forbes Ave & Market Square | 31 | 145.239650442889 |
Customer | Forbes Ave & Grant St | Smithfield St & Fourth Ave | 17 | 180.100642583674 |
Customer | Third Ave & Wood St | Smithfield St & Fourth Ave | 59 | 187.374093661187 |
Customer | Smithfield St & Fourth Ave | Third Ave & Wood St | 12 | 187.374093661187 |
Customer | Tennyson Ave & Fifth Ave | Schenley Dr & Forbes Ave (Schenley Plaza) | 26 | 292.364498319696 |
Customer | Forbes Ave & Market Square | Third Ave & Wood St | 15 | 229.868921089702 |
Subscriber | Fort Duquesne Blvd & 7th | 9th St & Penn Ave | 10 | 248.06802267547 |
Subscriber | Third Ave & Wood St | Smithfield St & Fourth Ave | 26 | 187.374093661187 |
Subscriber | Wood St & Sixth Ave | Liberty Ave & 6th St | 13 | 200.711833366798 |
Customer | Butler St & 48th St | Butler St & Stanton Ave | 34 | 286.217564216965 |