Being veritably snowed in today I thought I might circle back through the publicly available bike-share data for the city of Pittsburgh and see if there was anything of interest to be gleaned.
When I wrote previously about gathering the data I got side-tracked with developing my own SQLite module and calculating distances. Here I try to put that information to use. The question outlined here are just things that seemed interesting and I was capable of querying for.
select count(distinct(bikeid)) bikes from rides;
| bikes |
|---|
| 488 |
select count(distinct(id)) stations from stations;
| stations |
|---|
| 108 |
There isn't too much to this, I think it would be more interesting if I knew how subscribers were counted. There is a tie-in with the city metro cards that allows for free rentals under 15 minutes.
select usertype, count(*) [count] from rides group by usertype;
| usertype | count |
|---|---|
| Customer | 2751 |
| Subscriber | 4542 |
select usertype, count(*) [count]
from rides
where duration < (15*60)
group by usertype
| usertype | count |
|---|---|
| Customer | 1041 |
| Subscriber | 3348 |
Based on these differences, where subscribers account for 62% of all rides but take 76% of these short trips I guess that the metro card holders are counted among subscribers rather than customers.
select round(avg(numrides)) [average rides]
from (
select bikeid, count(*) numrides
from rides
group by bikeid);
| average rides |
|---|
| 15.0 |
I'm pretty sure I've worked out this query correctly. By taking the difference in time between rentals per-bike I count "idle time". Populating a temporary table just makes it easier to calculate the median.
create temporary table idletime (
bikeid,
seconds
);
insert into idletime
select bikeid,
strftime('%s', starttime) - strftime('%s', lag(endtime) over
(partition by bikeid
order by endtime))
from rides;
select (avg(seconds)/60)/60 hours
from (select seconds
from idletime
order by seconds
limit 2 - (select count(*) from idletime) % 2 -- odd 1, even 2
offset (select (count(*) - 1) / 2
from idletime))
| hours |
|---|
| 30.1833333333333 |
This was easy but there is too much data for the raw results to be very insightful.
with [per bike distance] as (
select bikeid, starttime, pd.distance
from rides r
join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id
)
select bikeid, round(sum(distance)) total
from [per bike distance]
group by bikeid
order by total desc
limit 10;
| bikeid | total |
|---|---|
| 70611 | 74711.0 |
| 70511 | 73499.0 |
| 70672 | 68515.0 |
| 70631 | 67937.0 |
| 70525 | 66937.0 |
| 70649 | 65583.0 |
| 70532 | 65553.0 |
| 70538 | 64952.0 |
| 70608 | 64374.0 |
| 70583 | 57860.0 |
create temporary table summary (
bikeid,
distance
);
with [per bike distance] as (
select bikeid, starttime, pd.distance
from rides r
join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id
) insert into summary
select bikeid, sum(distance)
from [per bike distance]
group by bikeid;
select round(min(distance)) minimum, round(max(distance)) maximum, round(avg(distance)) average
from summary;
select round(avg(distance)) median
from (select distance
from summary
order by distance
limit 2 - (select count(*) from summary) % 2 -- odd 1, even 2
offset (select (count(*) - 1) / 2
from summary))
| minimum | maximum | average | median |
|---|---|---|---|
| 714.0 | 74711.0 | 22193.0 | 19921.0 |
I refer to any trip that starts and ends at the same station as a circuit. This is probably best seen as a proxy for sightseeing locations.
select name, count(*) circuits
from rides
join stations on from_id=stations.id
where from_id = to_id
group by from_id
order by circuits desc
limit 10;
| name | circuits |
|---|---|
| 33rd St and Penn Ave | 103 |
| S 27th St & Sidney St. (Southside Works) | 101 |
| Liberty Ave & Stanwix St | 78 |
| North Shore Trail & Ft Duquesne Bridge | 67 |
| S 18th St & Sidney St | 60 |
| Isabella St & Federal St (PNC Park) | 36 |
| First Ave & B St (T Station) | 35 |
| Third Ave & Wood St | 32 |
| S 12th St & E Carson St | 30 |
| Forbes Ave & Market Square | 26 |
It would be useful to know of extenuating circumstances here — were these installed during the quarter? Were they closed? It is impossible to say based on this dataset alone.
select stations.name, count(*) [count]
from stations
join rides on rides.from_id=stations.id or rides.to_id=stations.id
group by stations.id
order by [count] asc
limit 10;
| name | count |
|---|---|
| Hamilton Ave & N Lang Ave | 2 |
| Bennett St & Eccrue Way | 4 |
| E Liberty Blvd & Larimer Ave | 4 |
| Hamilton Ave & N Dallas Ave | 6 |
| Fifth Ave & Miltenberger | 8 |
| Frankstown Ave & E Liberty Blvd | 10 |
| Hamilton Ave & Zodiac Way | 11 |
| Centre Ave & Kirkpatrick St | 12 |
| Forbes Ave & Murray Ave | 12 |
| Liberty Ave & Fisk St | 18 |
I haven't decided what the best format for this would be. I think it could make sense as a kind of sparkline plot or general multiplot but there are so many stations that it would still be very visually dense.
select from_id [departing station], strftime('%w', starttime) day, count(*) [count]
from rides
group by [departing station], day
order by [departing station], day
limit 14;
| departing station | day | count |
|---|---|---|
| 1000 | 0 | 51 |
| 1000 | 1 | 16 |
| 1000 | 2 | 28 |
| 1000 | 3 | 26 |
| 1000 | 4 | 28 |
| 1000 | 5 | 25 |
| 1000 | 6 | 44 |
| 1001 | 0 | 22 |
| 1001 | 1 | 33 |
| 1001 | 2 | 30 |
| 1001 | 3 | 44 |
| 1001 | 4 | 50 |
| 1001 | 5 | 31 |
| 1001 | 6 | 27 |
I'm not totally satisfied with this query. I had hoped that it might be possible to identify when the source bike rack was at capacity but there doesn't seem to be enough information here to make such a determination.
select usertype, s1.name [from], s2.name [to], (duration/60) minutes, distance from rides r join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id join stations s1 on r.from_id=s1.id join stations s2 on r.to_id=s2.id where r.from_id <> r.to_id and distance < 300 and duration > 600 limit 10;
| usertype | from | to | minutes | distance |
|---|---|---|---|---|
| Customer | Liberty Ave & 6th St | Forbes Ave & Market Square | 31 | 145.239650442889 |
| Customer | Forbes Ave & Grant St | Smithfield St & Fourth Ave | 17 | 180.100642583674 |
| Customer | Third Ave & Wood St | Smithfield St & Fourth Ave | 59 | 187.374093661187 |
| Customer | Smithfield St & Fourth Ave | Third Ave & Wood St | 12 | 187.374093661187 |
| Customer | Tennyson Ave & Fifth Ave | Schenley Dr & Forbes Ave (Schenley Plaza) | 26 | 292.364498319696 |
| Customer | Forbes Ave & Market Square | Third Ave & Wood St | 15 | 229.868921089702 |
| Subscriber | Fort Duquesne Blvd & 7th | 9th St & Penn Ave | 10 | 248.06802267547 |
| Subscriber | Third Ave & Wood St | Smithfield St & Fourth Ave | 26 | 187.374093661187 |
| Subscriber | Wood St & Sixth Ave | Liberty Ave & 6th St | 13 | 200.711833366798 |
| Customer | Butler St & 48th St | Butler St & Stanton Ave | 34 | 286.217564216965 |