Being veritably snowed in today I thought I might circle back through the publicly available bike-share data for the city of Pittsburgh and see if there was anything of interest to be gleaned.
When I wrote previously about gathering the data I got side-tracked with developing my own SQLite module and calculating distances. Here I try to put that information to use. The question outlined here are just things that seemed interesting and I was capable of querying for.
select count(distinct(bikeid)) bikes from rides;
select count(distinct(id)) stations from stations;
There isn't too much to this, I think it would be more interesting if I knew how subscribers were counted. There is a tie-in with the city metro cards that allows for free rentals under 15 minutes.
select usertype, count(*) [count] from rides group by usertype;
select usertype, count(*) [count] from rides where duration < (15*60) group by usertype
Based on these differences, where subscribers account for 62% of all rides but take 76% of these short trips I guess that the metro card holders are counted among subscribers rather than customers.
select round(avg(numrides)) [average rides] from ( select bikeid, count(*) numrides from rides group by bikeid);
I'm pretty sure I've worked out this query correctly. By taking the difference in time between rentals per-bike I count "idle time". Populating a temporary table just makes it easier to calculate the median.
create temporary table idletime ( bikeid, seconds ); insert into idletime select bikeid, strftime('%s', starttime) - strftime('%s', lag(endtime) over (partition by bikeid order by endtime)) from rides; select (avg(seconds)/60)/60 hours from (select seconds from idletime order by seconds limit 2 - (select count(*) from idletime) % 2 -- odd 1, even 2 offset (select (count(*) - 1) / 2 from idletime))
This was easy but there is too much data for the raw results to be very insightful.
with [per bike distance] as ( select bikeid, starttime, pd.distance from rides r join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id ) select bikeid, round(sum(distance)) total from [per bike distance] group by bikeid order by total desc limit 10;
create temporary table summary ( bikeid, distance ); with [per bike distance] as ( select bikeid, starttime, pd.distance from rides r join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id ) insert into summary select bikeid, sum(distance) from [per bike distance] group by bikeid; select round(min(distance)) minimum, round(max(distance)) maximum, round(avg(distance)) average from summary; select round(avg(distance)) median from (select distance from summary order by distance limit 2 - (select count(*) from summary) % 2 -- odd 1, even 2 offset (select (count(*) - 1) / 2 from summary))
I refer to any trip that starts and ends at the same station as a circuit. This is probably best seen as a proxy for sightseeing locations.
select name, count(*) circuits from rides join stations on from_id=stations.id where from_id = to_id group by from_id order by circuits desc limit 10;
|33rd St and Penn Ave||103|
|S 27th St & Sidney St. (Southside Works)||101|
|Liberty Ave & Stanwix St||78|
|North Shore Trail & Ft Duquesne Bridge||67|
|S 18th St & Sidney St||60|
|Isabella St & Federal St (PNC Park)||36|
|First Ave & B St (T Station)||35|
|Third Ave & Wood St||32|
|S 12th St & E Carson St||30|
|Forbes Ave & Market Square||26|
It would be useful to know of extenuating circumstances here — were these installed during the quarter? Were they closed? It is impossible to say based on this dataset alone.
select stations.name, count(*) [count] from stations join rides on rides.from_id=stations.id or rides.to_id=stations.id group by stations.id order by [count] asc limit 10;
|Hamilton Ave & N Lang Ave||2|
|Bennett St & Eccrue Way||4|
|E Liberty Blvd & Larimer Ave||4|
|Hamilton Ave & N Dallas Ave||6|
|Fifth Ave & Miltenberger||8|
|Frankstown Ave & E Liberty Blvd||10|
|Hamilton Ave & Zodiac Way||11|
|Centre Ave & Kirkpatrick St||12|
|Forbes Ave & Murray Ave||12|
|Liberty Ave & Fisk St||18|
I haven't decided what the best format for this would be. I think it could make sense as a kind of sparkline plot or general multiplot but there are so many stations that it would still be very visually dense.
select from_id [departing station], strftime('%w', starttime) day, count(*) [count] from rides group by [departing station], day order by [departing station], day limit 14;
I'm not totally satisfied with this query. I had hoped that it might be possible to identify when the source bike rack was at capacity but there doesn't seem to be enough information here to make such a determination.
select usertype, s1.name [from], s2.name [to], (duration/60) minutes, distance from rides r join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id join stations s1 on r.from_id=s1.id join stations s2 on r.to_id=s2.id where r.from_id <> r.to_id and distance < 300 and duration > 600 limit 10;
|Customer||Liberty Ave & 6th St||Forbes Ave & Market Square||31||145.239650442889|
|Customer||Forbes Ave & Grant St||Smithfield St & Fourth Ave||17||180.100642583674|
|Customer||Third Ave & Wood St||Smithfield St & Fourth Ave||59||187.374093661187|
|Customer||Smithfield St & Fourth Ave||Third Ave & Wood St||12||187.374093661187|
|Customer||Tennyson Ave & Fifth Ave||Schenley Dr & Forbes Ave (Schenley Plaza)||26||292.364498319696|
|Customer||Forbes Ave & Market Square||Third Ave & Wood St||15||229.868921089702|
|Subscriber||Fort Duquesne Blvd & 7th||9th St & Penn Ave||10||248.06802267547|
|Subscriber||Third Ave & Wood St||Smithfield St & Fourth Ave||26||187.374093661187|
|Subscriber||Wood St & Sixth Ave||Liberty Ave & 6th St||13||200.711833366798|
|Customer||Butler St & 48th St||Butler St & Stanton Ave||34||286.217564216965|