[nolan@nprescott.com] $  cat weblog archive feed

Bike Share Data

2021-02-20

Table of Contents

Being veritably snowed in today I thought I might circle back through the publicly available bike-share data for the city of Pittsburgh and see if there was anything of interest to be gleaned.

When I wrote previously about gathering the data I got side-tracked with developing my own SQLite module and calculating distances. Here I try to put that information to use. The question outlined here are just things that seemed interesting and I was capable of querying for.

1 How many bikes and stations are there?

select count(distinct(bikeid)) bikes from rides;
bikes
488
select count(distinct(id)) stations from stations;
stations
108

2 What is the ratio of customers to subscribers?

There isn't too much to this, I think it would be more interesting if I knew how subscribers were counted. There is a tie-in with the city metro cards that allows for free rentals under 15 minutes.

select usertype, count(*) [count] from rides group by usertype;
usertype count
Customer 2751
Subscriber 4542

2.1 What is the break-down of short rides between usertypes?

  select usertype, count(*) [count]
    from rides
   where duration < (15*60)
group by usertype
usertype count
Customer 1041
Subscriber 3348

Based on these differences, where subscribers account for 62% of all rides but take 76% of these short trips I guess that the metro card holders are counted among subscribers rather than customers.

3 Average rides per-bike in Q1 2019?

select round(avg(numrides)) [average rides]
from (
    select bikeid, count(*) numrides
      from rides
  group by bikeid);
average rides
15.0

4 What is the median idle time between rides across all bikes?

I'm pretty sure I've worked out this query correctly. By taking the difference in time between rentals per-bike I count "idle time". Populating a temporary table just makes it easier to calculate the median.

create temporary table idletime (
    bikeid,
    seconds
);

insert into idletime
select bikeid,
       strftime('%s', starttime) - strftime('%s', lag(endtime) over
                                                  (partition by bikeid
                                                       order by endtime))
  from rides;

select (avg(seconds)/60)/60 hours
  from (select seconds
          from idletime
      order by seconds
         limit 2 - (select count(*) from idletime) % 2    -- odd 1, even 2
        offset (select (count(*) - 1) / 2
                  from idletime))
hours
30.1833333333333

5 Distance travelled per bike

This was easy but there is too much data for the raw results to be very insightful.

with [per bike distance] as (
  select bikeid, starttime, pd.distance
    from rides r
    join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id 
)
  select bikeid, round(sum(distance)) total
    from [per bike distance] 
group by bikeid 
order by total desc
limit 10;
bikeid total
70611 74711.0
70511 73499.0
70672 68515.0
70631 67937.0
70525 66937.0
70649 65583.0
70532 65553.0
70538 64952.0
70608 64374.0
70583 57860.0

5.1 summary information

create temporary table summary (
    bikeid,
    distance
);

with [per bike distance] as (
  select bikeid, starttime, pd.distance
    from rides r
    join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id 
) insert into summary
  select bikeid, sum(distance)
    from [per bike distance] 
group by bikeid;

select round(min(distance)) minimum, round(max(distance)) maximum, round(avg(distance)) average 
  from summary;

select round(avg(distance)) median
  from (select distance
          from summary
      order by distance
         limit 2 - (select count(*) from summary) % 2    -- odd 1, even 2
        offset (select (count(*) - 1) / 2
                  from summary))
minimum maximum average median
714.0 74711.0 22193.0 19921.0

6 Which are the most popular stations for circuits?

I refer to any trip that starts and ends at the same station as a circuit. This is probably best seen as a proxy for sightseeing locations.

  select name, count(*) circuits 
    from rides 
    join stations on from_id=stations.id 
   where from_id = to_id 
group by from_id 
order by circuits desc 
limit 10;
name circuits
33rd St and Penn Ave 103
S 27th St & Sidney St. (Southside Works) 101
Liberty Ave & Stanwix St 78
North Shore Trail & Ft Duquesne Bridge 67
S 18th St & Sidney St 60
Isabella St & Federal St (PNC Park) 36
First Ave & B St (T Station) 35
Third Ave & Wood St 32
S 12th St & E Carson St 30
Forbes Ave & Market Square 26

7 Which are the most unpopular stations?

It would be useful to know of extenuating circumstances here — were these installed during the quarter? Were they closed? It is impossible to say based on this dataset alone.

  select stations.name, count(*) [count]
    from stations 
    join rides on rides.from_id=stations.id or rides.to_id=stations.id 
group by stations.id 
order by [count] asc
   limit 10;
name count
Hamilton Ave & N Lang Ave 2
Bennett St & Eccrue Way 4
E Liberty Blvd & Larimer Ave 4
Hamilton Ave & N Dallas Ave 6
Fifth Ave & Miltenberger 8
Frankstown Ave & E Liberty Blvd 10
Hamilton Ave & Zodiac Way 11
Centre Ave & Kirkpatrick St 12
Forbes Ave & Murray Ave 12
Liberty Ave & Fisk St 18

8 What is station-use by day of the week?

I haven't decided what the best format for this would be. I think it could make sense as a kind of sparkline plot or general multiplot but there are so many stations that it would still be very visually dense.

  select from_id [departing station], strftime('%w', starttime) day, count(*) [count]
    from rides
group by [departing station], day
order by [departing station], day
   limit 14;
departing station day count
1000 0 51
1000 1 16
1000 2 28
1000 3 26
1000 4 28
1000 5 25
1000 6 44
1001 0 22
1001 1 33
1001 2 30
1001 3 44
1001 4 50
1001 5 31
1001 6 27

9 Bikes returned to near-by racks

I'm not totally satisfied with this query. I had hoped that it might be possible to identify when the source bike rack was at capacity but there doesn't seem to be enough information here to make such a determination.

select usertype, s1.name [from], s2.name [to], (duration/60) minutes, distance
  from rides r 
  join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id 
  join stations s1 on r.from_id=s1.id
  join stations s2 on r.to_id=s2.id
 where r.from_id <> r.to_id
   and distance < 300
   and duration > 600
 limit 10;
usertype from to minutes distance
Customer Liberty Ave & 6th St Forbes Ave & Market Square 31 145.239650442889
Customer Forbes Ave & Grant St Smithfield St & Fourth Ave 17 180.100642583674
Customer Third Ave & Wood St Smithfield St & Fourth Ave 59 187.374093661187
Customer Smithfield St & Fourth Ave Third Ave & Wood St 12 187.374093661187
Customer Tennyson Ave & Fifth Ave Schenley Dr & Forbes Ave (Schenley Plaza) 26 292.364498319696
Customer Forbes Ave & Market Square Third Ave & Wood St 15 229.868921089702
Subscriber Fort Duquesne Blvd & 7th 9th St & Penn Ave 10 248.06802267547
Subscriber Third Ave & Wood St Smithfield St & Fourth Ave 26 187.374093661187
Subscriber Wood St & Sixth Ave Liberty Ave & 6th St 13 200.711833366798
Customer Butler St & 48th St Butler St & Stanton Ave 34 286.217564216965