Bike Share Data

2021-02-20

1. How many bikes and stations are there?
2. What is the ratio of customers to subscribers?
- 2.1. What is the break-down of short rides between usertypes?
3. Average rides per-bike in Q1 2019?
4. What is the median idle time between rides across all bikes?
5. Distance travelled per bike
- 5.1. summary information
6. Which are the most popular stations for circuits?
7. Which are the most unpopular stations?
8. What is station-use by day of the week?
9. Bikes returned to near-by racks

Being veritably snowed in today I thought I might circle back through the publicly available bike-share data for the city of Pittsburgh and see if there was anything of interest to be gleaned.

When I wrote previously about gathering the data I got side-tracked with developing my own SQLite module and calculating distances. Here I try to put that information to use. The question outlined here are just things that seemed interesting and I was capable of querying for.

1 How many bikes and stations are there?

select count(distinct(bikeid)) bikes from rides;

bikes
488

select count(distinct(id)) stations from stations;

stations
108

2 What is the ratio of customers to subscribers?

There isn't too much to this, I think it would be more interesting if I knew how subscribers were counted. There is a tie-in with the city metro cards that allows for free rentals under 15 minutes.

select usertype, count(*) [count] from rides group by usertype;

usertype	count
Customer	2751
Subscriber	4542

2.1 What is the break-down of short rides between usertypes?

  select usertype, count(*) [count]
    from rides
   where duration < (15*60)
group by usertype

usertype	count
Customer	1041
Subscriber	3348

Based on these differences, where subscribers account for 62% of all rides but take 76% of these short trips I guess that the metro card holders are counted among subscribers rather than customers.

3 Average rides per-bike in Q1 2019?

select round(avg(numrides)) [average rides]
from (
    select bikeid, count(*) numrides
      from rides
  group by bikeid);

average rides
15.0

4 What is the median idle time between rides across all bikes?

I'm pretty sure I've worked out this query correctly. By taking the difference in time between rentals per-bike I count "idle time". Populating a temporary table just makes it easier to calculate the median.

create temporary table idletime (
    bikeid,
    seconds
);

insert into idletime
select bikeid,
       strftime('%s', starttime) - strftime('%s', lag(endtime) over
                                                  (partition by bikeid
                                                       order by endtime))
  from rides;

select (avg(seconds)/60)/60 hours
  from (select seconds
          from idletime
      order by seconds
         limit 2 - (select count(*) from idletime) % 2    -- odd 1, even 2
        offset (select (count(*) - 1) / 2
                  from idletime))

hours
30.1833333333333

5 Distance travelled per bike

This was easy but there is too much data for the raw results to be very insightful.

with [per bike distance] as (
  select bikeid, starttime, pd.distance
    from rides r
    join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id 
)
  select bikeid, round(sum(distance)) total
    from [per bike distance] 
group by bikeid 
order by total desc
limit 10;

bikeid	total
70611	74711.0
70511	73499.0
70672	68515.0
70631	67937.0
70525	66937.0
70649	65583.0
70532	65553.0
70538	64952.0
70608	64374.0
70583	57860.0

5.1 summary information

create temporary table summary (
    bikeid,
    distance
);

with [per bike distance] as (
  select bikeid, starttime, pd.distance
    from rides r
    join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id 
) insert into summary
  select bikeid, sum(distance)
    from [per bike distance] 
group by bikeid;

select round(min(distance)) minimum, round(max(distance)) maximum, round(avg(distance)) average 
  from summary;

select round(avg(distance)) median
  from (select distance
          from summary
      order by distance
         limit 2 - (select count(*) from summary) % 2    -- odd 1, even 2
        offset (select (count(*) - 1) / 2
                  from summary))

minimum	maximum	average	median
714.0	74711.0	22193.0	19921.0

6 Which are the most popular stations for circuits?

I refer to any trip that starts and ends at the same station as a circuit. This is probably best seen as a proxy for sightseeing locations.

  select name, count(*) circuits 
    from rides 
    join stations on from_id=stations.id 
   where from_id = to_id 
group by from_id 
order by circuits desc 
limit 10;

name	circuits
33rd St and Penn Ave	103
S 27th St & Sidney St. (Southside Works)	101
Liberty Ave & Stanwix St	78
North Shore Trail & Ft Duquesne Bridge	67
S 18th St & Sidney St	60
Isabella St & Federal St (PNC Park)	36
First Ave & B St (T Station)	35
Third Ave & Wood St	32
S 12th St & E Carson St	30
Forbes Ave & Market Square	26

7 Which are the most unpopular stations?

It would be useful to know of extenuating circumstances here — were these installed during the quarter? Were they closed? It is impossible to say based on this dataset alone.

  select stations.name, count(*) [count]
    from stations 
    join rides on rides.from_id=stations.id or rides.to_id=stations.id 
group by stations.id 
order by [count] asc
   limit 10;

name	count
Hamilton Ave & N Lang Ave	2
Bennett St & Eccrue Way	4
E Liberty Blvd & Larimer Ave	4
Hamilton Ave & N Dallas Ave	6
Fifth Ave & Miltenberger	8
Frankstown Ave & E Liberty Blvd	10
Hamilton Ave & Zodiac Way	11
Centre Ave & Kirkpatrick St	12
Forbes Ave & Murray Ave	12
Liberty Ave & Fisk St	18

8 What is station-use by day of the week?

I haven't decided what the best format for this would be. I think it could make sense as a kind of sparkline plot or general multiplot but there are so many stations that it would still be very visually dense.

  select from_id [departing station], strftime('%w', starttime) day, count(*) [count]
    from rides
group by [departing station], day
order by [departing station], day
   limit 14;

departing station	day	count
1000	0	51
1000	1	16
1000	2	28
1000	3	26
1000	4	28
1000	5	25
1000	6	44
1001	0	22
1001	1	33
1001	2	30
1001	3	44
1001	4	50
1001	5	31
1001	6	27

9 Bikes returned to near-by racks

I'm not totally satisfied with this query. I had hoped that it might be possible to identify when the source bike rack was at capacity but there doesn't seem to be enough information here to make such a determination.

select usertype, s1.name [from], s2.name [to], (duration/60) minutes, distance
  from rides r 
  join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id 
  join stations s1 on r.from_id=s1.id
  join stations s2 on r.to_id=s2.id
 where r.from_id <> r.to_id
   and distance < 300
   and duration > 600
 limit 10;

usertype	from	to	minutes	distance
Customer	Liberty Ave & 6th St	Forbes Ave & Market Square	31	145.239650442889
Customer	Forbes Ave & Grant St	Smithfield St & Fourth Ave	17	180.100642583674
Customer	Third Ave & Wood St	Smithfield St & Fourth Ave	59	187.374093661187
Customer	Smithfield St & Fourth Ave	Third Ave & Wood St	12	187.374093661187
Customer	Tennyson Ave & Fifth Ave	Schenley Dr & Forbes Ave (Schenley Plaza)	26	292.364498319696
Customer	Forbes Ave & Market Square	Third Ave & Wood St	15	229.868921089702
Subscriber	Fort Duquesne Blvd & 7th	9th St & Penn Ave	10	248.06802267547
Subscriber	Third Ave & Wood St	Smithfield St & Fourth Ave	26	187.374093661187
Subscriber	Wood St & Sixth Ave	Liberty Ave & 6th St	13	200.711833366798
Customer	Butler St & 48th St	Butler St & Stanton Ave	34	286.217564216965