indexpost archiveatom feed

# Bike Share Data

2021-02-20

Being veritably snowed in today I thought I might circle back through the publicly available bike-share data for the city of Pittsburgh and see if there was anything of interest to be gleaned.

When I wrote previously about gathering the data I got side-tracked with developing my own SQLite module and calculating distances. Here I try to put that information to use. The question outlined here are just things that seemed interesting and I was capable of querying for.

## 1 How many bikes and stations are there?

```select count(distinct(bikeid)) bikes from rides;
```
bikes
488
```select count(distinct(id)) stations from stations;
```
stations
108

## 2 What is the ratio of customers to subscribers?

There isn't too much to this, I think it would be more interesting if I knew how subscribers were counted. There is a tie-in with the city metro cards that allows for free rentals under 15 minutes.

```select usertype, count(*) [count] from rides group by usertype;
```
usertype count
Customer 2751
Subscriber 4542

### 2.1 What is the break-down of short rides between usertypes?

```  select usertype, count(*) [count]
from rides
where duration < (15*60)
group by usertype
```
usertype count
Customer 1041
Subscriber 3348

Based on these differences, where subscribers account for 62% of all rides but take 76% of these short trips I guess that the metro card holders are counted among subscribers rather than customers.

## 3 Average rides per-bike in Q1 2019?

```select round(avg(numrides)) [average rides]
from (
select bikeid, count(*) numrides
from rides
group by bikeid);
```
average rides
15.0

## 4 What is the median idle time between rides across all bikes?

I'm pretty sure I've worked out this query correctly. By taking the difference in time between rentals per-bike I count "idle time". Populating a temporary table just makes it easier to calculate the median.

```create temporary table idletime (
bikeid,
seconds
);

insert into idletime
select bikeid,
strftime('%s', starttime) - strftime('%s', lag(endtime) over
(partition by bikeid
order by endtime))
from rides;

select (avg(seconds)/60)/60 hours
from (select seconds
from idletime
order by seconds
limit 2 - (select count(*) from idletime) % 2    -- odd 1, even 2
offset (select (count(*) - 1) / 2
from idletime))
```
hours
30.1833333333333

## 5 Distance travelled per bike

This was easy but there is too much data for the raw results to be very insightful.

```with [per bike distance] as (
select bikeid, starttime, pd.distance
from rides r
join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id
)
select bikeid, round(sum(distance)) total
from [per bike distance]
group by bikeid
order by total desc
limit 10;
```
bikeid total
70611 74711.0
70511 73499.0
70672 68515.0
70631 67937.0
70525 66937.0
70649 65583.0
70532 65553.0
70538 64952.0
70608 64374.0
70583 57860.0

### 5.1 summary information

```create temporary table summary (
bikeid,
distance
);

with [per bike distance] as (
select bikeid, starttime, pd.distance
from rides r
join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id
) insert into summary
select bikeid, sum(distance)
from [per bike distance]
group by bikeid;

select round(min(distance)) minimum, round(max(distance)) maximum, round(avg(distance)) average
from summary;

select round(avg(distance)) median
from (select distance
from summary
order by distance
limit 2 - (select count(*) from summary) % 2    -- odd 1, even 2
offset (select (count(*) - 1) / 2
from summary))
```
minimum maximum average median
714.0 74711.0 22193.0 19921.0

## 6 Which are the most popular stations for circuits?

I refer to any trip that starts and ends at the same station as a circuit. This is probably best seen as a proxy for sightseeing locations.

```  select name, count(*) circuits
from rides
join stations on from_id=stations.id
where from_id = to_id
group by from_id
order by circuits desc
limit 10;
```
name circuits
33rd St and Penn Ave 103
S 27th St & Sidney St. (Southside Works) 101
Liberty Ave & Stanwix St 78
North Shore Trail & Ft Duquesne Bridge 67
S 18th St & Sidney St 60
Isabella St & Federal St (PNC Park) 36
First Ave & B St (T Station) 35
Third Ave & Wood St 32
S 12th St & E Carson St 30
Forbes Ave & Market Square 26

## 7 Which are the most unpopular stations?

It would be useful to know of extenuating circumstances here — were these installed during the quarter? Were they closed? It is impossible to say based on this dataset alone.

```  select stations.name, count(*) [count]
from stations
join rides on rides.from_id=stations.id or rides.to_id=stations.id
group by stations.id
order by [count] asc
limit 10;
```
name count
Hamilton Ave & N Lang Ave 2
Bennett St & Eccrue Way 4
E Liberty Blvd & Larimer Ave 4
Hamilton Ave & N Dallas Ave 6
Fifth Ave & Miltenberger 8
Frankstown Ave & E Liberty Blvd 10
Hamilton Ave & Zodiac Way 11
Centre Ave & Kirkpatrick St 12
Forbes Ave & Murray Ave 12
Liberty Ave & Fisk St 18

## 8 What is station-use by day of the week?

I haven't decided what the best format for this would be. I think it could make sense as a kind of sparkline plot or general multiplot but there are so many stations that it would still be very visually dense.

```  select from_id [departing station], strftime('%w', starttime) day, count(*) [count]
from rides
group by [departing station], day
order by [departing station], day
limit 14;
```
departing station day count
1000 0 51
1000 1 16
1000 2 28
1000 3 26
1000 4 28
1000 5 25
1000 6 44
1001 0 22
1001 1 33
1001 2 30
1001 3 44
1001 4 50
1001 5 31
1001 6 27

## 9 Bikes returned to near-by racks

I'm not totally satisfied with this query. I had hoped that it might be possible to identify when the source bike rack was at capacity but there doesn't seem to be enough information here to make such a determination.

```select usertype, s1.name [from], s2.name [to], (duration/60) minutes, distance
from rides r
join pairdistance pd on r.from_id=pd.from_id and r.to_id=pd.to_id
join stations s1 on r.from_id=s1.id
join stations s2 on r.to_id=s2.id
where r.from_id <> r.to_id
and distance < 300
and duration > 600
limit 10;
```
usertype from to minutes distance
Customer Liberty Ave & 6th St Forbes Ave & Market Square 31 145.239650442889
Customer Forbes Ave & Grant St Smithfield St & Fourth Ave 17 180.100642583674
Customer Third Ave & Wood St Smithfield St & Fourth Ave 59 187.374093661187
Customer Smithfield St & Fourth Ave Third Ave & Wood St 12 187.374093661187
Customer Tennyson Ave & Fifth Ave Schenley Dr & Forbes Ave (Schenley Plaza) 26 292.364498319696
Customer Forbes Ave & Market Square Third Ave & Wood St 15 229.868921089702
Subscriber Fort Duquesne Blvd & 7th 9th St & Penn Ave 10 248.06802267547
Subscriber Third Ave & Wood St Smithfield St & Fourth Ave 26 187.374093661187
Subscriber Wood St & Sixth Ave Liberty Ave & 6th St 13 200.711833366798
Customer Butler St & 48th St Butler St & Stanton Ave 34 286.217564216965