Home > database >  R: Count number of trips between two locations
R: Count number of trips between two locations

Time:03-19

I have a data set representing the NYC Citibike trips. The data looks like this:

> str(citibank)
spec_tbl_df [33,912,996 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ tripduration           : num [1:33912996] 680 1282 648 631 621 ...
$ starttime              : POSIXct[1:33912996], format: "2017-01-01 00:00:21" "2017-01-01 00:00:45" "2017-01-01 00:00:57" "2017-01-01 00:01:10" ...
$ stoptime               : POSIXct[1:33912996], format: "2017-01-01 00:11:41" "2017-01-01 00:22:08" "2017-01-01 00:11:46" "2017-01-01 00:11:42" ...
$ start.station.id       : num [1:33912996] 3226 3263 3143 3143 3143 ...
$ start.station.name     : chr [1:33912996] "W 82 St & Central Park West" "Cooper Square & E 7 St" "5 Ave & E 78 St" "5 Ave & E 78 St" ...
$ start.station.latitude : num [1:33912996] 40.8 40.7 40.8 40.8 40.8 ...
$ start.station.longitude: num [1:33912996] -74 -74 -74 -74 -74 ...
$ end.station.id         : num [1:33912996] 3165 498 3152 3152 3152 ...
$ end.station.name       : chr [1:33912996] "Central Park West & W 72 St" "Broadway & W 32 St" "3 Ave & E 71 St" "3 Ave & E 71 St" ...
$ end.station.latitude   : num [1:33912996] 40.8 40.7 40.8 40.8 40.8 ...
$ end.station.longitude  : num [1:33912996] -74 -74 -74 -74 -74 ...
$ bikeid                 : num [1:33912996] 25542 21136 18147 21211 26819 ...
$ usertype               : chr [1:33912996] "Subscriber" "Subscriber" "Customer" "Customer" ...
$ birth.year             : int [1:33912996] 1965 1987 NA NA NA 2000 1973 1977 1989 1980 ...
$ gender                 : num [1:33912996] 2 2 0 0 0 1 1 2 1 1 ...

I want to calculate the number of times commuters travel between each two locations (from start.station.id to end.station.id), but I don't know how to do it. I tried using dplyr functions group_by and summarise but I did not obtain the results I wanted, I am not very familiar with the package.

Can someone please give me a hint as to how to do it? Thank you in advance.

EDIT

Here is the dput() output:

dput(head(citibank))
structure(list(tripduration = c(680, 1282, 648, 631, 621, 666), 
starttime = structure(c(1483228821, 1483228845, 1483228857, 
1483228870, 1483228885, 1483228911), tzone = "UTC", class = 
c("POSIXct", "POSIXt")), stoptime = structure(c(1483229501, 
1483230128, 1483229506,1483229502, 1483229507, 1483229577), 
tzone = "UTC", class = c("POSIXct","POSIXt")), 
`start station id` = c(3226, 3263, 3143, 3143,3143, 3163), 
`start station name` = c("W 82 St & Central Park West", "Cooper Square & E 7 St", "5 Ave & E 78 St", "5 Ave & E 78 St", "5 Ave & E 78 St", "Central Park West & W 68 St"), 
`start station latitude` = c(40.78275, 40.7292364991001, 40.7768286343997, 40.7768286343997, 40.7768286343997,40.7734066), 
`start station longitude` = c(-73.97137,-73.9908680319786,-73.9638876914978, -73.9638876914978, -73.9638876914978, -73.97782542), 
`end station id` = c(3165, 498, 3152, 3152, 3152, 3163), 
`end station name` = c("Central Park West & W 72 St","Broadway & W 32 St", "3 Ave & E 71 St", "3 Ave & E 71 St", "3 Ave & E 71 St","Central Park West & W 68 St"), 
`end station latitude` = c(40.7757937668367,40.74854862, 40.76873687, 40.76873687, 40.76873687, 40.7734066), 
`end station longitude` = c(-73.9762057363987, -73.98808416, -73.96119945, -73.96119945, -73.96119945, -73.97782542), 
bikeid = c(25542,21136, 18147, 21211, 26819, 16050), 
usertype = c("Subscriber", "Subscriber", "Customer", "Customer", "Customer", "Subscriber"), 
`birth year` = c(1965L, 1987L, NA, NA, NA, 2000L), 
gender = c(2,2, 0, 0, 0, 1)), row.names = c(NA, -6L), 
class = c("tbl_df", "tbl", "data.frame"))

CodePudding user response:

I highly recommend reading R For Data Science

library(dplyr)

citibank %>% 
  #group the data using the start and end stations
  group_by(start.station.id, end.station.id) %>% 
  #count the number of records for each combination of stations
  summarise(count = n())
  • Related