I have a data set representing the NYC Citibike trips. The data looks like this:
> str(citibank)
spec_tbl_df [33,912,996 × 15] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
$ tripduration : num [1:33912996] 680 1282 648 631 621 ...
$ starttime : POSIXct[1:33912996], format: "2017-01-01 00:00:21" "2017-01-01 00:00:45" "2017-01-01 00:00:57" "2017-01-01 00:01:10" ...
$ stoptime : POSIXct[1:33912996], format: "2017-01-01 00:11:41" "2017-01-01 00:22:08" "2017-01-01 00:11:46" "2017-01-01 00:11:42" ...
$ start.station.id : num [1:33912996] 3226 3263 3143 3143 3143 ...
$ start.station.name : chr [1:33912996] "W 82 St & Central Park West" "Cooper Square & E 7 St" "5 Ave & E 78 St" "5 Ave & E 78 St" ...
$ start.station.latitude : num [1:33912996] 40.8 40.7 40.8 40.8 40.8 ...
$ start.station.longitude: num [1:33912996] -74 -74 -74 -74 -74 ...
$ end.station.id : num [1:33912996] 3165 498 3152 3152 3152 ...
$ end.station.name : chr [1:33912996] "Central Park West & W 72 St" "Broadway & W 32 St" "3 Ave & E 71 St" "3 Ave & E 71 St" ...
$ end.station.latitude : num [1:33912996] 40.8 40.7 40.8 40.8 40.8 ...
$ end.station.longitude : num [1:33912996] -74 -74 -74 -74 -74 ...
$ bikeid : num [1:33912996] 25542 21136 18147 21211 26819 ...
$ usertype : chr [1:33912996] "Subscriber" "Subscriber" "Customer" "Customer" ...
$ birth.year : int [1:33912996] 1965 1987 NA NA NA 2000 1973 1977 1989 1980 ...
$ gender : num [1:33912996] 2 2 0 0 0 1 1 2 1 1 ...
I want to calculate the number of times commuters travel between each two locations (from start.station.id
to end.station.id
), but I don't know how to do it. I tried using dplyr
functions group_by
and summarise
but I did not obtain the results I wanted, I am not very familiar with the package.
Can someone please give me a hint as to how to do it? Thank you in advance.
EDIT
Here is the dput()
output:
dput(head(citibank))
structure(list(tripduration = c(680, 1282, 648, 631, 621, 666),
starttime = structure(c(1483228821, 1483228845, 1483228857,
1483228870, 1483228885, 1483228911), tzone = "UTC", class =
c("POSIXct", "POSIXt")), stoptime = structure(c(1483229501,
1483230128, 1483229506,1483229502, 1483229507, 1483229577),
tzone = "UTC", class = c("POSIXct","POSIXt")),
`start station id` = c(3226, 3263, 3143, 3143,3143, 3163),
`start station name` = c("W 82 St & Central Park West", "Cooper Square & E 7 St", "5 Ave & E 78 St", "5 Ave & E 78 St", "5 Ave & E 78 St", "Central Park West & W 68 St"),
`start station latitude` = c(40.78275, 40.7292364991001, 40.7768286343997, 40.7768286343997, 40.7768286343997,40.7734066),
`start station longitude` = c(-73.97137,-73.9908680319786,-73.9638876914978, -73.9638876914978, -73.9638876914978, -73.97782542),
`end station id` = c(3165, 498, 3152, 3152, 3152, 3163),
`end station name` = c("Central Park West & W 72 St","Broadway & W 32 St", "3 Ave & E 71 St", "3 Ave & E 71 St", "3 Ave & E 71 St","Central Park West & W 68 St"),
`end station latitude` = c(40.7757937668367,40.74854862, 40.76873687, 40.76873687, 40.76873687, 40.7734066),
`end station longitude` = c(-73.9762057363987, -73.98808416, -73.96119945, -73.96119945, -73.96119945, -73.97782542),
bikeid = c(25542,21136, 18147, 21211, 26819, 16050),
usertype = c("Subscriber", "Subscriber", "Customer", "Customer", "Customer", "Subscriber"),
`birth year` = c(1965L, 1987L, NA, NA, NA, 2000L),
gender = c(2,2, 0, 0, 0, 1)), row.names = c(NA, -6L),
class = c("tbl_df", "tbl", "data.frame"))
CodePudding user response:
I highly recommend reading R For Data Science
library(dplyr)
citibank %>%
#group the data using the start and end stations
group_by(start.station.id, end.station.id) %>%
#count the number of records for each combination of stations
summarise(count = n())