I have a large data frame of station-to-station bike trips, and my goal is to identify the most common combinations of start and end stations. My df looks like this:
df <- data.frame(
  start_station = c('Apple', 'Bungalow', 'Carrot', 'Apple', 'Apple', 'Bungalow'),
  end_station   = c('Bungalow', 'Apple', 'Carrot', 'Bungalow', 'Bungalow', 'Apple'),
  start_lat     = c(12.3456, 23.4567, 34.5678, 12.3456, 12.3456, 23.4567),
  start_lng     = c(09.8765, 98.7654, 87.6543, 09.8765, 09.8765, 98.7654)
)
Ideally, I want an output that ranks the station combinations in descending order of frequency and adds a new column, 'ride_count', that reflects the number of rides for each exact combination.
In the above example, I want the output to be a new data frame that I can further manipulate/visualize:
start_station  end_station  ride_count  start_lat  start_lng
Apple          Bungalow              3    12.3456    09.8765
Bungalow       Apple                 2    23.4567    98.7654
Carrot         Carrot                1    34.5678    87.6543
Per a previous recommendation, the 'count()' command does perform the correct calculation; however, I lose the other data associated with each station, such as start_lat and start_lng.
Is there a way to preserve these columns?
A sincere thanks to anyone for their assistance. I've been plowing through this project efficiently, but I'm really struggling with this final geographic element.
CodePudding user response:
If we assume that other non-grouping fields are always invariant within the group, then we can do this:
library(dplyr)
df %>%
  group_by(start_station, end_station) %>%
  summarize(n = n(), across(everything(), first), .groups = "drop")
# # A tibble: 3 x 5
#   start_station end_station     n start_lat start_lng
#   <chr>         <chr>       <int>     <dbl>     <dbl>
# 1 Apple         Bungalow        3      12.3      9.88
# 2 Bungalow      Apple           2      23.5      98.8
# 3 Carrot        Carrot          1      34.6      87.7
If anything is variable, however, then you need to consider how to aggregate each column individually.
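For example, a minimal sketch of per-column aggregation, assuming (as an illustrative choice, not something from the question) that averaging coordinates which differ within a station pair is acceptable:
df %>%
  group_by(start_station, end_station) %>%
  summarize(
    ride_count = n(),
    # aggregate each remaining column explicitly; mean() is an assumed choice
    across(c(start_lat, start_lng), mean),
    .groups = "drop"
  ) %>%
  arrange(desc(ride_count))  # guarantee the descending ranking requested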
CodePudding user response:
df <- data.frame(
  start_station = c('Apple', 'Bungalow', 'Carrot', 'Apple', 'Apple', 'Bungalow'),
  end_station   = c('Bungalow', 'Apple', 'Carrot', 'Bungalow', 'Bungalow', 'Apple')
)
df_counts <- tapply(df$start_station,
                    paste(df$start_station, df$end_station),
                    length) |>
  as.data.frame() |>
  `colnames<-`(c('count'))
idx <- order(df_counts$count, decreasing = TRUE)
df_counts <- df_counts[ idx, , drop = FALSE]
print(df_counts)
               count
Apple Bungalow     3
Bungalow Apple     2
Carrot Carrot      1
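This drops the coordinate columns the question wants to keep. One possible base-R sketch for getting them back (assuming the full df with start_lat/start_lng from the question, and that the coordinates are constant within each station pair) is to count with aggregate() and merge the de-duplicated station metadata back in:
counts <- aggregate(list(ride_count = df$start_station),
                    by = df[c('start_station', 'end_station')],
                    FUN = length)
coords <- unique(df[c('start_station', 'end_station', 'start_lat', 'start_lng')])
out    <- merge(counts, coords, by = c('start_station', 'end_station'))
out    <- out[order(out$ride_count, decreasing = TRUE), ]  # rank by frequency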
CodePudding user response:
Edited dplyr approach. This adds a count for each combination of stations and then removes duplicate rows. Depending on the contents of the other columns, it may not 'summarize' things the way you demonstrated (it works in the example because lat and lng are consistent for each station).
library(tidyverse)
df <- data.frame(
  start_station = c('Apple', 'Bungalow', 'Carrot', 'Apple', 'Apple', 'Bungalow'),
  end_station   = c('Bungalow', 'Apple', 'Carrot', 'Bungalow', 'Bungalow', 'Apple'),
  start_lat     = c(12.3456, 23.4567, 34.5678, 12.3456, 12.3456, 23.4567),
  start_lng     = c(09.8765, 98.7654, 87.6543, 09.8765, 09.8765, 98.7654)
)
df %>%
  add_count(start_station, end_station, name = 'ride_count') %>%
  distinct()
#>   start_station end_station start_lat start_lng ride_count
#> 1         Apple    Bungalow   12.3456    9.8765          3
#> 2      Bungalow       Apple   23.4567   98.7654          2
#> 3        Carrot      Carrot   34.5678   87.6543          1
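Note that distinct() just keeps the first occurrence of each row, so the output above is in input order and only happens to be descending; to guarantee the requested ranking, you could append an arrange() step, for example:
df %>%
  add_count(start_station, end_station, name = 'ride_count') %>%
  distinct() %>%
  arrange(desc(ride_count))  # explicit descending ranking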