R: How to count frequency of combinations between two string columns (using dplyr, tidyr, or base r)

Time:11-02

I have a large data frame of station-to-station bike trips, and my goal is to identify the most common combinations of start and end stations. My df looks like this:

df <- data.frame(start_station = c('Apple', 'Bungalow', 'Carrot', 'Apple', 'Apple', 'Bungalow'),
                 end_station = c('Bungalow', 'Apple', 'Carrot', 'Bungalow', 'Bungalow', 'Apple'),
                 start_lat = c(12.3456, 23.4567, 34.5678, 12.3456, 12.3456, 23.4567),
                 start_lng = c(09.8765, 98.7654, 87.6543, 09.8765, 09.8765, 98.7654)
)

Ideally, I want an output that creates a ranked list of station combinations in descending order of frequency, with a new column, 'ride_count', that reflects how many times each exact combination occurs.

In the above example, I want the output to be a new dataframe I can further manipulate/visualize

start_station  end_station  ride_count  start_lat  start_lng
Apple          Bungalow     3           12.3456    09.8765
Bungalow       Apple        2           23.4567    98.7654
Carrot         Carrot       1           34.5678    87.6543

Per a previous recommendation, the count() function does perform the correct calculation; however, I lose the other data associated with each station, like start_lat and start_lng.

Is there a way to preserve these columns?

A sincere thanks to anyone for their assistance. I've been plowing through this project efficiently, but I'm really struggling with this final geographic element.

CodePudding user response:

If we assume that other non-grouping fields are always invariant within the group, then we can do this:

library(dplyr)
df %>%
  group_by(start_station, end_station) %>%
  summarize(n = n(), across(everything(), first), .groups = "drop")
# # A tibble: 3 x 5
#   start_station end_station     n start_lat start_lng
#   <chr>         <chr>       <int>     <dbl>     <dbl>
# 1 Apple         Bungalow        3      12.3      9.88
# 2 Bungalow      Apple           2      23.5     98.8 
# 3 Carrot        Carrot          1      34.6     87.7 

If anything is variable, however, then you need to consider how to aggregate each column individually.
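For example, if the coordinates varied slightly from ride to ride (GPS jitter, say), one option is to collapse them to a per-group mean instead of taking the first value. A sketch of that idea, reusing the toy data (with mean() as an assumed-acceptable summary for the coordinates):

```r
library(dplyr)

df <- data.frame(start_station = c('Apple', 'Bungalow', 'Carrot', 'Apple', 'Apple', 'Bungalow'),
                 end_station   = c('Bungalow', 'Apple', 'Carrot', 'Bungalow', 'Bungalow', 'Apple'),
                 start_lat = c(12.3456, 23.4567, 34.5678, 12.3456, 12.3456, 23.4567),
                 start_lng = c(09.8765, 98.7654, 87.6543, 09.8765, 09.8765, 98.7654))

agg <- df %>%
  group_by(start_station, end_station) %>%
  summarize(ride_count = n(),
            # average the coordinates within each station pair;
            # with constant coordinates this reduces to the same value
            across(c(start_lat, start_lng), mean),
            .groups = "drop") %>%
  arrange(desc(ride_count))

agg
```

Any other summary (median(), first non-NA, etc.) can be dropped into across() in the same way, column by column.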

CodePudding user response:

df <- data.frame(start_station = c('Apple', 'Bungalow', 'Carrot', 'Apple', 'Apple', 'Bungalow'),
                 end_station = c('Bungalow', 'Apple', 'Carrot', 'Bungalow', 'Bungalow', 'Apple'))
df_counts <- tapply(df$start_station, 
                    paste(df$start_station, df$end_station),
                    length) |> 
  as.data.frame() |>
  `colnames<-`(c('count'))
idx <- order(df_counts$count, decreasing = TRUE)
df_counts <- df_counts[ idx, , drop = FALSE]
print(df_counts)
               count
Apple Bungalow     3
Bungalow Apple     2
Carrot Carrot      1
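Note that this version drops start_lat and start_lng. If you need those back in base R as well, one sketch (assuming, as above, that coordinates are constant within a station pair) is to count with aggregate() and merge the unique coordinates back on:

```r
df <- data.frame(start_station = c('Apple', 'Bungalow', 'Carrot', 'Apple', 'Apple', 'Bungalow'),
                 end_station   = c('Bungalow', 'Apple', 'Carrot', 'Bungalow', 'Bungalow', 'Apple'),
                 start_lat = c(12.3456, 23.4567, 34.5678, 12.3456, 12.3456, 23.4567),
                 start_lng = c(09.8765, 98.7654, 87.6543, 09.8765, 09.8765, 98.7654))

# count rides per start/end combination
counts <- aggregate(list(ride_count = df$start_station),
                    by = df[c("start_station", "end_station")],
                    FUN = length)

# one row of coordinates per combination, then join the counts back on
coords <- unique(df[c("start_station", "end_station", "start_lat", "start_lng")])
res <- merge(counts, coords)
res <- res[order(-res$ride_count), ]
res
```

merge() joins on the shared start_station/end_station columns, so the coordinates survive alongside the new ride_count.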

CodePudding user response:

Edited dplyr approach. This adds a count for each combination of stations and then removes the duplicate rows. Note that this may not collapse rows the way you demonstrated if other columns vary within a combination; it works here because lat and lng are consistent for each station.

library(tidyverse)

df <- data.frame(start_station = c('Apple', 'Bungalow', 'Carrot', 'Apple', 'Apple', 'Bungalow'),
                 end_station = c('Bungalow', 'Apple', 'Carrot', 'Bungalow', 'Bungalow', 'Apple'),
                 start_lat = c(12.3456, 23.4567, 34.5678, 12.3456, 12.3456, 23.4567),
                 start_lng = c(09.8765, 98.7654, 87.6543, 09.8765, 09.8765, 98.7654)
)

df %>%
  add_count(start_station, end_station, name = 'ride_count') %>%
  distinct()
#>   start_station end_station start_lat start_lng ride_count
#> 1         Apple    Bungalow   12.3456    9.8765          3
#> 2      Bungalow       Apple   23.4567   98.7654          2
#> 3        Carrot      Carrot   34.5678   87.6543          1