A faster conditional subset-CodePudding

I'm trying to modify my dataframe according to the value in one column and having the most common value in another column.

df <- data.frame(points=c(1, 2, 4, 3, 4, 8, 3, 3, 2),
                 assists=c(6, 6, 5, 6, 6, 9, 9, 1, 1),
                 team=c('A', 'A', 'A', 'A', 'A', 'C', 'C', 'C', 'C'))

  points assists team
1      1       6    A
2      2       6    A
3      4       5    A
4      3       6    A
5      4       6    A
6      8       9    C
7      3       9    C
8      3       1    C
9      2       1    C

to look like this:

df2 <- data.frame(points=c(1, 2, 3, 4, 8, 3),
                 assists=c(6, 6, 6, 6, 1, 1),
                 team=c('A', 'A', 'A', 'A', 'C', 'C'))

  points assists team
1      1       6    A
2      2       6    A
3      3       6    A
4      4       6    A
5      8       1    C
6      3       1    C

The goal is to keep all rows that have the values A and C in the "team" column as long as in the "assists" column the most common value ("6" for "A" ) is kept. If there is a tie (such as "9" and "1" for "C") the last most common value should be kept.

I do this with a for loop but my dataframe has 3,000,000 rows and the process was very slow. Does anyone know a faster alternative?

CodePudding user response：

We could modify the Mode function and do a group by approach to filter

library(dplyr)
 Mode <- function(x) {
   # get the unique elements
   ux <- unique(x)
   # convert to integer index with match and get the frequency
   # tabulate should be faster than table
   freq <- tabulate(match(x, ux))
   # use == on the max of the freq, get the corresponding ux
   # then get the last elements of ux
   last(ux[freq == max(freq)])
 }
df %>% 
   # grouped by team
   group_by(team) %>% 
   # filter only those assists that are returned from Mode function
   filter(assists %in% Mode(assists)) %>%
   ungroup

-output

# A tibble: 6 × 3
  points assists team 
   <dbl>   <dbl> <chr>
1      1       6 A    
2      2       6 A    
3      3       6 A    
4      4       6 A    
5      3       1 C    
6      2       1 C

Or may use data.table methods for a faster execution

library(data.table)
# setDT - converts data.frame to data.table
# create a frequency column (`.N`) by assignment (`:=`)
# grouped by team, assists columns
setDT(df)[, N := .N, by = .(team, assists)]
# grouped by team, get the index of the max N from reverse (`.N:1`)
#subset the assists with that index
# create a logical vector with %in%
# get the row index -.I, which creates a default column V1
# extract the column ($V1) and use that to subset the data
df[df[, .I[assists %in% assists[.N - which.max(N[.N:1])   1]],
     by = team]$V1][, N := NULL][]
   points assists   team
    <num>   <num> <char>
1:      1       6      A
2:      2       6      A
3:      3       6      A
4:      4       6      A
5:      3       1      C
6:      2       1      C