I'm trying to modify my dataframe according to the value in one column and having the most common value in another column.
df <- data.frame(points=c(1, 2, 4, 3, 4, 8, 3, 3, 2),
assists=c(6, 6, 5, 6, 6, 9, 9, 1, 1),
team=c('A', 'A', 'A', 'A', 'A', 'C', 'C', 'C', 'C'))
points assists team
1 1 6 A
2 2 6 A
3 4 5 A
4 3 6 A
5 4 6 A
6 8 9 C
7 3 9 C
8 3 1 C
9 2 1 C
to look like this:
df2 <- data.frame(points=c(1, 2, 3, 4, 8, 3),
assists=c(6, 6, 6, 6, 1, 1),
team=c('A', 'A', 'A', 'A', 'C', 'C'))
points assists team
1 1 6 A
2 2 6 A
3 3 6 A
4 4 6 A
5 8 1 C
6 3 1 C
The goal is to keep all rows that have the values A and C in the "team" column as long as in the "assists" column the most common value ("6" for "A" ) is kept. If there is a tie (such as "9" and "1" for "C") the last most common value should be kept.
I do this with a for loop but my dataframe has 3,000,000 rows and the process was very slow. Does anyone know a faster alternative?
CodePudding user response:
We could modify the Mode function and do a group by approach to filter
library(dplyr)
Mode <- function(x) {
# get the unique elements
ux <- unique(x)
# convert to integer index with match and get the frequency
# tabulate should be faster than table
freq <- tabulate(match(x, ux))
# use == on the max of the freq, get the corresponding ux
# then get the last elements of ux
last(ux[freq == max(freq)])
}
df %>%
# grouped by team
group_by(team) %>%
# filter only those assists that are returned from Mode function
filter(assists %in% Mode(assists)) %>%
ungroup
-output
# A tibble: 6 × 3
points assists team
<dbl> <dbl> <chr>
1 1 6 A
2 2 6 A
3 3 6 A
4 4 6 A
5 3 1 C
6 2 1 C
Or may use data.table
methods for a faster execution
library(data.table)
# setDT - converts data.frame to data.table
# create a frequency column (`.N`) by assignment (`:=`)
# grouped by team, assists columns
setDT(df)[, N := .N, by = .(team, assists)]
# grouped by team, get the index of the max N from reverse (`.N:1`)
#subset the assists with that index
# create a logical vector with %in%
# get the row index -.I, which creates a default column V1
# extract the column ($V1) and use that to subset the data
df[df[, .I[assists %in% assists[.N - which.max(N[.N:1]) 1]],
by = team]$V1][, N := NULL][]
points assists team
<num> <num> <char>
1: 1 6 A
2: 2 6 A
3: 3 6 A
4: 4 6 A
5: 3 1 C
6: 2 1 C