I have a dataset with two columns, "ID" and "CODCOM" with about 1 milion of rows. The first column "ID" contain duplicate values.
ID | CODCOM |
---|---|
10000 | 12 |
101010 | 14 |
201020 | 11 |
201020 | 11 |
201020 | 12 |
324032 | 43 |
324032 | 43 |
324032 | 43 |
405044 | 51 |
323032 | 21 |
I want to group "ID" duplicated values in different groups, then calculate the mode for each groups, after that I want to create a new column with the related mode values. Something like this:
ID | CODCOM | NEW_COL |
---|---|---|
10000 | 12 | 12 |
101010 | 14 | 14 |
201020 | 11 | 11 |
201020 | 11 | 11 |
201020 | 12 | 11 |
324032 | 43 | 43 |
324032 | 43 | 43 |
324032 | 43 | 43 |
405044 | 51 | 51 |
323032 | 21 | 43 |
How can I do this in an easy way?
Thank you so much in advance for any help provided.
CodePudding user response:
A dplyr approach where I join the data to a version of itself with just the most-common CODCOM value (or first appearing with ties).
library(dplyr)
df1 %>%
left_join(
df1 %>%
group_by(ID) %>%
count(mode = CODCOM, sort = TRUE) %>%
slice(1),
by = "ID"
)
ID CODCOM mode n
1 10000 12 12 1
2 101010 14 14 1
3 201020 11 11 2
4 201020 11 11 2
5 201020 12 11 2
6 324032 43 43 3
7 324032 43 43 3
8 324032 43 43 3
9 405044 51 51 1
10 323032 21 21 1
CodePudding user response:
Please find below one solution using the package data.table
:
REPREX
- Code
library(data.table)
# Function to compute mode
mode_compute <- function(x) {
uniqx <- unique(x)
uniqx[which.max(tabulate(match(x, uniqx)))]
}
# Compute mode by ID
DT[ , MODE := mode_compute(CODCOM), by = ID]
- Output
DT
#> ID CODCOM MODE
#> 1: 10000 12 12
#> 2: 101010 14 14
#> 3: 201020 11 11
#> 4: 201020 11 11
#> 5: 201020 12 11
#> 6: 324032 43 43
#> 7: 324032 43 43
#> 8: 324032 43 43
#> 9: 405044 51 51
#> 10: 323032 21 21
- Data:
# Data
DT <- data.table(ID = c("10000", "101010", "201020", "201020", "201020",
"324032", "324032", "324032", "405044", "323032"),
CODCOM = c(12, 14, 11, 11, 12, 43, 43, 43, 51, 21))
DT
#> ID CODCOM
#> 1: 10000 12
#> 2: 101010 14
#> 3: 201020 11
#> 4: 201020 11
#> 5: 201020 12
#> 6: 324032 43
#> 7: 324032 43
#> 8: 324032 43
#> 9: 405044 51
#> 10: 323032 21
Created on 2021-10-11 by the reprex package (v0.3.0)
CodePudding user response:
If I understand you correctly: we could group_by
ID
and then use summarise
the mode
of the mode function:
If you don't want to summarise
you could use mutate
instead (will keep all rows)!
library(dplyr)
mode <- function(codes){
which.max(tabulate(codes))
}
df %>%
as_tibble() %>%
group_by(ID) %>%
summarise(NEW_COL = mode(CODCOM))
ID NEW_COL
<int> <int>
1 10000 12
2 101010 14
3 201020 11
4 323032 21
5 324032 43
6 405044 51
CodePudding user response:
Base R solution:
# Option 1 using TarJae's mode function:
# Apply the function groupwise, store result as vector:
# NEW_COL => integer vector
df$NEW_COL <- with(
df,
ave(
CODCOM,
ID,
FUN = function(x){
which.max(tabulate(x))
}
)
)
# Option two:
# Function to calculate the mode of a vector:
# mode_statistic => function()
mode_statistic <- function(x){
# Calculate the mode: res => vector
res <- names(
head(
sort(
table(
x
),
decreasing = TRUE
),
1
)
)
# Explicitly define returned object: character vector => env
return(res)
}
# Apply the function groupwise, store result as vector:
# NEW_COL => integer vector
df$NEW_COL <- with(
df,
ave(
CODCOM,
ID,
FUN = function(x){
as.integer(
mode_statistic(x)
)
}
)
)