Home > OS >  Rearrange rows and calculate mode in R by creating a new variable
Rearrange rows and calculate mode in R by creating a new variable

Time:10-12

I have a dataset with two columns, "ID" and "CODCOM" with about 1 milion of rows. The first column "ID" contain duplicate values.

ID CODCOM
10000 12
101010 14
201020 11
201020 11
201020 12
324032 43
324032 43
324032 43
405044 51
323032 21

I want to group "ID" duplicated values in different groups, then calculate the mode for each groups, after that I want to create a new column with the related mode values. Something like this:

ID CODCOM NEW_COL
10000 12 12
101010 14 14
201020 11 11
201020 11 11
201020 12 11
324032 43 43
324032 43 43
324032 43 43
405044 51 51
323032 21 43

How can I do this in an easy way?

Thank you so much in advance for any help provided.

CodePudding user response:

A dplyr approach where I join the data to a version of itself with just the most-common CODCOM value (or first appearing with ties).

library(dplyr)
df1 %>%
  left_join(
    df1 %>%
      group_by(ID) %>%
      count(mode = CODCOM, sort = TRUE) %>%
      slice(1),
    by = "ID"
  )


       ID CODCOM mode n
1   10000     12   12 1
2  101010     14   14 1
3  201020     11   11 2
4  201020     11   11 2
5  201020     12   11 2
6  324032     43   43 3
7  324032     43   43 3
8  324032     43   43 3
9  405044     51   51 1
10 323032     21   21 1

CodePudding user response:

Please find below one solution using the package data.table:

REPREX

  • Code
library(data.table)


# Function to compute mode
mode_compute <- function(x) {
  uniqx <- unique(x)
  uniqx[which.max(tabulate(match(x, uniqx)))]
}

# Compute mode by ID
DT[ , MODE := mode_compute(CODCOM), by = ID]
  • Output
DT
#>         ID CODCOM MODE
#>  1:  10000     12   12
#>  2: 101010     14   14
#>  3: 201020     11   11
#>  4: 201020     11   11
#>  5: 201020     12   11
#>  6: 324032     43   43
#>  7: 324032     43   43
#>  8: 324032     43   43
#>  9: 405044     51   51
#> 10: 323032     21   21
  • Data:
# Data
DT <- data.table(ID = c("10000", "101010", "201020", "201020", "201020",
                 "324032", "324032", "324032", "405044", "323032"),
                 CODCOM = c(12, 14, 11, 11, 12, 43, 43, 43, 51, 21))
DT
#>         ID CODCOM
#>  1:  10000     12
#>  2: 101010     14
#>  3: 201020     11
#>  4: 201020     11
#>  5: 201020     12
#>  6: 324032     43
#>  7: 324032     43
#>  8: 324032     43
#>  9: 405044     51
#> 10: 323032     21

Created on 2021-10-11 by the reprex package (v0.3.0)

CodePudding user response:

If I understand you correctly: we could group_by ID and then use summarise the mode of the mode function:

If you don't want to summarise you could use mutate instead (will keep all rows)!

library(dplyr)

mode <- function(codes){
  which.max(tabulate(codes))
}

df %>% 
  as_tibble() %>% 
  group_by(ID) %>% 
  summarise(NEW_COL = mode(CODCOM))
      ID NEW_COL
   <int>   <int>
1  10000      12
2 101010      14
3 201020      11
4 323032      21
5 324032      43
6 405044      51

CodePudding user response:

Base R solution:

# Option 1 using TarJae's mode function:
# Apply the function groupwise, store result as vector:
# NEW_COL => integer vector
df$NEW_COL <- with(
  df,
  ave(
    CODCOM,
    ID,
    FUN = function(x){
      which.max(tabulate(x))
    }
  )
)

# Option two:
# Function to calculate the mode of a vector: 
# mode_statistic => function()
mode_statistic <- function(x){
  # Calculate the mode: res => vector
  res <- names(
    head(
      sort(
        table(
          x
        ),
        decreasing = TRUE
      ),
      1
    )
  )
  # Explicitly define returned object: character vector => env
  return(res)
}

# Apply the function groupwise, store result as vector:
# NEW_COL => integer vector
df$NEW_COL <- with(
  df,
  ave(
    CODCOM,
    ID,
    FUN = function(x){
      as.integer(
        mode_statistic(x)
      )
    }
  )
)
  • Related