Second Most Common Element in Each Row

Time:06-15

I have this dataset in R:

library(stringr)

set.seed(999)

col1 = sample.int(5, 100, replace = TRUE)
col2 = sample.int(5, 100, replace = TRUE)
col3 = sample.int(5, 100, replace = TRUE)
col4 = sample.int(5, 100, replace = TRUE)
col5 = sample.int(5, 100, replace = TRUE)
col6 = sample.int(5, 100, replace = TRUE)
col7 = sample.int(5, 100, replace = TRUE)
col8 = sample.int(5, 100, replace = TRUE)
col9 = sample.int(5, 100, replace = TRUE)
col10 = sample.int(5, 100, replace = TRUE)

# one row per id; each seq is a 100-character string of digits 1-5
d = data.frame(
  id = 1:10,
  seq = c(paste(col1, collapse = ""), paste(col2, collapse = ""),
          paste(col3, collapse = ""), paste(col4, collapse = ""),
          paste(col5, collapse = ""), paste(col6, collapse = ""),
          paste(col7, collapse = ""), paste(col8, collapse = ""),
          paste(col9, collapse = ""), paste(col10, collapse = ""))
)

For each row, I would like to create new variables:

  • d$most_common: the most common element in each row
  • d$second_most_common: the second most common element in each row
  • d$third_most_common: the third most common element in each row

I tried to do this with the following function (Find the most frequent value by row):

rowMode <- function(x, ties = NULL, include.na = FALSE) {
  # input checks data
  if ( !(is.matrix(x) | is.data.frame(x)) ) {
    stop("Your data is not a matrix or a data.frame.")
  }
  # input checks ties method
  if ( !is.null(ties) && !(ties %in% c("random", "first", "last")) ) {
    stop("Your ties method is not one of 'random', 'first' or 'last'.")
  }
  # set ties method to 'random' if not specified
  if ( is.null(ties) ) ties <- "random"
  
  # create row frequency table
  rft <- table(c(row(x)), unlist(x), useNA = c("no","ifany")[1L + include.na])
  
  # get the mode for each row
  colnames(rft)[max.col(rft, ties.method = ties)]
}

rowMode(d[1,1])

This gave me an error:

Error in rowMode(d[1, 1]) : Your data is not a matrix or a data.frame.

Which is a bit confusing, seeing as "d" is a data.frame.

  • Is there an easier way to do this?

Thank you!

CodePudding user response:

You can do this by splitting the long string into single characters, pivoting to long format, counting occurrences by id and character, and keeping the top 3 per id.

Here is an approach using data.table

library(data.table)
setDT(d)
melt(d[, tstrsplit(seq,""), id], id.vars = "id")[, .N, .(id, value)][order(-N), .SD[1:3][,nth:=.I], id]

Output (first six rows of 30):

    id value  N nth
 1:  2     2 30   1
 2:  2     1 22   2
 3:  2     4 19   3
 4:  3     3 28   1
 5:  3     2 23   2
 6:  3     4 20   3

Here is a similar approach using dplyr with unnest() to make long:

library(dplyr)
library(tidyr)

d %>% 
  group_by(id) %>% 
  mutate(chars = strsplit(seq, "")) %>% 
  unnest(chars) %>%
  count(id, chars, sort = TRUE) %>% 
  slice_head(n = 3)

Output (first 10 rows of 30):

      id chars     n
   <int> <chr> <int>
 1     1 1        24
 2     1 5        20
 3     1 2        19
 4     2 2        30
 5     2 1        22
 6     2 4        19
 7     3 3        28
 8     3 2        23
 9     3 4        20
10     4 1        26
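
The same split-and-count idea can also be sketched in base R if you want the three columns from the question directly. This is only a sketch: top_chars is a hypothetical helper name, d_toy is a made-up stand-in for d, and the toy strings avoid tied counts, so tie handling is not shown. A string with fewer than n distinct characters would yield NA for the missing ranks.

```r
# Hypothetical helper: the n most frequent characters of one string,
# most frequent first.
top_chars <- function(s, n = 3) {
  counts <- table(strsplit(s, "")[[1]])   # frequency of each character
  names(sort(counts, decreasing = TRUE))[seq_len(n)]
}

# Toy stand-in for d (made-up strings, no tied counts)
d_toy <- data.frame(id = 1:2,
                    seq = c("1112233334", "5555443"),
                    stringsAsFactors = FALSE)

# One row of top-3 characters per input string
top <- t(vapply(d_toy$seq, top_chars, character(3), USE.NAMES = FALSE))

d_toy$most_common        <- top[, 1]
d_toy$second_most_common <- top[, 2]
d_toy$third_most_common  <- top[, 3]

d_toy
```

To apply it to the real data, replace d_toy with d in the last five statements.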

CodePudding user response:

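If you need the variables most_common, second_most_common and third_most_common as columns:

You can use mutate() with str_count(), which counts how often each digit occurs in the string. The position of the largest, second-largest and third-largest count then gives the corresponding digit.
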
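library(dplyr)
library(stringr)

# range of digits that can occur in seq
r <- 1:5 |> as.character()

# str_count(seq, r) gives one count per digit in r
# note: when two digits have the same count, the same digit can be
# reported in more than one column
d |> 
  group_by(id) |> 
  mutate(most_common = which(unique(str_count(seq, r)) == last(sort(str_count(seq, r)))),
         second_most_common = first(which(str_count(seq, r) == nth(sort(str_count(seq, r)), length(r) - 1))),
         third_most_common = first(which(str_count(seq, r) == nth(sort(str_count(seq, r)), length(r) - 2))))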

Output:

      id seq                                                                                 most_common second_most_com… third_most_comm…
   <int> <chr>                                                                                     <int>            <int>            <int>
 1     1 3451122353321532415512241532113224441251251254542314534141431523132515542431525553…           1                5                2
 2     2 1432431521432121553144243252433424314222143112242423421524144222151123234314255321…           2                1                4
 3     3 4232245131422525453443332555312143535325221555344453323342533222344112134311342335…           3                2                4
 4     4 4252525524252335331144111244343534224454131341553141342131354215143133213214314241…           1                3                4
 5     5 2223513245222513345115334422121115412343225125312335414233115453235322543311352331…           3                2                1
 6     6 3244331444151221411123513334135553324122122233134315145451545423111325253225325141…           1                1                2
 7     7 4353332532552141211553131123521145214552211231144155553152131124221522333222343355…           5                1                3
 8     8 1432215433134223221222143432454314232514255344213444342235252213324245413213554121…           2                4                3
 9     9 2335142431432434123121254343455134511124323335211514354553145531115232541551252421…           1                1                3
10    10 1552245312213342315524134513123511112311314321112334533252141242212345432435421535…           1                3                2
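
On the error in the question: d[1, 1] selects a single cell (row 1 of the id column), and R drops a single cell to a plain vector, so the is.data.frame() check inside rowMode() fails even though d itself is a data.frame. A quick check on a made-up stand-in for d (d_toy and its strings are assumptions for illustration only):

```r
# Toy stand-in for the question's data frame (strings are made up)
d_toy <- data.frame(id = 1:2, seq = c("11223", "33312"))

cell <- d_toy[1, 1]                # single cell: dropped to a bare vector
row1 <- d_toy[1, , drop = FALSE]   # whole first row: still a data.frame

is.data.frame(cell)   # FALSE
is.data.frame(row1)   # TRUE
```

Passing a whole row gets past the input check, but rowMode() expects one value per column, not one long string per row, so the string would still have to be split into characters first.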
