I have this dataset in R:
library(stringr)
set.seed(999)
col1 = sample.int(5, 100, replace = TRUE)
col2 = sample.int(5, 100, replace = TRUE)
col3 = sample.int(5, 100, replace = TRUE)
col4 = sample.int(5, 100, replace = TRUE)
col5 = sample.int(5, 100, replace = TRUE)
col6 = sample.int(5, 100, replace = TRUE)
col7 = sample.int(5, 100, replace = TRUE)
col8 = sample.int(5, 100, replace = TRUE)
col9 = sample.int(5, 100, replace = TRUE)
col10 = sample.int(5, 100, replace = TRUE)
d = data.frame(id = 1:10, seq = c(paste(col1, collapse = ""), paste(col2, collapse = ""), paste(col3, collapse = ""), paste(col4, collapse = ""), paste(col5, collapse = ""), paste(col6, collapse = ""), paste(col7, collapse = ""), paste(col8, collapse = ""), paste(col9, collapse = ""), paste(col10, collapse = "")))
For each row, I would like to create new variables:
- d$most_common: the most common element in each row
- d$second_most_common: the second most common element in each row
- d$third_most_common: the third most common element in each row
I tried to do this with the following function (Find the most frequent value by row):
rowMode <- function(x, ties = NULL, include.na = FALSE) {
  # input checks data
  if ( !(is.matrix(x) | is.data.frame(x)) ) {
    stop("Your data is not a matrix or a data.frame.")
  }
  # input checks ties method
  if ( !is.null(ties) && !(ties %in% c("random", "first", "last")) ) {
    stop("Your ties method is not one of 'random', 'first' or 'last'.")
  }
  # set ties method to 'random' if not specified
  if ( is.null(ties) ) ties <- "random"
  # create row frequency table
  rft <- table(c(row(x)), unlist(x), useNA = c("no","ifany")[1L + include.na])
  # get the mode for each row
  colnames(rft)[max.col(rft, ties.method = ties)]
}
rowMode(d[1,1])
This gave me an error:
Error in rowMode(d[1, 1]) : Your data is not a matrix or a data.frame.
Which is a bit confusing, seeing as "d" is a data.frame.
- Is there an easier way to do this?
Thank you!
CodePudding user response:
You can do this by splitting the long string into individual characters, pivoting to long format, counting occurrences by id and character, and taking the top 3 for each id.
Here is an approach using data.table:
library(data.table)
setDT(d)
melt(d[, tstrsplit(seq,""), id], id.vars = "id")[, .N, .(id, value)][order(-N), .SD[1:3][,nth:=.I], id]
Output (first six rows of 30):
id value N nth
1: 2 2 30 1
2: 2 1 22 2
3: 2 4 19 3
4: 3 3 28 1
5: 3 2 23 2
6: 3 4 20 3
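If you want the result as separate columns rather than in long format, one option is base R's reshape(). This is only a sketch: the data frame `long` below is a hypothetical stand-in that copies the six rows printed above, and the final column names are taken from the question.

```r
# Hypothetical stand-in for the long result (values copied from the output above)
long <- data.frame(
  id    = c(2, 2, 2, 3, 3, 3),
  value = c("2", "1", "4", "3", "2", "4"),
  nth   = c(1, 2, 3, 1, 2, 3)
)

# One row per id, one column per rank (value.1, value.2, value.3)
wide <- reshape(long, idvar = "id", timevar = "nth", direction = "wide")

# Rename to the column names asked for in the question
names(wide) <- c("id", "most_common", "second_most_common", "third_most_common")
wide
```

The result can then be merged back onto d by id if the original seq column is still needed.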
Here is a similar approach using dplyr, with unnest() to make the data long:
library(dplyr)
library(tidyr)  # for unnest()
d %>%
  group_by(id) %>%
  mutate(chars = strsplit(seq, "")) %>%
  unnest(chars) %>%
  count(id, chars, sort = TRUE) %>%
  slice_head(n = 3)
Output (first 10 rows):
id chars n
<int> <chr> <int>
1 1 1 24
2 1 5 20
3 1 2 19
4 2 2 30
5 2 1 22
6 2 4 19
7 3 3 28
8 3 2 23
9 3 4 20
10 4 1 26
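As for the error in the question: d[1, 1] selects a single cell, and by default a one-column selection from a data.frame is dropped to a plain vector, so rowMode() receives an integer rather than a data.frame. A small sketch of that behaviour, using a made-up two-row data frame called `toy` (not the real data):

```r
# Toy data with the same column layout as d (values are made up)
toy <- data.frame(id = 1:2, seq = c("11223", "33312"))

cell    <- toy[1, 1]                 # single cell: dropped to a plain vector
one_row <- toy[1, ]                  # whole row: still a data.frame
kept    <- toy[1, 1, drop = FALSE]   # drop = FALSE keeps the data.frame class

class(cell)
is.data.frame(one_row)
is.data.frame(kept)
```

Even with a data.frame as input, rowMode() compares the values across columns, so it would see the whole seq string as one value; the string has to be split into characters first, as in the approaches above.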
CodePudding user response:
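If you need the variables most_common, second_most_common and third_most_common as columns:
You can use mutate and str_count, which counts how often each character occurs in the string; the code then looks up the position of the highest, second-highest and third-highest count.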
library(dplyr)
library(stringr)
# range of characters to count
r <- 1:5 |> as.character()
d |>
  group_by(id) |>
  mutate(most_common = first(which(str_count(seq, r) == last(sort(str_count(seq, r))))),
         second_most_common = first(which(str_count(seq, r) == nth(sort(str_count(seq, r)), length(r) - 1))),
         third_most_common = first(which(str_count(seq, r) == nth(sort(str_count(seq, r)), length(r) - 2))))
id seq most_common second_most_com… third_most_comm…
<int> <chr> <int> <int> <int>
1 1 3451122353321532415512241532113224441251251254542314534141431523132515542431525553… 1 5 2
2 2 1432431521432121553144243252433424314222143112242423421524144222151123234314255321… 2 1 4
3 3 4232245131422525453443332555312143535325221555344453323342533222344112134311342335… 3 2 4
4 4 4252525524252335331144111244343534224454131341553141342131354215143133213214314241… 1 3 4
5 5 2223513245222513345115334422121115412343225125312335414233115453235322543311352331… 3 2 1
6 6 3244331444151221411123513334135553324122122233134315145451545423111325253225325141… 1 1 2
7 7 4353332532552141211553131123521145214552211231144155553152131124221522333222343355… 5 1 3
8 8 1432215433134223221222143432454314232514255344213444342235252213324245413213554121… 2 4 3
9 9 2335142431432434123121254343455134511124323335211514354553145531115232541551252421… 1 1 3
10 10 1552245312213342315524134513123511112311314321112334533252141242212345432435421535… 1 3 2
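In the output above, ids 6 and 9 show the same value for most_common and second_most_common, which is what you would expect when two counts are tied, because the lookup then matches the same position for both ranks. If you would rather get three distinct characters, here is a base R sketch. The helper name `top_chars`, its `k` argument, and the toy strings are my own assumptions, and ties are broken by character order, which may or may not be what you want.

```r
# Hypothetical helper: the k most frequent characters of one string,
# ties broken by character order so the result is deterministic
top_chars <- function(s, k = 3) {
  counts <- table(strsplit(s, "")[[1]])
  ord <- order(-as.integer(counts), names(counts))
  names(counts)[ord][seq_len(k)]
}

# Toy data (made up); replace with the real d
toy <- data.frame(id = 1:2, seq = c("1122233345", "5554431"))

# vapply returns a 3 x nrow matrix, so transpose to one row per id
top <- t(vapply(as.character(toy$seq), top_chars, character(3), USE.NAMES = FALSE))
toy$most_common        <- top[, 1]
toy$second_most_common <- top[, 2]
toy$third_most_common  <- top[, 3]
toy
```

If a string has fewer than k distinct characters, the missing ranks come back as NA.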