R Create a data frame summarising per column unique values from another data frame and sort

Time:07-13

This question follows on from a discussion begun here. I felt the further elaboration warranted a new question, as it diverges from the question in the original post. Apologies if I am mistaken in doing so.

Given a data frame similar to the following:

mydf <- data.frame(X1=c("0_times","3-10_times", "11-20_times", "1-2_times","3-10_times",
                        "0_times","3-10_times", "11-20_times", "1-2_times","3-10_times" ),
                   X2=c('ab','bb','cb','db','eb','ab','bb','cb','db','eb'),
                   X3=c("11-20_times", "3-10_times","1-2_times","21-30_times","more_than_30_times",
                        "11-20_times", "3-10_times","1-2_times","21-30_times","more_than_30_times"),
                   X4=c("foo", "bar","fizz","buzz","weee","foo", "bar","fizz","buzz","weee"),
                   X5=c("3-10_times","1-2_times","0_times","more_than_30_times","11-20_times",
                        "21-30_times","1-2_times","0_times","3-10_times","11-20_times")
                   )

I would like to create a second data frame to store the column names and a list/vector of unique values from the first dataframe.

Resulting in something like:

  names vals
1    X1 0_times, 1-2_times, 3-10_times, 11-20_times
2    X2 ab, bb, cb, db, eb
3    X3 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times
4    X4 foo, bar, fizz, buzz, weee
5    X5 0_times, 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times

I create the second data frame using:

mydf2 <- data.frame(names = colnames(mydf))
mydf2$vals <- lapply(mydf, unique)

I think this is okay so far. However, I need the vectors whose values contain digits (here X1, X3 and X5) to be sorted in ascending numeric order, comparing the whole leading number of each item rather than just its first digit.

With some great help from a Stack user, this was suggested as a means to sort the vectors containing digits and it works perfectly on a single vector:

mylist <- c('0_times','3-10_times','11_20_times','1-2_times','more_than_20_times')

o <- sapply(strsplit(mylist, '\\D+'), function(x) min(as.numeric(x[nzchar(x)])))
mylist[order(o)]
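
To make the mechanism visible, the same steps can be run one at a time. This is only a sketch: it assumes the split pattern is '\\D+' (one or more non-digit characters), and the name parts is just illustrative.

```r
mylist <- c('0_times', '3-10_times', '11_20_times', '1-2_times', 'more_than_20_times')

# Split each string on runs of non-digits, leaving only the digit groups
# (plus an empty string when the value starts with a non-digit)
parts <- strsplit(mylist, '\\D+')

# Drop the empty strings and keep the smallest number found in each item
o <- sapply(parts, function(x) min(as.numeric(x[nzchar(x)])))
o
# [1]  0  3 11  1 20

# order(o) returns the positions that put those numbers in ascending order
mylist[order(o)]
# [1] "0_times"  "1-2_times"  "3-10_times"  "11_20_times"  "more_than_20_times"
```

Because the sort key is numeric, '11_20_times' lands after '3-10_times', which a plain character sort would not do.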

When I attempt to apply this to the entire mydf2$vals column by substituting in the column name:

o <- sapply(strsplit(mydf2$vals, '\\D+'), function(x) min(as.numeric(x[nzchar(x)])))
mydf2$vals[order(o)]

I receive the following error: error in evaluating the argument 'X' in selecting a method for function 'sapply': non-character argument

So I have 2 questions:

  1. Is there a simpler way of achieving my aims?
  2. How can I amend the suggested sort function so the error doesn't occur?

CodePudding user response:

You could wrap @Allan Cameron's logic in a function that extends it to sort each column's unique values. If the column contains any digits, which grepl will tell us, we apply the numeric logic; otherwise we fall back to a plain sort. I called it makeLevels because it might be useful for creating factor levels as well.

makeLevels <- \(x) {
  if (any(grepl('\\d', x))) {
    unique(x[order(sapply(strsplit(x, '\\D+'), function(x) min(as.numeric(x[nzchar(x)]))))])
  } else {
    sort(unique(x))
  }
}

lapply(names(mydf), \(x) data.frame(names=x, vals=toString(makeLevels(mydf[[x]])))) |>
  do.call(what=rbind)
#   names                                                                         vals
# 1    X1                                  0_times, 1-2_times, 3-10_times, 11-20_times
# 2    X2                                                           ab, bb, cb, db, eb
# 3    X3          1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times
# 4    X4                                                   bar, buzz, fizz, foo, weee
# 5    X5 0_times, 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times

Or do something like this,

lapply(names(mydf), \(x) sprintf('$ %s <%s>: %s', x, class(mydf[[x]]), toString(makeLevels(mydf[[x]])))) |>
  do.call(what=rbind)
#      [,1]                                                                                            
# [1,] "$ X1 <character>: 0_times, 1-2_times, 3-10_times, 11-20_times"                                 
# [2,] "$ X2 <character>: ab, bb, cb, db, eb"                                                          
# [3,] "$ X3 <character>: 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times"         
# [4,] "$ X4 <character>: bar, buzz, fizz, foo, weee"                                                  
# [5,] "$ X5 <character>: 0_times, 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times"

which is similar to what str(mydf) does. In fact, if you just want to read the result, you could lapply makeLevels over the data frame and call str on it:

str(lapply(mydf, makeLevels), vec.len=10L)
# List of 5
#  $ X1: chr [1:4] "0_times" "1-2_times" "3-10_times" "11-20_times"
#  $ X2: chr [1:5] "ab" "bb" "cb" "db" "eb"
#  $ X3: chr [1:5] "1-2_times" "3-10_times" "11-20_times" "21-30_times" "more_than_30_times"
#  $ X4: chr [1:5] "bar" "buzz" "fizz" "foo" "weee"
#  $ X5: chr [1:6] "0_times" "1-2_times" "3-10_times" "11-20_times" "21-30_times" "more_than_30_times"
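
On the second question specifically: the error is consistent with mydf2$vals being a list (one character vector per column), whereas strsplit only accepts a character vector. A minimal sketch of one way around it, assuming the '\\D+' split pattern and using a cut-down, two-column stand-in for mydf, is to apply the sorting function to each column with lapply instead of to the list as a whole. The last lines also show the factor-level use mentioned earlier.

```r
# Cut-down stand-in for mydf (two columns only, for illustration)
mydf <- data.frame(
  X1 = c("0_times", "3-10_times", "11-20_times", "1-2_times", "3-10_times"),
  X4 = c("foo", "bar", "fizz", "buzz", "weee"),
  stringsAsFactors = FALSE
)

# Same logic as makeLevels, repeated here so the snippet runs on its own
makeLevels <- function(x) {
  if (any(grepl("\\d", x))) {
    # sort by the smallest number found in each string
    key <- sapply(strsplit(x, "\\D+"), function(p) min(as.numeric(p[nzchar(p)])))
    unique(x[order(key)])
  } else {
    sort(unique(x))
  }
}

# Build the summary with a list column: one sorted vector per source column
mydf2 <- data.frame(names = colnames(mydf))
mydf2$vals <- lapply(mydf, makeLevels)

# Reuse the sorted values as factor levels
f <- factor(mydf$X1, levels = makeLevels(mydf$X1))
levels(f)
# [1] "0_times"  "1-2_times"  "3-10_times"  "11-20_times"
```

Keeping vals as a list column preserves the individual values; wrap each element in toString, as in the data frame version, if a single display string per row is preferred.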

...

CodePudding user response:

Perhaps just reorder the rows of mydf first, pasting a leading 0 onto the X1 values that do not start with 11 so that they sort correctly as text:

mydf <- mydf[order(ifelse(substr(mydf$X1,1,2)!=11, paste0("0",mydf$X1), mydf$X1)),]

Then run the same code above for mydf2, and you will get this:

  names                                        vals
1    X1 0_times, 1-2_times, 3-10_times, 11-20_times
2    X2                          ab, db, bb, eb, cb

Is that the expected output?
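
The check for 11 only covers this particular set of values. A more general variant of the same zero-padding idea might look like the sketch below. The names first_num and key are hypothetical, and it assumes every value contains at least one number: the leading number of each value is padded to a fixed width so that plain character ordering agrees with numeric ordering.

```r
x <- c("0_times", "3-10_times", "11-20_times", "1-2_times", "21-30_times")

# Extract the first run of digits from each value
first_num <- as.integer(sub("\\D*(\\d+).*", "\\1", x))

# Zero-pad to three digits so "003" sorts before "011" as text
key <- sprintf("%03d", first_num)

x[order(key)]
# [1] "0_times"  "1-2_times"  "3-10_times"  "11-20_times"  "21-30_times"
```

The width of 3 is arbitrary; it only needs to be at least as wide as the largest number that occurs.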
