This question follows on from a discussion begun here. I felt the further elaboration warranted posting a new question, as it diverges from the question in the original post. Apologies if I am mistaken in doing so.
Given a data frame similar to the following:
mydf <- data.frame(X1=c("0_times","3-10_times", "11-20_times", "1-2_times","3-10_times",
"0_times","3-10_times", "11-20_times", "1-2_times","3-10_times" ),
X2=c('ab','bb','cb','db','eb','ab','bb','cb','db','eb'),
X3=c("11-20_times", "3-10_times","1-2_times","21-30_times","more_than_30_times",
"11-20_times", "3-10_times","1-2_times","21-30_times","more_than_30_times"),
X4=c("foo", "bar","fizz","buzz","weee","foo", "bar","fizz","buzz","weee"),
X5=c("3-10_times","1-2_times","0_times","more_than_30_times","11-20_times",
"21-30_times","1-2_times","0_times","3-10_times","11-20_times")
)
I would like to create a second data frame that stores each column name alongside a list/vector of the unique values in that column of the first data frame.
This should result in something like:
names vals
1 X1 0_times, 1-2_times, 3-10_times, 11-20_times
2 X2 ab, bb, cb, db, eb
3 X3 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times
4 X4 foo, bar, fizz, buzz, weee
5 X5 0_times, 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times
I create the second data frame using:
mydf2 <- data.frame(names = colnames(mydf))
mydf2$vals <- lapply(mydf, unique)
I think this is okay so far. However, the challenge I have is that I need the vectors containing digits (for example, the values taken from mydf$X1) to be sorted in ascending numeric order, using more than just the first digit of each item.
With some great help from a Stack user, this was suggested as a means to sort the vectors containing digits and it works perfectly on a single vector:
mylist <- c('0_times','3-10_times','11_20_times','1-2_times','more_than_20_times')
o <- sapply(strsplit(mylist, '\\D+'), function(x) min(as.numeric(x[nzchar(x)])))
mylist[order(o)]
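To make the sort key easier to follow, here is a small sketch of the intermediate steps. It assumes the split pattern is '\\D+' (one or more non-digit characters), and the names parts and key are only illustrative; the example vector is adapted from the one above.

```r
# Sketch of the sort key: split on runs of non-digits, keep the smallest number.
mylist <- c("0_times", "3-10_times", "11-20_times", "1-2_times", "more_than_20_times")

# Each element becomes a character vector of its digit runs
# (possibly with an empty string where the value starts with text).
parts <- strsplit(mylist, "\\D+")

# Drop empty strings, convert to numeric, and use the minimum as the key.
key <- sapply(parts, function(p) min(as.numeric(p[nzchar(p)])))

key                  # 0 3 11 1 20
mylist[order(key)]   # values reordered by their smallest number
```

Because the key is numeric, 11 sorts after 3, which a plain character sort would not guarantee.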
When I attempt to apply this to the entire mydf2$vals column by substituting in the column name:
o <- sapply(strsplit(mydf2$vals, '\\D+'), function(x) min(as.numeric(x[nzchar(x)])))
mydf2$vals[order(o)]
I receive the following error: "error in evaluating the argument 'X' in selecting a method for function 'sapply': non-character argument".
So I have 2 questions:
- Is there a simpler way of achieving my aims?
- How can I amend the suggested sort function so the error doesn't occur?
CodePudding user response:
You could wrap @Allan Cameron's logic in a function that extends it to sort each column's unique values. If the values include digits, which grepl will tell us, we apply the numeric logic; otherwise we sort them alphabetically. I called it makeLevels because it might be useful for creating factor levels as well.
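makeLevels <- \(x) {
  if (any(grepl('\\d', x))) {
    # order by the smallest number found in each value, then drop duplicates
    unique(x[order(sapply(strsplit(x, '\\D+'), function(x) min(as.numeric(x[nzchar(x)]))))])
  } else {
    # no digits present: plain alphabetical sort
    sort(unique(x))
  }
}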
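As a hedged illustration of the factor-level idea, the sketch below restates the function with function() instead of the \(x) shorthand (so it should also run on R versions before 4.1) and applies it to a short made-up vector x1. The names x1 and f are assumptions for the example only.

```r
# Restated helper: numeric-aware ordering of unique values.
makeLevels <- function(x) {
  if (any(grepl("\\d", x))) {
    key <- sapply(strsplit(x, "\\D+"), function(p) min(as.numeric(p[nzchar(p)])))
    unique(x[order(key)])
  } else {
    sort(unique(x))
  }
}

x1 <- c("0_times", "3-10_times", "11-20_times", "1-2_times", "3-10_times")

# Use the ordered unique values as the levels of an ordered factor.
f <- factor(x1, levels = makeLevels(x1), ordered = TRUE)

levels(f)               # "0_times" "1-2_times" "3-10_times" "11-20_times"
as.character(sort(f))   # sorting the factor now follows the level order
```

Once the column is a factor with these levels, sorting, tabulating and plotting should follow the numeric order without re-deriving the key each time.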
lapply(names(mydf), \(x) data.frame(names=x, vals=toString(makeLevels(mydf[[x]])))) |>
do.call(what=rbind)
# names vals
# 1 X1 0_times, 1-2_times, 3-10_times, 11-20_times
# 2 X2 ab, bb, cb, db, eb
# 3 X3 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times
# 4 X4 bar, buzz, fizz, foo, weee
# 5 X5 0_times, 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times
Or do something like this,
lapply(names(mydf), \(x) sprintf('$ %s <%s>: %s', x, class(mydf[[x]]), toString(makeLevels(mydf[[x]])))) |>
do.call(what=rbind)
# [,1]
# [1,] "$ X1 <character>: 0_times, 1-2_times, 3-10_times, 11-20_times"
# [2,] "$ X2 <character>: ab, bb, cb, db, eb"
# [3,] "$ X3 <character>: 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times"
# [4,] "$ X4 <character>: bar, buzz, fizz, foo, weee"
# [5,] "$ X5 <character>: 0_times, 1-2_times, 3-10_times, 11-20_times, 21-30_times, more_than_30_times"
which is similar to what str(mydf) does; you could actually lapply makeLevels over the columns and str the result, if you just want to read it.
str(lapply(mydf, makeLevels), vec.len=10L)
# List of 5
# $ X1: chr [1:4] "0_times" "1-2_times" "3-10_times" "11-20_times"
# $ X2: chr [1:5] "ab" "bb" "cb" "db" "eb"
# $ X3: chr [1:5] "1-2_times" "3-10_times" "11-20_times" "21-30_times" "more_than_30_times"
# $ X4: chr [1:5] "bar" "buzz" "fizz" "foo" "weee"
# $ X5: chr [1:6] "0_times" "1-2_times" "3-10_times" "11-20_times" "21-30_times" "more_than_30_times"
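On the error in the question: strsplit expects a character vector, whereas a column built with lapply is a list, which is why it reports a non-character argument. If you want to keep the list column from the question, a minimal sketch is to apply the function per column before storing the result. The data frame mydf_small is a made-up stand-in for mydf, and the helper is restated so the snippet runs on its own.

```r
# Restated helper: numeric-aware ordering of unique values.
makeLevels <- function(x) {
  if (any(grepl("\\d", x))) {
    key <- sapply(strsplit(x, "\\D+"), function(p) min(as.numeric(p[nzchar(p)])))
    unique(x[order(key)])
  } else {
    sort(unique(x))
  }
}

# Hypothetical small stand-in for mydf.
mydf_small <- data.frame(
  X1 = c("0_times", "3-10_times", "11-20_times", "1-2_times"),
  X2 = c("ab", "bb", "cb", "ab"),
  stringsAsFactors = FALSE
)

mydf2 <- data.frame(names = colnames(mydf_small))

# lapply returns a list, so vals is a list column with one
# character vector per row, each already sorted by makeLevels.
mydf2$vals <- lapply(mydf_small, makeLevels)

is.list(mydf2$vals)   # TRUE
mydf2$vals[[1]]       # sorted unique values of X1
```

Sorting each column inside lapply avoids calling strsplit on the list itself.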
...
CodePudding user response:
Perhaps just reset the order of mydf (i.e. by pasting a leading 0 onto the X1 values before ordering):
mydf <- mydf[order(ifelse(substr(mydf$X1,1,2)!=11, paste0("0",mydf$X1), mydf$X1)),]
Then run the same code as above for mydf2, and you will get this:
names vals
1 X1 0_times, 1-2_times, 3-10_times, 11-20_times
2 X2 ab, db, bb, eb, cb
Is that the expected output?
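For reference, here is a small sketch of the padded key on its own, using a made-up vector x1 with the four X1 values. The comparison is written against the string "11" rather than the number, which should behave the same way.

```r
x1 <- c("0_times", "3-10_times", "11-20_times", "1-2_times")

# Prefix "0" unless the value already starts with the two-digit "11".
key <- ifelse(substr(x1, 1, 2) != "11", paste0("0", x1), x1)

key              # "00_times" "03-10_times" "11-20_times" "01-2_times"
x1[order(key)]   # ordered by the padded strings
```

Note that this key is tailored to the values in X1. A value such as 21-30_times would be padded to 021-30_times and sort ahead of 03-10_times, so columns like X3 or X5 would need a more general key.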