I often have the problem that R converts my one-column data frames into character vectors, which I solve by using the drop=FALSE option.
However, there are some cases where I do not know how to work around this kind of behavior in R, and this is one of them.
I have a data frame like the following:
mydf <- data.frame(ID=LETTERS[1:3], value1=paste(LETTERS[1:3], 1:3), value2=paste(rev(LETTERS)[1:3], 1:3))
that looks like:
> mydf
  ID value1 value2
1  A    A 1    Z 1
2  B    B 2    Y 2
3  C    C 3    X 3
The task is to replace spaces with _ in every column except the first, and I want to use an apply-family function for this, sapply in this case.
I do the following:
new_df <- as.data.frame(sapply(mydf[,-1,drop=F], function(x) gsub("\\s+","_",x)))
new_df <- cbind(mydf[,1,drop=F], new_df)
The resulting data frame looks exactly how I want it:
> new_df
  ID value1 value2
1  A    A_1    Z_1
2  B    B_2    Y_2
3  C    C_3    X_3
My problem starts with some rare cases where my input has only one row of data. For some reason I have never understood, R behaves completely differently in these cases, and no drop=FALSE option can save me here...
My input data frame now is:
mydf <- data.frame(ID=LETTERS[1], value1=paste(LETTERS[1], 1), value2=paste(rev(LETTERS)[1], 1))
which looks like:
> mydf
  ID value1 value2
1  A    A 1    Z 1
However, when I apply the same code, the resulting data frame looks hideous:
> new_df
       ID sapply(mydf[, -1, drop = F], function(x) gsub("\\\\s+", "_", x))
value1  A                                                              A_1
value2  A                                                              Z_1
How can I solve this so that the same line of code gives me the same kind of result for input data frames with any number of rows?
A deeper question would be: why on earth does R do this? I keep going back to my code whenever I get some new weird input with one row or one column, because it breaks everything... Thanks!
CodePudding user response:
You can solve your problem by using lapply instead of sapply, and then combining the result using do.call as follows:
new_df <- as.data.frame(lapply(mydf[,-1,drop=F], function(x) gsub("\\s+","_",x)))
new_df <- do.call(cbind, new_df)
new_df
# value1 value2
#[1,] "A_1" "Z_1"
new_df <- cbind(mydf[,1,drop=F], new_df)
#new_df
# ID value1 value2
#1 A A_1 Z_1
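Because lapply always returns a list with one element per column, the same idea can be written as a small function that assigns the result straight back into the data frame. This is only a sketch, not the answer's exact code: replace_spaces is a hypothetical name, and it assumes the first column is the ID column that should be left untouched.

```r
# Hypothetical helper (not from the original post): replaces runs of
# whitespace with "_" in every column except the first.
replace_spaces <- function(df) {
  # df[-1] uses list-style indexing, so it stays a data frame, and
  # lapply() returns a list regardless of how many rows there are.
  df[-1] <- lapply(df[-1], function(x) gsub("\\s+", "_", x))
  df
}

three_rows <- data.frame(ID = LETTERS[1:3],
                         value1 = paste(LETTERS[1:3], 1:3),
                         value2 = paste(rev(LETTERS)[1:3], 1:3))
one_row <- three_rows[1, , drop = FALSE]

replace_spaces(three_rows)  # 3 rows, 3 columns
replace_spaces(one_row)     # 1 row, 3 columns
```

Both calls return a data frame with the same three columns, so no special case is needed for single-row input.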
As for your question about the unpredictable behavior of sapply: the s in sapply stands for simplification, but the simplified result is never guaranteed to be a data frame. It can be a list, a vector, or a matrix.
According to the documentation of sapply:
sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array().
On the simplify argument:
logical or character string; should the result be simplified to a vector, matrix or higher dimensional array if possible? For sapply it must be named and not abbreviated. The default value, TRUE, returns a vector or matrix if appropriate, whereas if simplify = "array" the result may be an array of “rank” (=length(dim(.))) one higher than the result of FUN(X[[i]]).
The Details section explains the behavior, which looks similar to what you experienced:
Simplification in sapply is only attempted if X has length greater than zero and if the return values from all elements of X are all of the same (positive) length. If the common length is one the result is a vector, and if greater than one is a matrix with a column corresponding to each element of X.
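The rule quoted above can be checked directly. The sketch below uses data shaped like the question's; underscore is a hypothetical helper name introduced only for this illustration.

```r
# Hypothetical helper: replaces runs of whitespace with "_"
underscore <- function(x) gsub("\\s+", "_", x)

three_rows <- data.frame(value1 = paste(LETTERS[1:3], 1:3),
                         value2 = paste(rev(LETTERS)[1:3], 1:3))
one_row <- three_rows[1, , drop = FALSE]

res3 <- sapply(three_rows, underscore)  # every element has length 3
res1 <- sapply(one_row, underscore)     # every element has length 1

is.matrix(res3)  # TRUE: common length > 1 gives a matrix, one column per element
dim(res3)        # 3 2
is.matrix(res1)  # FALSE: common length 1 gives a named vector
names(res1)      # "value1" "value2"

# as.data.frame() turns the named vector into a single column with two rows,
# which is the odd shape shown in the question.
dim(as.data.frame(res1))  # 2 1
```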
Hadley Wickham also recommends avoiding sapply:
I recommend that you avoid sapply() because it tries to simplify the result, so it can return a list, a vector, or a matrix. This makes it difficult to program with, and it should be avoided in non-interactive settings
He also recommends not using apply with a data frame. See Advanced R for further explanation.
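One reason apply is awkward with data frames is that it converts its input to a matrix first, so mixed column types get coerced to a common type. A minimal illustration, where mixed is a made-up data frame and not part of the question:

```r
# A made-up data frame with one integer and one character column
mixed <- data.frame(n = 1:2, s = c("a", "b"))

# apply() calls as.matrix() on the data frame, which coerces every
# column to character because one column is character.
apply(mixed, 2, class)
#           n           s
# "character" "character"

# lapply() works column by column on the list, so types are preserved.
unlist(lapply(mixed, class))
#           n           s
#   "integer" "character"
```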
CodePudding user response:
You can also use the map_df function from the purrr package, which applies a function to each element of an object and returns a data frame:
library(dplyr)
library(purrr)
mydf %>%
mutate(map_df(select(cur_data(), starts_with("value")), ~ gsub("\\s", "_", .x)))
  ID value1 value2
1  A    A_1    Z_1
And with the original data frame:
  ID value1 value2
1  A    A_1    Z_1
2  B    B_2    Y_2
3  C    C_3    X_3
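A variant that needs only dplyr is sketched below. It assumes dplyr 1.0.0 or later, which provides across(); newer dplyr releases also deprecate cur_data(), so this form may produce fewer warnings. clean_values is a hypothetical name, not something from the original answer.

```r
library(dplyr)

# Same shape as the question's data
mydf <- data.frame(ID = LETTERS[1:3],
                   value1 = paste(LETTERS[1:3], 1:3),
                   value2 = paste(rev(LETTERS)[1:3], 1:3))

# Hypothetical helper: across() applies the function to each selected
# column, and mutate() always returns a data frame.
clean_values <- function(df) {
  mutate(df, across(starts_with("value"), ~ gsub("\\s+", "_", .x)))
}

clean_values(mydf)                      # three-row input
clean_values(mydf[1, , drop = FALSE])   # one-row input, same column layout
```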
CodePudding user response:
Here's a solution that replaces the original data. Not sure if this fits into your workflow, though. Notice that I used apply, which processes data frames by rows or columns. The simplify argument of apply requires R 4.1.0 or later.
mydf <- data.frame(ID=LETTERS[1], value1=paste(LETTERS[1], 1), value2=paste(rev(LETTERS)[1], 1))
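# Process each column separately; simplify = FALSE keeps the result a list
xy <- apply(X = mydf[, -1, drop = FALSE],
            MARGIN = 2,
            FUN = function(x) gsub("\\s+", "_", x),
            simplify = FALSE
)
# Bind the list elements as columns, then convert back to a data frame
xy <- do.call(cbind, xy)
xy <- as.data.frame(xy)
mydf[, -1] <- xy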
mydf
  ID value1 value2
1  A    A_1    Z_1