R: avoid turning one-row data frames into a vector when using apply functions

I often have the problem that R converts my one-column data frames into character vectors, which I solve by using the drop=FALSE argument.

However, there are some cases where I do not know how to work around this kind of behavior in R, and this is one of them.

I have a data frame like the following:

mydf <- data.frame(ID=LETTERS[1:3], value1=paste(LETTERS[1:3], 1:3), value2=paste(rev(LETTERS)[1:3], 1:3))

that looks like:

> mydf
  ID value1 value2
1  A    A 1    Z 1
2  B    B 2    Y 2
3  C    C 3    X 3

The task here is to replace spaces with _ in every column except the first, and I want to use an apply-family function for this, sapply in this case.

I do the following:

new_df <- as.data.frame(sapply(mydf[,-1,drop=F], function(x) gsub("\\s+","_",x)))
new_df <- cbind(mydf[,1,drop=F], new_df)

The resulting data frame looks exactly how I want it:

> new_df
  ID value1 value2
1  A    A_1    Z_1
2  B    B_2    Y_2
3  C    C_3    X_3

My problem starts with some rare cases where my input has only one row of data. For some reason I never understood, R behaves completely differently in these cases, and no drop=FALSE option can save me here...

My input data frame now is:

mydf <- data.frame(ID=LETTERS[1], value1=paste(LETTERS[1], 1), value2=paste(rev(LETTERS)[1], 1))

which looks like:

> mydf
  ID value1 value2
1  A    A 1    Z 1

However, when I apply the same code, the resulting data frame looks hideous:

> new_df
       ID sapply(mydf[, -1, drop = F], function(x) gsub("\\\\s+", "_", x))
value1  A                                                              A_1
value2  A                                                              Z_1

How can I solve this so that the same line of code gives me the same kind of result for input data frames with any number of rows?

A deeper question would be: why on earth does R do this? I keep going back to my code whenever I get some new weird input with one row or column, because it breaks everything... Thanks!

CodePudding user response：

You can solve your problem by using lapply instead of sapply, and then combining the result with do.call as follows:

new_df <- as.data.frame(lapply(mydf[,-1,drop=F], function(x) gsub("\\s+","_",x)))
new_df <- do.call(cbind, new_df)
new_df
#     value1 value2
#[1,] "A_1"  "Z_1" 

new_df <- cbind(mydf[,1,drop=F], new_df)
#new_df
#  ID value1 value2
#1  A    A_1    Z_1

As for your question about the unpredictable behavior of sapply: the s in sapply stands for simplification, and the shape of the simplified result is not guaranteed. Depending on the input, it can be a list, a vector, or a matrix.

According to the documentation of sapply:

sapply is a user-friendly version and wrapper of lapply by default returning a vector, matrix or, if simplify = "array", an array if appropriate, by applying simplify2array().

On the simplify argument:

logical or character string; should the result be simplified to a vector, matrix or higher dimensional array if possible? For sapply it must be named and not abbreviated. The default value, TRUE, returns a vector or matrix if appropriate, whereas if simplify = "array" the result may be an array of “rank” (=length(dim(.))) one higher than the result of FUN(X[[i]]).

The Details section explains the behavior, which looks similar to what you experienced:

Simplification in sapply is only attempted if X has length greater than zero and if the return values from all elements of X are all of the same (positive) length. If the common length is one the result is a vector, and if greater than one is a matrix with a column corresponding to each element of X.
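To make the quoted rule concrete, here is a minimal sketch in base R. The names clean, three_rows, one_row, res3, res1 and res_list are only illustrative and do not come from the question. It suggests that with one row the common length is one, so sapply returns a named vector, and as.data.frame then turns that vector into a single column:

```r
# Same transformation applied to a three-row and a one-row input
clean <- function(x) gsub("\\s+", "_", x)

three_rows <- data.frame(value1 = paste(LETTERS[1:3], 1:3),
                         value2 = paste(rev(LETTERS)[1:3], 1:3))
one_row <- three_rows[1, , drop = FALSE]

res3 <- sapply(three_rows, clean)  # common length 3 -> character matrix
res1 <- sapply(one_row, clean)     # common length 1 -> named character vector

dim(res3)    # 3 2
dim(res1)    # NULL
names(res1)  # "value1" "value2"

# The named vector becomes two rows and one column,
# so a later cbind() has to recycle the single ID value
dim(as.data.frame(res1))  # 2 1

# lapply() always returns a list with one element per column,
# so the data frame keeps one row and two columns
res_list <- lapply(one_row, clean)
dim(as.data.frame(res_list))  # 1 2
```

That two-row, one-column shape is what shows up in the "hideous" output in the question.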

Hadley Wickham also recommends not using sapply:

I recommend that you avoid sapply() because it tries to simplify the result, so it can return a list, a vector, or a matrix. This makes it difficult to program with, and it should be avoided in non-interactive settings

He also recommends against using apply with a data frame. See Advanced R for further explanation.
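Building on the lapply idea, one shape-stable variant is to assign the list returned by lapply back into the selected columns, so the data frame never gets rebuilt. This is only a sketch: the helper name replace_spaces is hypothetical, and it assumes the first column is the ID and at least one other column exists.

```r
# Hypothetical helper (name is illustrative): replace whitespace runs with "_"
# in every column except the first, keeping the data frame shape intact
replace_spaces <- function(df) {
  # df[-1] selects all columns but the first; assigning a list of the same
  # length replaces those columns one by one, whatever the number of rows
  df[-1] <- lapply(df[-1], function(x) gsub("\\s+", "_", x))
  df
}

one_row <- data.frame(ID = "A", value1 = "A 1", value2 = "Z 1")
three_rows <- data.frame(ID = LETTERS[1:3],
                         value1 = paste(LETTERS[1:3], 1:3),
                         value2 = paste(rev(LETTERS)[1:3], 1:3))

out1 <- replace_spaces(one_row)     # 1 row, 3 columns
out3 <- replace_spaces(three_rows)  # 3 rows, 3 columns
```

Because nothing is simplified along the way, the same call should behave the same for one row or many.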

CodePudding user response：

You can also use the map_df function from the purrr package, which applies a function to each element of an object and returns a data frame:

library(dplyr)
library(purrr)

mydf %>%
  mutate(map_df(select(cur_data(), starts_with("value")), ~ gsub("\\s", "_", .x)))

  ID value1 value2
1  A    A_1    Z_1

And with the original data frame:

  ID value1 value2
1  A    A_1    Z_1
2  B    B_2    Y_2
3  C    C_3    X_3

CodePudding user response：

Here's a solution that replaces the original data. Not sure if this fits into your workflow, though. Notice that I used apply, which processes data frames by rows or columns.

mydf <- data.frame(ID=LETTERS[1], value1=paste(LETTERS[1], 1), value2=paste(rev(LETTERS)[1], 1))

xy <- apply(X = mydf[, -1, drop = FALSE],
            MARGIN = 2,
            FUN = function(x) gsub("\\s+", "_", x),
            simplify = FALSE  # keep a list, one element per column (needs R >= 4.1.0)
)
xy <- do.call(cbind, xy)  # bind the list elements into a character matrix
xy <- as.data.frame(xy)

mydf[, -1] <- xy
mydf

  ID value1 value2
1  A    A_1    Z_1