I had to clean a data frame that has about a million rows. As part of cleaning the data, I wanted to remove any leading or trailing whitespace in the data frame. I ended up using the trimws() function. Here is the code:
trimmed_merged_data <- merged_data %>% select(everything()) %>% trimws(which = "both")
I had two issues with this code. First, it took several minutes to finish, whereas other functions I had run earlier completed within a few tens of seconds. Second, and shockingly, the result I got was a single list of characters! I ran the code on a data frame with 1 million rows across 13 columns, but I got back just a single row! I am unable to wrap my mind around it.
So, can anyone help me identify what the issue is? Also, will it always take this long to trim the values in a data frame? If so, what else should I do or use to reduce the time?
CodePudding user response:
In base R only, define a function and lapply trimws to each column of the input data.frame. It's not much slower than the dplyr solution of akrun.
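# Apply trimws() to every column; extra arguments in ... are passed on to trimws()
trimws_df <- function(x, ...) {
  # x[] <- keeps x a data.frame instead of turning it into a plain list
  x[] <- lapply(x, trimws, ...)
  x
}
trimmed_merged_data <- trimws_df(merged_data)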
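One thing to watch with the function above: trimws returns character, so numeric columns come back as character too. If that matters, a minimal sketch of a variant that only touches character columns could look like the following. The name trimws_chr_df and the toy data are made up for illustration and are not from the question.

```r
# Hypothetical helper: trim only the character columns of a data frame
trimws_chr_df <- function(x, ...) {
  # TRUE for each column that is a character vector
  is_chr <- vapply(x, is.character, logical(1))
  # Replace just those columns; the other columns keep their original type
  x[is_chr] <- lapply(x[is_chr], trimws, ...)
  x
}

# Toy data standing in for merged_data (assumed, not the asker's data)
toy <- data.frame(
  id   = 1:3,
  name = c("  a", "b  ", " c "),
  stringsAsFactors = FALSE
)

trimmed_toy <- trimws_chr_df(toy, which = "both")
str(trimmed_toy)
```

Here id stays an integer column and only name is trimmed.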
CodePudding user response:
trimws expects a vector. According to ?trimws:
x - a character vector
Here, we may need across to loop across the columns and apply trimws individually on each column.
library(dplyr)
# Lambda form: passing extra arguments through across() is deprecated in dplyr >= 1.1.0
trimmed_merged_data <- merged_data %>%
  mutate(across(everything(), ~ trimws(.x, which = "both")))
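To see where the "single row" in the question comes from, here is a small base R sketch on made-up toy data. As far as I can tell, trimws hands its input to sub(), which coerces a non-character object with as.character(); for a data frame that gives one long deparsed string per column. That would also explain the run time on a million rows.

```r
# Toy data standing in for merged_data (assumed, not the asker's data)
toy <- data.frame(
  name = c("  a", "b  "),
  city = c(" x ", "y"),
  stringsAsFactors = FALSE
)

# Passing the whole data frame: the result is a plain character vector
# with one element per column, not a data frame
whole <- trimws(toy, which = "both")
length(whole)
class(whole)

# Applying trimws() column by column keeps the data frame shape
toy[] <- lapply(toy, trimws, which = "both")
dim(toy)
```

With 13 columns you would get a character vector of length 13, which matches the single "row" described in the question.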
CodePudding user response:
Maybe the stringr::str_trim() function can help you. It works on a single string or on a character vector, so it still has to be applied column by column.
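A rough sketch of how that could look, assuming the stringr package is installed. The toy data frame is made up for illustration, and I have not timed this against trimws on data of the size in the question.

```r
library(stringr)

# Toy data standing in for merged_data (assumed, not the asker's data)
toy <- data.frame(
  name = c("  a", "b  "),
  city = c(" x ", "y"),
  stringsAsFactors = FALSE
)

# str_trim() works on a vector, so apply it to each column in turn;
# side = "both" removes leading and trailing whitespace
toy[] <- lapply(toy, str_trim, side = "both")
toy
```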
Greetings