Trimming entries in a data frame using trimws() resulted in something unexpected

Time:11-23

I had to clean a data frame with about a million rows. As part of the cleaning, I wanted to remove any leading or trailing whitespace from the values in the data frame, so I used the trimws() function. Here is the code:

trimmed_merged_data <- merged_data %>% select(everything()) %>% trimws(which = "both")

I had two issues with this code. First, it took several minutes to finish, whereas other functions I had run earlier on the same data completed within a few tens of seconds. Second, and more surprisingly, the result was a single list of character strings! I ran the code on a data frame with 1 million rows across 13 columns, yet I got back just a single row. I can't wrap my mind around it.

Can anyone help me identify what the issue is? Also, will it always take this long to trim the values in a data frame? If so, what else should I do or use to reduce the time?

CodePudding user response:

Using base R only, define a function that applies trimws to each column of the input data.frame with lapply. It's not much slower than akrun's dplyr solution.

trimws_df <- function(x, ...){
  x[] <- lapply(x, trimws, ...)
  x
}

trimmed_merged_data <- trimws_df(merged_data)
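
One thing to watch for: trimws() returns character, so applying it to every column turns numeric columns into character columns as well. If that is not wanted, a possible variant is to trim only the character columns. This is just a sketch; the toy data frame and the name trim_chr_cols are made up for illustration and are not part of the original answer.

```r
# Toy stand-in for merged_data (hypothetical columns).
df <- data.frame(
  id   = c(1, 2, 3),
  name = c("  Ann", "Bob  ", " Cy "),
  stringsAsFactors = FALSE
)

# Trim only the character columns so that other column types are preserved.
trim_chr_cols <- function(x, ...) {
  is_chr <- vapply(x, is.character, logical(1))  # TRUE for character columns
  x[is_chr] <- lapply(x[is_chr], trimws, ...)    # trim those columns in place
  x
}

trimmed <- trim_chr_cols(df, which = "both")
str(trimmed)  # id stays numeric, name is trimmed
```

Skipping the non-character columns should also save a little work on a wide data frame, although I have not benchmarked it on a million rows.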

CodePudding user response:

trimws expects a vector, not a data frame. According to ?trimws:

x - a character vector
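
To see what happens when a whole data frame is passed instead, here is a small illustration on a made-up two-column data frame (the column names are hypothetical). As far as I can tell, the data frame gets coerced with as.character(), which collapses each column into one long string, so the result has one element per column rather than one per row. Building those huge strings is probably also why the call took minutes.

```r
# Toy data frame standing in for merged_data.
df <- data.frame(
  a = c(" x ", "y "),
  b = c(" 1", "2 "),
  stringsAsFactors = FALSE
)

# Passing the whole data frame: each column collapses into a single string.
whole <- trimws(df)
is.data.frame(whole)  # FALSE
length(whole)         # one element per column, not per row
as.character(df)      # shows the coercion that produces those strings

# Applying trimws() column by column keeps the data frame shape.
per_col <- df
per_col[] <- lapply(df, trimws)
str(per_col)
```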

Here, we may need across to loop over the columns and apply trimws to each column individually:

library(dplyr)
trimmed_merged_data <- merged_data %>% 
      mutate(across(everything(),  trimws, which = "both"))
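
A possible refinement, assuming dplyr 1.0.0 or later is installed: restrict the selection to character columns with where(is.character), so numeric columns are not converted to character, and wrap trimws in an anonymous function, since dplyr 1.1.0 and later deprecate passing extra arguments through across(). The toy data frame below is made up for illustration.

```r
library(dplyr)

# Toy stand-in for merged_data (hypothetical columns).
df <- data.frame(
  id   = c(1, 2),
  name = c("  Ann", "Bob  "),
  stringsAsFactors = FALSE
)

# Trim only the character columns; arguments go inside the function,
# not through across()'s dots.
trimmed <- df %>%
  mutate(across(where(is.character), function(x) trimws(x, which = "both")))

str(trimmed)  # id stays numeric, name is trimmed
```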

CodePudding user response:

Maybe the stringr::str_trim() function can help. It works on a single string or a character vector.
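
A minimal sketch of how that could look, assuming the stringr package is installed. The vector and the column name are invented for the example. Like trimws, str_trim works on one vector at a time, so on a data frame it would be applied column by column.

```r
library(stringr)

x <- c("  left", "right  ", "  both  ")

str_trim(x)                 # side = "both" is the default
str_trim(x, side = "left")  # remove leading whitespace only

# On a data frame, apply it to one column at a time.
df <- data.frame(name = x, stringsAsFactors = FALSE)
df$name <- str_trim(df$name)
```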

Greetings

Tags: r