str_split on strings with quotes and commas

Time:10-08

I'm working with some messy categorical data that I'm trying to split into separate cells. The values are separated by commas, but in some rows there are commas within the categories themselves (such as categories 4 and 5 in the example below). In those cases the category is surrounded by quotes.

category 1, category 2, category 3, "category, 4"
category 1, category 2, "category, 4", "category, 5"

The desired output would split the data into separate cells, as in the table below:

category 1 | category 2 | category 3  | category, 4
category 1 | category 2 | category, 4 | category, 5

I've tried using str_split, but I'm unsure how to separate the categories without splitting the quoted answers themselves.

CodePudding user response:

data.table's fread does all this automatically for you:

# `string` is assumed to hold the raw text, one record per line
dt <- data.table::fread(text = string, header = FALSE)

CodePudding user response:

I suggest simply re-importing this data with read.csv. Its default quoting character is ", so it ignores the commas inside double-quoted fields. Assuming the raw strings already sit in a data frame in R (here, a single-column df), you could try:

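# write.csv ignores col.names and quotes strings, so write the raw text unquoted with write.table
write.table(df, file="output.csv", quote=FALSE, row.names=FALSE, col.names=FALSE)
df_new <- read.csv(file="output.csv", header=FALSE, strip.white=TRUE)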
names(df_new) <- c("category 1", "category 2", "category 3", "category 4")
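
A minimal self-contained sketch of this round trip, assuming a hypothetical single-column data frame df whose column raw holds the strings. It uses writeLines and a temporary file so the text reaches the file exactly as-is; the names df, raw and tmp are illustrative only:

```r
# Hypothetical single-column data frame holding the raw strings
df <- data.frame(raw = c('category 1, category 2, category 3, "category, 4"',
                         'category 1, category 2, "category, 4", "category, 5"'))

# Write the strings unchanged to a temporary file, one record per line
tmp <- tempfile(fileext = ".csv")
writeLines(as.character(df$raw), tmp)

# Re-read as CSV: quoted fields keep their commas, and strip.white
# trims the space that follows each separating comma
df_new <- read.csv(tmp, header = FALSE, strip.white = TRUE)
unlink(tmp)

names(df_new) <- c("category 1", "category 2", "category 3", "category 4")
```

Because the data frame is assigned names containing spaces afterwards, columns need backticks or [[ ]] when accessed, for example df_new[["category 4"]].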

CodePudding user response:

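We may also use read.csv with textConnection. Note that the first line is read as the header here; pass header = FALSE to keep it as a data row: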

df <- read.csv(textConnection('category 1, category 2, category 3, "category, 4"
          category 1, category 2, "category, 4", "category, 5"'), 
     check.names = FALSE, strip.white = TRUE)

The resulting structure:

> str(df)
'data.frame':   1 obs. of  4 variables:
 $ category 1 : chr "category 1"
 $ category 2 : chr "category 2"
 $ category 3 : chr "category, 4"
 $ category, 4: chr "category, 5"
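
If both lines should end up as data rows, a minimal sketch (assuming the raw strings sit in a character vector, here given the hypothetical name x) is to pass them through the text argument of read.csv with header = FALSE:

```r
# Hypothetical input: one raw record per element
x <- c('category 1, category 2, category 3, "category, 4"',
       'category 1, category 2, "category, 4", "category, 5"')

# text = reads directly from the character vector;
# header = FALSE keeps the first record as data;
# strip.white = TRUE trims the space after each separating comma
df <- read.csv(text = x, header = FALSE, strip.white = TRUE)
str(df)
```

With header = FALSE the columns get R's default names (V1, V2, ...), which can be replaced afterwards with names(df) <- ....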