Split strings into separate rows excluding some pattern matches

Time:11-10

I have a data.frame in which I would like to split the IV column into separate rows at every comma ",", except for commas that sit between parentheses, e.g. ",text (string, string, string),", which should stay together as a single piece.

Example of the current data:

df1 <- structure(list(Article.Title = "Random title", 
    Sample = "Sample information", 
    IV = "Union voice, HRM practices (participation, teams, incentives, development, recruitment), implict contracts, Crisis impact, dominant individual or family owner, no dominant individual or family owner, market growth, no market growth,", 
    Moderator = NA_character_, Mediator = NA_character_, DV = "Performance"), row.names = c(NA, 
-1L), class = c("tbl_df", "tbl", "data.frame"))

Expected result:

structure(list(Article.Title = c("Random title", "Random title", 
"Random title", "Random title", "Random title", "Random title", 
"Random title", "Random title"), Sample = c("Sample information", 
"Sample information", "Sample information", "Sample information", 
"Sample information", "Sample information", "Sample information", 
"Sample information"), IV = c("Union voice", "HRM practices (participation, teams, incentives, development, recruitment)", 
"implict contracts", "Crisis impact", "dominant individual or family owner", 
"no dominant individual or family owner", "market growth", "no market growth"
), Moderator = c("NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA"
), Mediator = c("NA", "NA", "NA", "NA", "NA", "NA", "NA", "NA"
), DV = c("Performance", "Performance", "Performance", "Performance", 
"Performance", "Performance", "Performance", "Performance")), class = "data.frame", row.names = c(NA, 
-8L))

Answer:

We could do this in base R with strsplit: split the 'IV' column at each , (plus any whitespace after it) while using (*SKIP)(*FAIL) to skip the characters inside the parentheses, and then replicate the rows of the data by the lengths of the list created with strsplit

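# Split at ",\s*" but skip every "(...)" run, so commas inside parentheses are kept
lst1 <- strsplit(df1$IV, "\\([^)]+(*SKIP)(*FAIL)|,\\s*", perl = TRUE)
# Repeat each row once per piece, then add the pieces back as 'IV' in the original column order
df2 <- transform(df1[setdiff(names(df1), "IV")][rep(seq_len(nrow(df1)), 
        lengths(lst1)),], IV = unlist(lst1))[names(df1)]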

-output

> df2
  Article.Title             Sample                                                                         IV Moderator Mediator          DV
1  Random title Sample information                                                                Union voice      <NA>     <NA> Performance
2  Random title Sample information HRM practices (participation, teams, incentives, development, recruitment)      <NA>     <NA> Performance
3  Random title Sample information                                                          implict contracts      <NA>     <NA> Performance
4  Random title Sample information                                                              Crisis impact      <NA>     <NA> Performance
5  Random title Sample information                                        dominant individual or family owner      <NA>     <NA> Performance
6  Random title Sample information                                     no dominant individual or family owner      <NA>     <NA> Performance
7  Random title Sample information                                                              market growth      <NA>     <NA> Performance
8  Random title Sample information                                                           no market growth      <NA>     <NA> Performance
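To make the split and replicate steps easier to follow, here is a minimal sketch on a made-up two-row data frame (`toy`, `pieces`, `idx` and `out` are hypothetical names, not part of the question's data). In the pattern, `\\([^)]+` consumes an opening parenthesis and everything up to the closing one, and `(*SKIP)(*FAIL)` then throws that match away without retrying inside it, so only commas outside parentheses remain for the `,\\s*` alternative.

```r
# Hypothetical two-row data frame (not the question's data)
toy <- data.frame(id = c(1, 2),
                  IV = c("a, b (c, d), e,", "f, g"),
                  stringsAsFactors = FALSE)

# Step 1: split at commas outside parentheses.
# perl = TRUE is needed because (*SKIP) and (*FAIL) are PCRE verbs.
pieces <- strsplit(toy$IV, "\\([^)]+(*SKIP)(*FAIL)|,\\s*", perl = TRUE)
print(pieces)

# Step 2: repeat each row index once per piece that row produced
idx <- rep(seq_len(nrow(toy)), lengths(pieces))
print(idx)

# Step 3: expand the other columns with idx and attach the flattened pieces
out <- data.frame(id = toy$id[idx], IV = unlist(pieces),
                  stringsAsFactors = FALSE)
print(out)
```

The first row yields three pieces ("a", "b (c, d)", "e") and the second yields two, so `idx` is 1 1 1 2 2 and `out` has five rows. strsplit does not return an empty piece for the trailing comma, which is why the base R result above has 8 rows.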

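Or use the same regex in separate_rows from tidyr (as in the comments). Unlike strsplit, it keeps the empty piece after the trailing comma, so the result has 9 rows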

library(tidyr)
separate_rows(df1, IV, sep = "\\([^)]+(*SKIP)(*FAIL)|,\\s*")

-output

# A tibble: 9 × 6
  Article.Title Sample             IV                                                                           Moderator Mediator DV         
  <chr>         <chr>              <chr>                                                                        <chr>     <chr>    <chr>      
1 Random title  Sample information "Union voice"                                                                <NA>      <NA>     Performance
2 Random title  Sample information "HRM practices (participation, teams, incentives, development, recruitment)" <NA>      <NA>     Performance
3 Random title  Sample information "implict contracts"                                                          <NA>      <NA>     Performance
4 Random title  Sample information "Crisis impact"                                                              <NA>      <NA>     Performance
5 Random title  Sample information "dominant individual or family owner"                                        <NA>      <NA>     Performance
6 Random title  Sample information "no dominant individual or family owner"                                     <NA>      <NA>     Performance
7 Random title  Sample information "market growth"                                                              <NA>      <NA>     Performance
8 Random title  Sample information "no market growth"                                                           <NA>      <NA>     Performance
9 Random title  Sample information ""                                                                           <NA>      <NA>     Performance
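The empty 9th row comes from the trailing comma in 'IV'. One possible way to avoid it, sketched here in base R on a hypothetical stand-in (`toy` and `pieces` are made-up names), is to strip a trailing comma before splitting, or to drop empty pieces afterwards.

```r
# Hypothetical one-row stand-in for df1, with a trailing comma in IV
toy <- data.frame(id = 1, IV = "a, b (c, d), e,", stringsAsFactors = FALSE)

# Option 1: remove a trailing comma (and surrounding whitespace) before splitting
toy$IV <- sub("\\s*,\\s*$", "", toy$IV)
print(toy$IV)

# Option 2: drop empty strings after splitting.
# 'pieces' imitates the output of a splitter that keeps the trailing empty piece.
pieces <- c("a", "b (c, d)", "e", "")
pieces <- pieces[nzchar(pieces)]
print(pieces)
```

With the question's data, the first option would mean cleaning `df1$IV` the same way before calling separate_rows, and the second would mean filtering out rows where `IV` is "" afterwards. Both are assumptions about what the asker wants, since the expected result contains no empty row.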