Conditional string concatenation in same column in R

I am new to R and have a very large irregular column in a data frame like this:

x <- data.frame(section = c("BOOK I: Introduction", "Page one: presentation", "Page two: acknowledgments", "MAGAZINE II: Considerations", "Page one: characters", "Page two: index", "BOOK III: General Principles", "BOOK III: General Principles", "Page one: invitation"))

section
BOOK I: Introduction
Page one: presentation
Page two: acknowledgments
MAGAZINE II: Considerations 
Page one: characters
Page two: index
BOOK III: General Principles
BOOK III: General Principles
Page one: invitation

I need to concatenate this column to look like this:

section
BOOK I: Introduction 
BOOK I: Introduction / Page one: presentation
BOOK I: Introduction / Page two: acknowledgments
MAGAZINE II: Considerations
MAGAZINE II: Considerations / Page one: characters
MAGAZINE II: Considerations / Page two: index
BOOK III: General Principles
BOOK III: General Principles
BOOK III: General Principles / Page one: invitation

Basically, the goal is to pick up the value of the title row above based on a condition and then concatenate it with each row below it, updating that value (perhaps with a regular expression) whenever a new title appears, but I really don't know how to do it.

Thanks in advance.

CodePudding user response:

Here is one method:

x <- data.frame(section = c("BOOK I: Introduction", "Page one: presentation", "Page two: acknowledgments", "MAGAZINE II: Considerations", "Page one: characters", "Page two: index", "BOOK III: General Principles", "BOOK III: General Principles", "Page one: invitation"))

x <- dplyr::mutate(x,
  isSection = stringr::str_starts(section, "Page", negate = TRUE),
  sectionNum = cumsum(isSection)
) |> 
  dplyr::group_by(sectionNum) |> 
  dplyr::mutate(newSection = dplyr::if_else(
    condition = isSection, 
    true = section, 
    false = paste(dplyr::first(section), section, sep = " / ")
  )) |>
  dplyr::ungroup()

x
#> # A tibble: 9 × 4
#>   section                      isSection sectionNum newSection                  
#>   <chr>                        <lgl>          <int> <chr>                       
#> 1 BOOK I: Introduction         TRUE               1 BOOK I: Introduction        
#> 2 Page one: presentation       FALSE              1 BOOK I: Introduction / Page…
#> 3 Page two: acknowledgments    FALSE              1 BOOK I: Introduction / Page…
#> 4 MAGAZINE II: Considerations  TRUE               2 MAGAZINE II: Considerations 
#> 5 Page one: characters         FALSE              2 MAGAZINE II: Considerations…
#> 6 Page two: index              FALSE              2 MAGAZINE II: Considerations…
#> 7 BOOK III: General Principles TRUE               3 BOOK III: General Principles
#> 8 BOOK III: General Principles TRUE               4 BOOK III: General Principles
#> 9 Page one: invitation         FALSE              4 BOOK III: General Principle…

Created on 2022-03-25 by the reprex package (v2.0.1)

Here, we first determine if the section is a section title or a page title and save that as TRUE or FALSE.

Then, we label the pages belonging to each section using cumsum() (cumulative sum). When logical values are summed, each TRUE (here, a section title) counts as 1 and increments the cumulative sum, while each FALSE (here, a page) counts as 0 and leaves it unchanged, so all of the pages within a section receive the same value as their title.

Lastly, we make a new section variable, this time using group_by() and if_else() to conditionally set the value. If isSection is TRUE, we just keep the existing value of section (the section title). If isSection is FALSE, we concatenate the first value of section from the group with the existing value of section, separated by " / ".
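To see the cumsum() step in isolation, here is a minimal base R sketch on the question's data. The names is_title, group_id, title and new_section are only for illustration and do not appear in the answer above; ave() stands in for the group_by() step.

```r
section <- c("BOOK I: Introduction", "Page one: presentation",
             "Page two: acknowledgments", "MAGAZINE II: Considerations",
             "Page one: characters", "Page two: index",
             "BOOK III: General Principles", "BOOK III: General Principles",
             "Page one: invitation")

# TRUE for title rows, FALSE for page rows
is_title <- !startsWith(section, "Page")

# each TRUE bumps the running total, so pages inherit their title's id
group_id <- cumsum(is_title)
group_id
#> [1] 1 1 1 2 2 2 3 4 4

# the first element of each group is its title; repeat it along the group
title <- ave(section, group_id, FUN = function(s) rep(s[1], length(s)))

# titles stay as they are, pages get their title as a prefix
new_section <- ifelse(is_title, section, paste(title, section, sep = " / "))
```

Note that the two identical "BOOK III" rows get different ids (3 and 4), which is why the repeated title is not merged into one group.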

CodePudding user response:

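Using data.table:

library(data.table)
library(magrittr)  # provides the %>% pipe used below

# copy the title rows into a new header column; page rows get NA
setDT(x)[!grepl("^Page", section), header := section] %>% 
  # carry the last title forward over the NA page rows
  .[, header := zoo::na.locf(header)] %>% 
  # prefix page rows with their title; title rows are left alone
  .[section != header, header := paste0(header, " / ", section)] %>% 
  .[, .(section = header)] %>% 
  .[]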

1:                                BOOK I: Introduction
2:       BOOK I: Introduction / Page one: presentation
3:    BOOK I: Introduction / Page two: acknowledgments
4:                         MAGAZINE II: Considerations
5:  MAGAZINE II: Considerations / Page one: characters
6:       MAGAZINE II: Considerations / Page two: index
7:                        BOOK III: General Principles
8:                        BOOK III: General Principles
9: BOOK III: General Principles / Page one: invitation
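The zoo::na.locf() call above fills each NA in header with the most recent non-missing value ("last observation carried forward"). A minimal base R sketch of the same idea, assuming the first element is never NA; the helper name carry_forward is hypothetical and is not part of zoo or data.table:

```r
# hypothetical helper: replace each NA with the last non-NA value before it
# (assumes v[1] is not NA)
carry_forward <- function(v) {
  filled <- !is.na(v)
  # cumsum(filled) gives, for every position, which non-NA value
  # (1st, 2nd, ...) was seen most recently
  v[filled][cumsum(filled)]
}

header <- c("BOOK I: Introduction", NA, NA,
            "MAGAZINE II: Considerations", NA)
filled_header <- carry_forward(header)
```

Here filled_header has the same length as header, with the two NAs after the first title replaced by "BOOK I: Introduction" and the last NA replaced by "MAGAZINE II: Considerations".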

CodePudding user response:

A rolling join could achieve this. In data.table:


library( data.table )

# convert x to a data.table by reference; the := call below requires one
setDT( x )

# add a row-number column to join on
x[ , row := .I ]

# pick out just the title rows. It looks like these start with either "BOOK" or "MAGAZINE"
books_magazines <- x[ grepl("^BOOK|^MAGAZINE", section),
                      .(row, book_magazine = section) ]

# join the 2 tables, using a rolling join to add the title row to subsequent rows
both_cols <- books_magazines[ x, on = .(row), roll = TRUE ]

# concatenate the 2 columns together where necessary, leave it alone if it's the title row
result <- both_cols[ , .(
    section_string = fifelse( book_magazine == section,
                              book_magazine,
                              sprintf("%s / %s", book_magazine, section) )
) ]

This gives:

> result$section_string

[1] "BOOK I: Introduction"                               
[2] "BOOK I: Introduction / Page one: presentation"      
[3] "BOOK I: Introduction / Page two: acknowledgments"   
[4] "MAGAZINE II: Considerations"                        
[5] "MAGAZINE II: Considerations / Page one: characters" 
[6] "MAGAZINE II: Considerations / Page two: index"      
[7] "BOOK III: General Principles"                       
[8] "BOOK III: General Principles"                       
[9] "BOOK III: General Principles / Page one: invitation"
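With roll = TRUE, each row of x is matched to the title row with the largest row number at or before it. As a rough base R illustration of that lookup (not how data.table implements it), findInterval() can locate the same title for every row. This sketch assumes the first row is a title; the names title_rows, title_idx and is_title are illustrative only.

```r
section <- c("BOOK I: Introduction", "Page one: presentation",
             "Page two: acknowledgments", "MAGAZINE II: Considerations",
             "Page one: characters", "Page two: index",
             "BOOK III: General Principles", "BOOK III: General Principles",
             "Page one: invitation")

# row numbers of the title rows (assumed to start with BOOK or MAGAZINE)
title_rows <- grep("^BOOK|^MAGAZINE", section)

# for each row, which title row (1st, 2nd, ...) is the last one at or before it
title_idx <- findInterval(seq_along(section), title_rows)
book_magazine <- section[title_rows[title_idx]]

# title rows stay unchanged; other rows get their title as a prefix
is_title <- seq_along(section) %in% title_rows
result <- ifelse(is_title, section,
                 sprintf("%s / %s", book_magazine, section))
```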

CodePudding user response:

You can do:

unlist(lapply(split(x$section, cumsum(grepl('^[A-Z]{3}', x$section))), 
              function(y) {
                  if(length(y) == 1) return(y)
                  else c(y[1], paste(y[1], y[-1], sep = " / "))
                }), use.names = FALSE)
#> [1] "BOOK I: Introduction"                               
#> [2] "BOOK I: Introduction / Page one: presentation"      
#> [3] "BOOK I: Introduction / Page two: acknowledgments"   
#> [4] "MAGAZINE II: Considerations"                        
#> [5] "MAGAZINE II: Considerations / Page one: characters" 
#> [6] "MAGAZINE II: Considerations / Page two: index"      
#> [7] "BOOK III: General Principles"                       
#> [8] "BOOK III: General Principles"                       
#> [9] "BOOK III: General Principles / Page one: invitation"

CodePudding user response:

A slightly simpler data.table approach:

library(data.table)
setDT(x)

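# start a new group at every title row (one beginning with BOOK or MAGAZINE)
x[, g := cumsum(grepl('^(BOOK|MAGAZINE)', section))]
# keep the first row of each group; prefix the remaining rows with it
x[, section := ifelse(seq_along(section) == 1,
    section, paste(section[1], section, sep = ' / ')), by = .(g)]
# drop the helper column
x[, g := NULL]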

The output is:

> x
                                               section
1:                                BOOK I: Introduction
2:       BOOK I: Introduction / Page one: presentation
3:    BOOK I: Introduction / Page two: acknowledgments
4:                         MAGAZINE II: Considerations
5:  MAGAZINE II: Considerations / Page one: characters
6:       MAGAZINE II: Considerations / Page two: index
7:                        BOOK III: General Principles
8:                        BOOK III: General Principles
9: BOOK III: General Principles / Page one: invitation