Merging two rows in R while appending a specific column of the first row with a string from the seco...

I'm trying to tidy some archived OCR files. One step involves detecting subheaders in the document. Because some subheaders span two lines, their second line ends up in a separate row from the beginning of the respective header.

Example:

df <- data.frame(header = c("1. hello", "2. halo", "hallow"), line_id = c(28:30))

I want to delete each row that does not start with a digit, but first paste its header content onto the end of the header in the row above.

Expected result:

df_clean <- data.frame(header = c("1. hello", "2. halo hallow"), line_id = c(28,29))

CodePudding user response:

One approach might be to "group" the rows so that each group starts at a row whose header begins with a number, and then combine the rows in each group with paste. This also handles subheaders that are split across more than two rows.

library(tidyverse)

df %>%
  group_by(grp = cumsum(grepl("^\\d+\\.", header))) %>%
  summarise(header = paste(header, collapse = " "), line_id = first(line_id))

Output

    grp header         line_id
  <int> <chr>            <int>
1     1 1. hello            28
2     2 2. halo hallow      29
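
For anyone who wants to see what the grouping key looks like, or who would rather avoid loading the tidyverse, here is a minimal base R sketch of the same idea. The names starts_header and grp are only illustrative, and it assumes that every subheader starts with digits followed by a period and that the first row is such a subheader.

```r
# Example data from the question
df <- data.frame(header = c("1. hello", "2. halo", "hallow"), line_id = 28:30)

# TRUE where a row starts a new subheader (digits followed by a period);
# this pattern is an assumption based on the example data
starts_header <- grepl("^[0-9]+\\.", df$header)

# Running count of subheader starts: a continuation row keeps the
# group number of the subheader above it
grp <- cumsum(starts_header)
# grp is 1 2 2

# Collapse each group into one row, keeping the first line_id of the group
df_clean <- data.frame(
  header  = unname(vapply(split(df$header, grp), paste, character(1), collapse = " ")),
  line_id = unname(vapply(split(df$line_id, grp), function(x) x[1], integer(1)))
)
```

If the first row of the real data is not a subheader, it ends up in a group numbered 0, so it is worth checking grp before collapsing.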