I'm trying to tidy some archived OCR files. One step is to detect subheaders in the document. Because some subheaders span two lines, their second line ends up in a separate row from the start of the header.
Example:
df <- data.frame(header = c("1. hello", "2. halo", "hallow"), line_id = c(28:30))
I want to delete each row that does not start with a digit, but first append its header content to the header of the row above.
Expected result:
df_clean <- data.frame(header = c("1. hello", "2. halo hallow"), line_id = c(28,29))
CodePudding user response:
One approach is to group the rows so that each group starts at a row whose header begins with a number, and then combine the rows in each group with paste(). This also handles subheaders that span more than two rows.
library(tidyverse)
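df %>%
  # start a new group at every header that begins with digits followed by a dot
  group_by(grp = cumsum(grepl("^\\d+\\.", header))) %>%
  # join the pieces of each header and keep the first line_id of the group
  summarise(header = paste(header, collapse = " "), line_id = first(line_id))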
Output
    grp header         line_id
  <int> <chr>            <int>
1     1 1. hello            28
2     2 2. halo hallow      29
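To see why the grouping works, it may help to look at the intermediate vector. The following is a minimal base R sketch of the same idea without the tidyverse; the names starts_header and grp are only illustrative, and it assumes that every real subheader starts with digits followed by a dot.

```r
# Same example data as in the question
df <- data.frame(header = c("1. hello", "2. halo", "hallow"), line_id = 28:30)

# TRUE where a row starts a new subheader (assumption: digits followed by a dot)
starts_header <- grepl("^\\d+\\.", df$header)

# Running count of header starts; a continuation row keeps the group
# number of the header above it, so grp is c(1, 2, 2) here
grp <- cumsum(starts_header)

# Collapse each group into one row, keeping the first line_id of the group
df_clean <- data.frame(
  header  = as.vector(tapply(df$header, grp, paste, collapse = " ")),
  line_id = as.vector(tapply(df$line_id, grp, function(x) x[1]))
)

print(df_clean)
```

Note that if the very first row did not start with a number, it would land in a group 0 of its own rather than being attached to anything, which may or may not be what you want.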