arrange contents in a dataframe extracted from docx in R-CodePudding

I have a document (.docx), found in the link below, which I have extracted the content using officer package.

using the code below, I have extracted the contents of this document.

doc <- read_docx("test.docx")
content <- docx_summary(doc)
head(content)

#To get all paragraphs:
par_data <- subset(content, content_type %in% "paragraph") 
par_data <- par_data[, c("doc_index", "style_name", 
                         "text") ]
par_data$text <- with(par_data, {
  substr(
    text, start = 1, 
    stop = ifelse(nchar(text)<30, nchar(text), 30) )
})
par_data

the dataframe can be reproduced using the following code.

par_data <- data.frame(doc_index = 1:21, 
                   style_name = c("heading 1", "heading 2", "heading 3",NA ,NA,NA, "heading 2", "heading 3", NA,NA,NA, NA,"heading 2", "heading 3", NA, NA, "heading 1", "heading 2","heading 3", NA,NA ), 
                   text = c(' Cardiovascular drugs ', ' ACE inhibitors. ', ' Valsartan ', ' Valsartan is used to treat hig ', ' Side effects ', ' high potassium; headache, dizz ', ' Beta blockers. ', ' propranolol ', ' Propranolol is prescribed for  ', ' Side effects ', ' slow or uneven heartbeats', ' wheezing or trouble breathing ', ' Calcium channel blockers. ', ' Nifedipine ', ' Side effects ', ' Bloating or swelling of the fa ', ' Neurological drugs ', ' Anticonvulsants ', ' Phenytoin  ', ' Side effects ', ' Decreased coordination, mental '))

what I need is to reshape this dataframe to have something like this:

In fact, I need headings 1 and 2 as columns where each drug (which are all heading 3s) get the text of the last heading in these columns. also, I need two other columns. Some drugs have descriptions and then side effects and others just have side effects, which are in the rows before the next heading 1 or 2 or 3 comes. Is there a straightforward way to accomplish this? Any help is appreciated.

CodePudding user response：

This is a bit more than just reshaping, requiring some inference based on previous text and style_name values, plus "last observation carry-forward" (locf). The data also has blank space at the beginning/end of strings, so I'll clean them up with trimws.

dplyr

I think this does what you want:

library(dplyr)
# library(tidyr) # fill
par_data %>%
  mutate(across(where(is.character), trimws)) %>%
  mutate(
    grp = cumsum(is.na(lag(style_name)) & !is.na(style_name)),
    style_name = case_when(
      is.na(style_name) & lag(text) == "Side effects" ~ "sideeffects",
      is.na(style_name) & lag(style_name) == "heading 3" &
        !text %in% "Side effects" ~ "description",
      TRUE ~ style_name)
  ) %>%
  filter(!is.na(style_name)) %>%
  pivot_wider(grp, names_from = "style_name", values_from = "text") %>%
  tidyr::fill(`heading 1`)
# # A tibble: 4 x 6
#     grp `heading 1`          `heading 2`               `heading 3` description                    sideeffects
#   <int> <chr>                <chr>                     <chr>       <chr>                          <chr>      
# 1     1 Cardiovascular drugs ACE inhibitors.           Valsartan   Valsartan is used to treat hig high potas~
# 2     2 Cardiovascular drugs Beta blockers.            propranolol Propranolol is prescribed for  slow or un~
# 3     3 Cardiovascular drugs Calcium channel blockers. Nifedipine  NA                             Bloating o~
# 4     4 Neurological drugs   Anticonvulsants           Phenytoin   NA                             Decreased ~

This could be done in other than tidyverse, though it'll still benefit from an external package function (reshape2::dcast) ... stats::reshape can be a bit of a chore to work with.

data.table

If you are already using (or considering) data.table, this is the rough equivalent to the above:

library(data.table)
chrs <- which(sapply(par_data, is.character))
as.data.table(par_data)[, c(chrs) := lapply(.SD, trimws), .SDcols = chrs
  ][, grp := cumsum(is.na(shift(style_name)) & !is.na(style_name))
    ][, style_name := fcase(
        is.na(style_name) & shift(text) == "Side effects", "sideeffects",
        is.na(style_name) & lag(style_name) == "heading 3" &
          !text %in% "Side effects", "description",
        rep(TRUE, .N),  style_name)
      ][!is.na(style_name),
        ][, dcast(grp ~ style_name, value.var = "text", data = .SD)
          ][, `heading 1` := zoo::na.locf(`heading 1`)
            ][, .(`heading 1`, `heading 2`, `heading 3`, description, sideeffects) ]
#               heading 1                 heading 2   heading 3                    description                    sideeffects
# 1: Cardiovascular drugs           ACE inhibitors.   Valsartan Valsartan is used to treat hig high potassium; headache, dizz
# 2: Cardiovascular drugs            Beta blockers. propranolol  Propranolol is prescribed for      slow or uneven heartbeats
# 3: Cardiovascular drugs Calcium channel blockers.  Nifedipine                           <NA> Bloating or swelling of the fa
# 4:   Neurological drugs           Anticonvulsants   Phenytoin                           <NA> Decreased coordination, mental