I have a document (.docx), found in the link below, which I have extracted the content using officer package.
using the code below, I have extracted the contents of this document.
doc <- read_docx("test.docx")
content <- docx_summary(doc)
head(content)
#To get all paragraphs:
par_data <- subset(content, content_type %in% "paragraph")
par_data <- par_data[, c("doc_index", "style_name",
"text") ]
par_data$text <- with(par_data, {
substr(
text, start = 1,
stop = ifelse(nchar(text)<30, nchar(text), 30) )
})
par_data
the dataframe can be reproduced using the following code.
par_data <- data.frame(doc_index = 1:21,
style_name = c("heading 1", "heading 2", "heading 3",NA ,NA,NA, "heading 2", "heading 3", NA,NA,NA, NA,"heading 2", "heading 3", NA, NA, "heading 1", "heading 2","heading 3", NA,NA ),
text = c(' Cardiovascular drugs ', ' ACE inhibitors. ', ' Valsartan ', ' Valsartan is used to treat hig ', ' Side effects ', ' high potassium; headache, dizz ', ' Beta blockers. ', ' propranolol ', ' Propranolol is prescribed for ', ' Side effects ', ' slow or uneven heartbeats', ' wheezing or trouble breathing ', ' Calcium channel blockers. ', ' Nifedipine ', ' Side effects ', ' Bloating or swelling of the fa ', ' Neurological drugs ', ' Anticonvulsants ', ' Phenytoin ', ' Side effects ', ' Decreased coordination, mental '))
what I need is to reshape this dataframe to have something like this:
In fact, I need headings 1 and 2 as columns where each drug (which are all heading 3s) get the text of the last heading in these columns. also, I need two other columns. Some drugs have descriptions and then side effects and others just have side effects, which are in the rows before the next heading 1 or 2 or 3 comes. Is there a straightforward way to accomplish this? Any help is appreciated.
CodePudding user response:
This is a bit more than just reshaping, requiring some inference based on previous text
and style_name
values, plus "last observation carry-forward" (locf). The data also has blank space at the beginning/end of strings, so I'll clean them up with trimws
.
dplyr
I think this does what you want:
library(dplyr)
# library(tidyr) # fill
par_data %>%
mutate(across(where(is.character), trimws)) %>%
mutate(
grp = cumsum(is.na(lag(style_name)) & !is.na(style_name)),
style_name = case_when(
is.na(style_name) & lag(text) == "Side effects" ~ "sideeffects",
is.na(style_name) & lag(style_name) == "heading 3" &
!text %in% "Side effects" ~ "description",
TRUE ~ style_name)
) %>%
filter(!is.na(style_name)) %>%
pivot_wider(grp, names_from = "style_name", values_from = "text") %>%
tidyr::fill(`heading 1`)
# # A tibble: 4 x 6
# grp `heading 1` `heading 2` `heading 3` description sideeffects
# <int> <chr> <chr> <chr> <chr> <chr>
# 1 1 Cardiovascular drugs ACE inhibitors. Valsartan Valsartan is used to treat hig high potas~
# 2 2 Cardiovascular drugs Beta blockers. propranolol Propranolol is prescribed for slow or un~
# 3 3 Cardiovascular drugs Calcium channel blockers. Nifedipine NA Bloating o~
# 4 4 Neurological drugs Anticonvulsants Phenytoin NA Decreased ~
This could be done in other than tidyverse, though it'll still benefit from an external package function (reshape2::dcast
) ... stats::reshape
can be a bit of a chore to work with.
data.table
If you are already using (or considering) data.table
, this is the rough equivalent to the above:
library(data.table)
chrs <- which(sapply(par_data, is.character))
as.data.table(par_data)[, c(chrs) := lapply(.SD, trimws), .SDcols = chrs
][, grp := cumsum(is.na(shift(style_name)) & !is.na(style_name))
][, style_name := fcase(
is.na(style_name) & shift(text) == "Side effects", "sideeffects",
is.na(style_name) & lag(style_name) == "heading 3" &
!text %in% "Side effects", "description",
rep(TRUE, .N), style_name)
][!is.na(style_name),
][, dcast(grp ~ style_name, value.var = "text", data = .SD)
][, `heading 1` := zoo::na.locf(`heading 1`)
][, .(`heading 1`, `heading 2`, `heading 3`, description, sideeffects) ]
# heading 1 heading 2 heading 3 description sideeffects
# 1: Cardiovascular drugs ACE inhibitors. Valsartan Valsartan is used to treat hig high potassium; headache, dizz
# 2: Cardiovascular drugs Beta blockers. propranolol Propranolol is prescribed for slow or uneven heartbeats
# 3: Cardiovascular drugs Calcium channel blockers. Nifedipine <NA> Bloating or swelling of the fa
# 4: Neurological drugs Anticonvulsants Phenytoin <NA> Decreased coordination, mental