I am in the process of creating a corpus of textbooks. Along with the actual sentences, there are some metadata columns, including the type of text each sentence comes from (for example, the main body of the text, a text box, a figure, a table, or an activity).
Because of how the original text was tagged after it was scanned, I can easily mark where the different text types start and end, but I need to fill in the information for the sentences inside those two tags.
For example, I have (I have replaced the actual sentences with the word Sentences so that it fits the page):
chpt | page_num | paragraph | type | text_type | text |
---|---|---|---|---|---|
2 | 9 | 11 | main | text | Sentences |
2 | 9 | 12 | main | text | Sentences |
2 | 9 | 14 | text_box_start | header | Sentences |
2 | 9 | 15 | main | text | Sentences |
2 | 9 | 16 | main | text | Sentences |
2 | 9 | 17 | text_box_end | text | Sentences |
2 | 10 | 19 | main | text | Sentences |
2 | 10 | 20 | main | text | Sentences |
I want (the only thing that changes is the data in rows 3 to 6 of the type column):
chpt | page_num | paragraph | type | text_type | text |
---|---|---|---|---|---|
2 | 9 | 11 | main | text | Sentences |
2 | 9 | 12 | main | text | Sentences |
2 | 9 | 14 | text_box | header | Sentences |
2 | 9 | 15 | text_box | text | Sentences |
2 | 9 | 16 | text_box | text | Sentences |
2 | 9 | 17 | text_box | text | Sentences |
2 | 10 | 19 | main | text | Sentences |
2 | 10 | 20 | main | text | Sentences |
I imagine it would be possible to use a for loop with a couple of if/else statements to iterate over the "type" column, but I was wondering if there is an easier way to replace the value of all the rows between "text_box_start" and "text_box_end" in the df above with "text_box".
I am using R with the Tidyverse packages installed, so if anyone has a suggestion for a solution using base R or one of the Tidyverse packages, that would be greatly appreciated.
CodePudding user response:
library(tidyverse)
df %>%
  # Keep only the tags, then carry each tag down over the rows that follow it
  mutate(type1 = na_if(type, 'main')) %>%
  fill(type1) %>%
  # Rows after an end tag fall back to their original type
  mutate(type1 = coalesce(na_if(type1, 'text_box_end'), type),
         # Collapse both tags into the label the question asks for
         type1 = recode(type1, text_box_start = 'text_box', text_box_end = 'text_box'))
chpt page_num paragraph type text_type text type1
1 2 9 11 main text Sentences main
2 2 9 12 main text Sentences main
3 2 9 14 text_box_start header Sentences text_box
4 2 9 15 main text Sentences text_box
5 2 9 16 main text Sentences text_box
6 2 9 17 text_box_end text Sentences text_box
7 2 10 19 main text Sentences main
8 2 10 20 main text Sentences main
CodePudding user response:
One option is to count the start/end tags cumulatively. A row lies between a 'start' and an 'end' when the running count is odd, and the tag rows themselves can be matched directly. Change type to "textbox" in both cases, i.e.
library(tidyverse)
df <- read.table(text = "chpt page_num paragraph type text_type text
2 9 11 main text Sentences
2 9 12 main text Sentences
2 9 14 text_box_start header Sentences
2 9 15 main text Sentences
2 9 16 main text Sentences
2 9 17 text_box_end text Sentences
2 10 19 main text Sentences
2 10 20 main text Sentences",
header = TRUE)
# Tidyverse
df %>%
  mutate(type = ifelse(str_detect(type, "start|end") |
                         cumsum(str_detect(type, "start|end")) %% 2 == 1,
                       "textbox", type))
#> chpt page_num paragraph type text_type text
#> 1 2 9 11 main text Sentences
#> 2 2 9 12 main text Sentences
#> 3 2 9 14 textbox header Sentences
#> 4 2 9 15 textbox text Sentences
#> 5 2 9 16 textbox text Sentences
#> 6 2 9 17 textbox text Sentences
#> 7 2 10 19 main text Sentences
#> 8 2 10 20 main text Sentences
# Base R
df$type <- ifelse(grepl("start|end", df$type) |
                    cumsum(grepl("start|end", df$type)) %% 2 == 1,
                  "textbox", df$type)
df
#> chpt page_num paragraph type text_type text
#> 1 2 9 11 main text Sentences
#> 2 2 9 12 main text Sentences
#> 3 2 9 14 textbox header Sentences
#> 4 2 9 15 textbox text Sentences
#> 5 2 9 16 textbox text Sentences
#> 6 2 9 17 textbox text Sentences
#> 7 2 10 19 main text Sentences
#> 8 2 10 20 main text Sentences
Created on 2022-08-24 by the reprex package (v2.0.1)
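The question also mentions figures, tables and activities, so here is a rough base R sketch of the same cumulative-count idea that keeps the name of each block instead of one fixed label. It assumes every tag follows a `<kind>_start` / `<kind>_end` pattern and that blocks are not nested; the `figure_*` tags and the small data frame are made up for illustration and are not from the original data.

```r
# Hypothetical example data; the figure_* tags are an assumption.
df <- data.frame(
  paragraph = 1:9,
  type = c("main", "text_box_start", "main", "text_box_end", "main",
           "figure_start", "main", "figure_end", "main"),
  stringsAsFactors = FALSE
)

is_start <- grepl("_start$", df$type)
is_end   <- grepl("_end$", df$type)

# A row is inside a block if more starts than ends have been seen so far.
# Only ends on earlier rows are counted, so the end row itself stays inside.
ends_before <- cumsum(is_end) - is_end
inside <- cumsum(is_start) - ends_before > 0

# Name of the most recent start tag with the "_start" suffix removed.
# The leading NA covers rows that come before the first start tag.
start_labels <- sub("_start$", "", df$type[is_start])
last_start <- c(NA, start_labels)[cumsum(is_start) + 1]

df$type <- ifelse(inside, last_start, df$type)
df
```

With this toy input, rows 2 to 4 become "text_box" and rows 6 to 8 become "figure", while the remaining rows stay "main". If blocks can be nested or tags can be missing, this approach would need more care.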
CodePudding user response:
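Assuming all text boxes have both a start and an end tag (as is usual in valid HTML), you can grep for 'text_box' and arrange the matching row numbers in a two-row matrix. Each column then holds the first and last row of one text box, i.e. the edges of the row sequence to change to 'text_box'. The output below comes from the example rows stacked twice.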
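# simplify = FALSE plus unlist() keeps this working when text boxes differ in length (needs R >= 4.1)
dat[unlist(apply(matrix(grep('text_box', dat$type), 2), 2, \(x) do.call(seq, as.list(x)), simplify = FALSE)), 'type'] <- 'text_box'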
dat
# chpt page_num paragraph type text_type text
# 1 2 9 11 main text Sentences
# 2 2 9 12 main text Sentences
# 3 2 9 14 text_box header Sentences
# 4 2 9 15 text_box text Sentences
# 5 2 9 16 text_box text Sentences
# 6 2 9 17 text_box text Sentences
# 7 2 10 19 main text Sentences
# 8 2 10 20 main text Sentences
# 9 2 9 11 main text Sentences
# 10 2 9 12 main text Sentences
# 11 2 9 14 text_box header Sentences
# 12 2 9 15 text_box text Sentences
# 13 2 9 16 text_box text Sentences
# 14 2 9 17 text_box text Sentences
# 15 2 10 19 main text Sentences
# 16 2 10 20 main text Sentences
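Since this approach relies on every start tag having a matching end tag, it may be worth checking that before overwriting anything. A minimal sketch, assuming the tags end in `_start` and `_end`; the helper name `tags_are_paired` is hypothetical and not part of any package:

```r
# Hypothetical helper: TRUE when the tags in `type` strictly alternate
# start, end, start, end, ... It does not check that the kinds match.
tags_are_paired <- function(type) {
  tags <- type[grepl("_(start|end)$", type)]
  expected <- rep(c("start", "end"), length.out = length(tags))
  length(tags) %% 2 == 0 && all(sub(".*_", "", tags) == expected)
}

ok  <- tags_are_paired(c("main", "text_box_start", "main", "text_box_end"))
bad <- tags_are_paired(c("main", "text_box_start", "main"))
```

Here `ok` is TRUE and `bad` is FALSE, because the second vector has a start tag that is never closed. Running something like `stopifnot(tags_are_paired(dat$type))` first would stop with an error instead of silently pairing up the wrong rows.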