Is there an easy way to fill in between 2 values in R?

I am in the process of creating a corpus of textbooks. Along with the actual sentences, there are some metadata columns, including the type of text the sentence is (for example, it is from the main body of texts, a text box, a figure, a table, or an activity).

Because of how the original text was tagged after it was scanned, I can easily mark where the different text types start and end, but I need to fill in the information for the sentences inside those two tags.

For example, I have (I have replaced the actual sentences with the word Sentences so that it fits the page):

chpt	page_num	paragraph	type	text_type	text
2	9	11	main	text	Sentences
2	9	12	main	text	Sentences
2	9	14	text_box_start	header	Sentences
2	9	15	main	text	Sentences
2	9	16	main	text	Sentences
2	9	17	text_box_end	text	Sentences
2	10	19	main	text	Sentences
2	10	20	main	text	Sentences

I want (the only thing that changes is the data in rows 3 to 6 of the type column):

chpt	page_num	paragraph	type	text_type	text
2	9	11	main	text	Sentences
2	9	12	main	text	Sentences
2	9	14	text_box	header	Sentences
2	9	15	text_box	text	Sentences
2	9	16	text_box	text	Sentences
2	9	17	text_box	text	Sentences
2	10	19	main	text	Sentences
2	10	20	main	text	Sentences

I imagine it would be possible to use a for loop with a couple of if/else statements to iterate over the "type" column, but I was wondering if there is an easier way to replace the value of every row from "text_box_start" through "text_box_end" in the df above with "text_box".

I am using R with the Tidyverse packages installed, so if anyone has a suggestion for a solution using base R or one of the Tidyverse packages, that would be greatly appreciated.

CodePudding user response：

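library(tidyverse)

df %>%
  # blank out 'main' so that only the start/end tags remain
  mutate(type1 = na_if(type, 'main')) %>%
  # carry each tag down to the rows that follow it
  fill(type1) %>%
  # rows filled from an end tag get their original type back (the end row stays
  # 'text_box_end'); the end row is then relabelled to match the rest of the box
  mutate(type1 = coalesce(na_if(type1, 'text_box_end'), type),
         type1 = recode(type1, text_box_end = 'text_box_start'))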

   chpt page_num paragraph           type text_type      text          type1
1     2        9        11           main      text Sentences           main
2     2        9        12           main      text Sentences           main
3     2        9        14 text_box_start    header Sentences text_box_start
4     2        9        15           main      text Sentences text_box_start
5     2        9        16           main      text Sentences text_box_start
6     2        9        17   text_box_end      text Sentences text_box_start
7     2       10        19           main      text Sentences           main
8     2       10        20           main      text Sentences           main
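
The output above leaves the new label as text_box_start in a separate type1 column. Here is a minimal sketch of one way to adapt the same fill-down idea so that type itself ends up as text_box, as in the question's desired result. The helper columns marker and in_box and the cut-down one-column df are assumptions for illustration:

```r
library(dplyr)
library(tidyr)

# Cut-down stand-in for the question's data: only the 'type' column
df <- data.frame(type = c("main", "main", "text_box_start", "main",
                          "main", "text_box_end", "main", "main"))

result <- df %>%
  # keep only the tags, then carry each one down to the following rows
  mutate(marker = na_if(type, "main")) %>%
  fill(marker) %>%
  # a row is in the box if the last tag seen was a start, or it is the end row
  mutate(in_box = coalesce(marker == "text_box_start", FALSE) |
           type == "text_box_end",
         type = if_else(in_box, "text_box", type)) %>%
  select(-marker, -in_box)

result
```

Like the answer above, this assumes every text_box_start is eventually followed by a text_box_end.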

CodePudding user response：

One option is to cumulatively count the start/end tags: a row whose running count is odd lies between a 'start' and an 'end'. Change type to "textbox" for those rows and for the tag rows themselves (any type containing "start" or "end"), i.e.

library(tidyverse)

df <- read.table(text = "chpt   page_num    paragraph   type    text_type   text
2   9   11  main    text    Sentences
2   9   12  main    text    Sentences
2   9   14  text_box_start  header  Sentences
2   9   15  main    text    Sentences
2   9   16  main    text    Sentences
2   9   17  text_box_end    text    Sentences
2   10  19  main    text    Sentences
2   10  20  main    text    Sentences",
header = TRUE)

# Tidyverse
df %>%
  mutate(type = ifelse(str_detect(type, "start|end") |
                         cumsum(str_detect(type, "start|end")) %% 2 == 1,
                       "textbox", type))
#>   chpt page_num paragraph    type text_type      text
#> 1    2        9        11    main      text Sentences
#> 2    2        9        12    main      text Sentences
#> 3    2        9        14 textbox    header Sentences
#> 4    2        9        15 textbox      text Sentences
#> 5    2        9        16 textbox      text Sentences
#> 6    2        9        17 textbox      text Sentences
#> 7    2       10        19    main      text Sentences
#> 8    2       10        20    main      text Sentences

# Base R
df$type <- ifelse(grepl("start|end", df$type) |
                    cumsum(grepl("start|end", df$type)) %% 2 == 1,
                  "textbox", df$type)
df
#>   chpt page_num paragraph    type text_type      text
#> 1    2        9        11    main      text Sentences
#> 2    2        9        12    main      text Sentences
#> 3    2        9        14 textbox    header Sentences
#> 4    2        9        15 textbox      text Sentences
#> 5    2        9        16 textbox      text Sentences
#> 6    2        9        17 textbox      text Sentences
#> 7    2       10        19    main      text Sentences
#> 8    2       10        20    main      text Sentences

Created on 2022-08-24 by the reprex package (v2.0.1)
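
To see why the odd/even test works, it can help to print the intermediate vectors. Below is a small base R sketch on the question's type column. The variable names is_tag, n_tags and inside are made up for illustration, and the label is written as "text_box" to match the question:

```r
# The 'type' column from the question's eight rows
type <- c("main", "main", "text_box_start", "main", "main",
          "text_box_end", "main", "main")

is_tag <- grepl("start|end", type)  # TRUE on the start and end rows
n_tags <- cumsum(is_tag)            # running count of tags seen so far
inside <- n_tags %% 2 == 1          # odd count: a box has opened but not yet closed

# Show the intermediate steps side by side
print(data.frame(type, is_tag, n_tags, inside))

# The end row has an even count, so it is caught by is_tag rather than inside
filled <- ifelse(is_tag | inside, "text_box", type)
filled
```

Note that the pattern "start|end" matches those substrings anywhere in the value, so a type such as "legend" would also count as a tag. Anchoring the pattern, for example "_(start|end)$", is a stricter alternative if such names can occur.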

CodePudding user response：

Assuming all text boxes have a start and an end (as is usual in valid HTML), you can grep for 'text_box' and arrange the matching row indices in a 2-row matrix, so that each column holds the first and last row of one text box. Each column is then expanded into the sequence of rows to change to 'text_box'.

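# unlist() flattens the result when boxes differ in length and apply() returns a list
rows <- unlist(apply(matrix(grep('text_box', dat$type), 2), 2, \(x) do.call(seq, as.list(x))))
dat[rows, 'type'] <- 'text_box'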
dat
#    chpt page_num paragraph     type text_type      text
# 1     2        9        11     main      text Sentences
# 2     2        9        12     main      text Sentences
# 3     2        9        14 text_box    header Sentences
# 4     2        9        15 text_box      text Sentences
# 5     2        9        16 text_box      text Sentences
# 6     2        9        17 text_box      text Sentences
# 7     2       10        19     main      text Sentences
# 8     2       10        20     main      text Sentences
# 9     2        9        11     main      text Sentences
# 10    2        9        12     main      text Sentences
# 11    2        9        14 text_box    header Sentences
# 12    2        9        15 text_box      text Sentences
# 13    2        9        16 text_box      text Sentences
# 14    2        9        17 text_box      text Sentences
# 15    2       10        19     main      text Sentences
# 16    2       10        20     main      text Sentences
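
A step-by-step sketch of the same idea, using a made-up type vector with two text boxes of different lengths. It swaps apply() for Map(), which always returns a list, so the boxes do not need to be the same size. The vector and the variable names tags, edges and rows are assumptions for illustration:

```r
# Hypothetical 'type' column with two text boxes of different lengths
type <- c("main", "text_box_start", "main", "text_box_end",
          "main", "text_box_start", "main", "main", "main", "text_box_end")

tags  <- grep("text_box", type)   # positions of all start/end tags, in order
edges <- matrix(tags, nrow = 2)   # one column per box: row 1 = start, row 2 = end

# Expand each start/end pair into its full run of row numbers
rows <- unlist(Map(seq, edges[1, ], edges[2, ]))

type[rows] <- "text_box"
type
```

This still relies on the tags strictly alternating start, end, start, end. An unmatched tag would shift every later pair, so it may be worth checking that length(tags) is even before building the matrix.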