I am hoping that someone can help me with this. I am trying to create a corpus of texts. I have scanned the texts in and OCRed them, cleaned the text, and imported the text into a data frame in R.
Because of the way the scanning process worked, there is a carriage return at the end of each line, and a blank line after each paragraph.
I am wondering if there is an easy way to combine all of the rows from the same paragraph into one row. In other words, concatenate consecutive rows that contain text, and leave the blank rows between paragraphs as they are (whether they hold "" or NA; either would be fine).
Here is an example of what I have (the actual df is much larger):
| | chpt | page | text |
|---|---|---|---|
| 1 | Chpt 01 | 2 | Equations and formulae |
| 2 | Chpt 01 | 2 | |
| 3 | Chpt 01 | 2 | Equations, identities and formulae |
| 4 | Chpt 01 | 2 | |
| 5 | Chpt 01 | 2 | You will encounter a wide variety of equations in this course. Essentially, an |
| 6 | Chpt 01 | 2 | equation is a statement equating two algebraic expressions that may be true or |
| 7 | Chpt 01 | 2 | false depending upon the value(s) substituted for the variable(s). Values of the |
| 8 | Chpt 01 | 2 | variables that make the equation true are called solutions or roots of the equation. |
| 9 | Chpt 01 | 2 | All of the solutions to an equation comprise the solution set of the equation. |
| 10 | Chpt 01 | 2 | |
| 11 | Chpt 01 | 2 | An equation that is true for all possible values of the variable is called an identity. |
| 12 | Chpt 01 | 2 | |
| 13 | Chpt 01 | 2 | - different methods to solve a system of linear equations (maximum of |
| 14 | Chpt 01 | 2 | three equations in three unknowns) |
What I would like:
| | chpt | page | text |
|---|---|---|---|
| 1 | Chpt 01 | 2 | Equations and formulae |
| 2 | Chpt 01 | 2 | |
| 3 | Chpt 01 | 2 | Equations, identities and formulae |
| 4 | Chpt 01 | 2 | |
| 5 | Chpt 01 | 2 | You will encounter a wide variety of equations in this course. Essentially, an equation is a statement equating two algebraic expressions that may be true or false depending upon the value(s) substituted for the variable(s). Values of the variables that make the equation true are called solutions or roots of the equation. All of the solutions to an equation comprise the solution set of the equation. |
| 10 | Chpt 01 | 2 | |
| 11 | Chpt 01 | 2 | An equation that is true for all possible values of the variable is called an identity. |
| 12 | Chpt 01 | 2 | |
| 13 | Chpt 01 | 2 | - different methods to solve a system of linear equations (maximum of three equations in three unknowns) |
The example df itself looks like this:
structure(list(chpt = c("Chpt 01", "Chpt 01", "Chpt 01", "Chpt 01",
"Chpt 01", "Chpt 01", "Chpt 01", "Chpt 01", "Chpt 01", "Chpt 01",
"Chpt 01", "Chpt 01", "Chpt 01", "Chpt 01"), page = c(2L, 2L,
2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L, 2L), text = c("Equations and formulae",
"", "Equations, identities and formulae", "", "You will encounter a wide variety of equations in this course. Essentially, an",
"equation is a statement equating two algebraic expressions that may be true or",
"false depending upon the value(s) substituted for the variable(s). Values of the",
"variables that make the equation true are called solutions or roots of the equation.",
"All of the solutions to an equation comprise the solution set of the equation.",
"", "An equation that is true for all possible values of the variable is called an identity.",
"", "- different methods to solve a system of linear equations (maximum of",
"three equations in three unknowns)")), class = "data.frame", row.names = c(NA,
-14L))
CodePudding user response:
One possible solution
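# Flag non-empty rows and label each run of consecutive blank / non-blank rows
tmp <- rle(df$text != "")
df$grp <- rep(seq_along(tmp$lengths), tmp$lengths)

# Concatenate the text within each run
aggregate(
  text ~ chpt + page + grp,
  data = df,
  FUN = paste0,
  collapse = ""
)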
which looks something like this
1 Chpt 01 2 1 Equations and formulae
2 Chpt 01 2 2
3 Chpt 01 2 3 Equations, identities and formulae
4 Chpt 01 2 4
5 Chpt 01 2 5 You will encounter a wide variety of equations in this course. Essentially, anequation is a statement equating two algebraic expressions that may be true orfalse depending upon the value(s) substituted for the variable(s). Values of thevariables that make the equation true are called solutions or roots of the equation.All of the solutions to an equation comprise the solution set of the equation.
6 Chpt 01 2 6
7 Chpt 01 2 7 An equation that is true for all possible values of the variable is called an identity.
8 Chpt 01 2 8
9 Chpt 01 2 9 - different methods to solve a system of linear equations (m...
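Note that in the output above the wrapped lines are joined with no space between them ("anequation", "orfalse"), because `collapse` is the empty string. A minimal sketch of a variant that joins with a single space and also treats `NA` as a blank row is shown below. The small `df` here is a shortened stand-in for the question's data, and the names `blank`, `runs` and `res` are only illustrative.

```r
# Shortened stand-in for the question's data frame (illustrative only)
df <- data.frame(
  chpt = "Chpt 01",
  page = 2L,
  text = c("Equations and formulae", "",
           "You will encounter a wide variety of equations. Essentially, an",
           "equation is a statement equating two algebraic expressions.",
           NA, "An equation that is true for all values is called an identity."),
  stringsAsFactors = FALSE
)

# Treat both "" and NA as paragraph separators
blank <- is.na(df$text) | df$text == ""

# Number each run of consecutive blank / non-blank rows
runs <- rle(blank)
df$grp <- rep(seq_along(runs$lengths), runs$lengths)

# The formula method of aggregate() omits rows with NA by default,
# so turn NA into "" first to keep the separator rows
df$text[is.na(df$text)] <- ""

# Join the lines of each run with a single space
res <- aggregate(text ~ chpt + page + grp, data = df, FUN = paste, collapse = " ")
res
```

One caveat with this sketch: a run of two or more consecutive blank rows would collapse to a string of spaces rather than "", so `trimws(res$text)` may be worth applying afterwards if that can occur in the real data.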
CodePudding user response:
With dplyr and data.table::rleid:
library(dplyr)
df %>%
group_by(gp = data.table::rleid(text != "")) %>%
summarise(across(c(chpt, page), first),
text = paste(text, collapse = " "))
Output (the strings are truncated in the printout):
gp chpt page text
<int> <chr> <int> <chr>
1 1 Chpt 01 2 "Equations and formulae"
2 2 Chpt 01 2 ""
3 3 Chpt 01 2 "Equations, identities and formul…"
4 4 Chpt 01 2 ""
5 5 Chpt 01 2 "You will encounter a wide variet…"
6 6 Chpt 01 2 ""
7 7 Chpt 01 2 "An equation that is true for all…"
8 8 Chpt 01 2 ""
9 9 Chpt 01 2 "- different methods to solve a s…"
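If pulling in data.table only for `rleid()` is not desirable, the same run ids can be computed in base R. The sketch below is one way to do it, not part of the answer above: a new run starts at the first row and wherever the "has text" flag differs from the previous row, so a cumulative sum of those change points gives the group number. The small `df` and the names `flag`, `gp` and `out` are made up for illustration.

```r
# Tiny illustrative data frame with the same layout as in the question
df <- data.frame(
  chpt = "Chpt 01",
  page = 2L,
  text = c("Title", "", "first line of a", "paragraph split over",
           "three rows.", "", "Last one."),
  stringsAsFactors = FALSE
)

# TRUE where the row holds text, FALSE for blank separator rows
flag <- df$text != ""

# A run starts at row 1 and at every row whose flag differs from the row before
gp <- cumsum(c(TRUE, flag[-1] != flag[-length(flag)]))

# Keep chpt/page from the first row of each run and paste the text together
first_row <- !duplicated(gp)
out <- data.frame(
  gp   = gp[first_row],
  chpt = df$chpt[first_row],
  page = df$page[first_row],
  text = as.vector(tapply(df$text, gp, paste, collapse = " ")),
  stringsAsFactors = FALSE
)
out
```

The resulting `gp` vector can also be passed straight to `group_by()` in the dplyr pipeline in place of the `rleid()` call.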