How to break down one observation into several sub-observations? [closed]

Time:09-27

My data frame contains several collected articles: df$title holds the title and df$text holds the content of each article. I need to break each article down into several paragraphs. Here is how I break down just ONE article:

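library(stringr)

# Delimiter: "Mr.", "Mrs." or "Ms." followed by whitespace
pattern <- "\\bM(?:rs?|s)\\.\\s"
aa <- str_replace_all(text1, pattern, "XXXX")      # mark each delimiter
bb <- unlist(strsplit(aa, "XXXX"))                 # split on the marker
cc <- bb[-1]                                       # drop the text before the first marker
dd <- gsub("[\\]", " ", cc)                        # replace backslashes with spaces
paragraph_vector <- gsub("[^[:alnum:]]", " ", dd)  # replace other non-alphanumerics with spaces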

How can I label each paragraph with the title of its article and apply this breakdown to the whole column (df$text)? I want each paragraph to become one observation (instead of each article being one observation).

CodePudding user response:

This is a simple example in which paragraphs are separated by a blank line (two consecutive newline characters, "\n\n"):

library(tidyverse)

data <- tibble(
  title = c("The Book of words", "A poem"),
  text = c("It was a dark and stormy night. \n\n And this is another paragraph.", "This\n\nis\n\nthe\n\nEnd")
)

cat(data$text[[1]])
#> It was a dark and stormy night. 
#> 
#>  And this is another paragraph.
cat(data$text[[2]])
#> This
#> 
#> is
#> 
#> the
#> 
#> End

data %>%
  transmute(
    title,
    paragraph = text %>% map(~ {
      .x %>%
        str_split("\n\n") %>%
        simplify() %>%
        map_chr(str_trim)
    })
  ) %>%
  unnest(paragraph)
#> # A tibble: 6 × 2
#>   title             paragraph                      
#>   <chr>             <chr>                          
#> 1 The Book of words It was a dark and stormy night.
#> 2 The Book of words And this is another paragraph. 
#> 3 A poem            This                           
#> 4 A poem            is                             
#> 5 A poem            the                            
#> 6 A poem            End

Created on 2021-09-26 by the reprex package (v2.0.1)
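
The same split-then-unnest idea can probably be adapted to the delimiter used in the question ("Mr.", "Mrs." or "Ms." followed by whitespace). The following is only a sketch under assumptions: the sample `df` is made up for illustration, the name `paragraphs` is hypothetical, and the final `str_squish()` step is an addition that is not in the question's code (it collapses the runs of spaces left behind by the cleanup).

```r
library(dplyr)
library(stringr)
library(tidyr)

# Made-up sample data standing in for the asker's df (title + text columns).
df <- tibble(
  title = c("Article A", "Article B"),
  text  = c(
    "Intro line. Mr. Smith said hello! Mrs. Jones replied: fine.",
    "Header. Ms. Lee spoke."
  )
)

# Delimiter taken from the question.
pattern <- "\\bM(?:rs?|s)\\.\\s"

paragraphs <- df %>%
  # str_split() is vectorised: it returns one character vector per article.
  mutate(paragraph = str_split(text, pattern)) %>%
  # Drop the text before the first delimiter, as bb[-1] does in the question.
  mutate(paragraph = lapply(paragraph, function(p) p[-1])) %>%
  select(title, paragraph) %>%
  # One row per paragraph; the title is repeated for each of them.
  unnest(paragraph) %>%
  mutate(
    # Same cleanup as the question: non-alphanumerics become spaces.
    paragraph = str_replace_all(paragraph, "[^[:alnum:]]", " "),
    # Assumption, not in the question: trim and collapse repeated spaces.
    paragraph = str_squish(paragraph)
  )

paragraphs
```

With this sample data the result has three rows, two labelled "Article A" and one labelled "Article B". An article whose text contains no delimiter yields an empty vector after the `p[-1]` step, so `unnest()` drops it by default; passing `keep_empty = TRUE` would keep such an article as a row with a missing paragraph.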
