R - Remove all line breaks between repeating character

I am cleaning data for a sentiment analysis and am working with a large dataset of news articles stored in a data frame. I need one article per row, so I am looking for a way to remove the line breaks between the first '======' and the second '======', repeated for every article in the data frame. After the content has been collapsed into a single row, I would like the publisher and date columns to remain.

df <-  matrix(c("======","NA","NA","Daily Bugle Dec 31","Daily Bugle", "Dec 31" ,"Wookies are","NA","NA",". recreationally", "NA","NA", "using drugs at a", "NA", "NA", "higher rate than", "NA", "NA","ever before.", "NA", "NA","======", "NA", "NA" ),ncol=3,byrow=TRUE)
colnames(df) <- c("content","publisher","date")
df <- as.data.frame(df)
df[ df == "NA" ] <- NA

Gives this:

content              publisher    date
======               <NA>         <NA>
Daily Bugle Dec 31   Daily Bugle  Dec 31
Wookies are          <NA>         <NA>
. recreationally     <NA>         <NA>
using drugs at a     <NA>         <NA>
higher rate than     <NA>         <NA>
ever before.         <NA>         <NA>
======               <NA>         <NA>

I would like something like this:

content                                           publisher     date
======
Wookies are recreationally using drugs at a hig... Daily Bugle Dec 31           
======
Article 2
======
Article 3
======

Hope this was clear. I am relatively new to R.

CodePudding user response：

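Every article starts with a '======' row, so a running count of those rows can serve as an article number.
After removing the delimiter rows, drop the first remaining value of content in each article (the header line that repeats the publisher and date).
Keep the first value of publisher and date.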

library(dplyr)

df %>%
  # Number the articles: every '===' row increments the counter
  mutate(article_no = cumsum(grepl('===', content))) %>%
  # Drop the delimiter rows themselves
  filter(!grepl('===', content)) %>%
  group_by(article_no) %>%
  # content[-1] skips the header line; join the remaining lines with spaces
  summarise(content = paste(content[-1], collapse = ' '), 
            publisher = publisher[1], 
            date = date[1])

#  article_no content                                                                     publisher   date  
#       <int> <chr>                                                                       <chr>       <chr> 
#1          1 Wookies are . recreationally using drugs at a higher rate than ever before. Daily Bugle Dec 31
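
To see why the running count works as an article number, here is a minimal base R sketch on a small made-up vector. The `content` values and the names `is_delim`, `article_no`, and `body` are only for illustration and are not taken from the question's data. It shows the intermediate values that the `mutate`, `filter`, and `summarise` steps above operate on.

```r
# Hypothetical content column holding two short articles
content <- c("======", "Daily Bugle Dec 31", "Wookies are", "ever before.",
             "======", "======", "Daily News Dec 30", "Some text", "======")

# TRUE on every delimiter row
is_delim <- grepl("===", content, fixed = TRUE)

# Running count of delimiters: rows of the same article share one number
article_no <- cumsum(is_delim)
print(article_no)
# [1] 1 1 1 1 2 3 3 3 4

# Drop the delimiters, split by article, skip each header line, join the rest
keep <- !is_delim
body <- vapply(split(content[keep], article_no[keep]),
               function(x) paste(x[-1], collapse = " "),
               character(1))
print(unname(body))
# [1] "Wookies are ever before." "Some text"
```

Because closing and opening delimiters both increment the counter, the surviving article numbers are not consecutive (1 and 3 here). They are still unique per article, which is all the grouping needs.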

CodePudding user response：

To help you, first I need to prepare some data.

library(tidyverse)
articles = read.table(
  header = TRUE,sep = ",",text="
content,publisher,date
======,NA,NA
Daily News Dec 27,Daily News,Dec 27
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Daily News Dec 28,Daily News,Dec 28
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Daily News Dec 30,Daily News,Dec 30
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Daily Bugle Dec 31,Daily Bugle,Dec 31
Wookies are,NA,NA
. recreationally,NA,NA
using drugs at a,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
======,NA,NA
======,NA,NA
Weekly News Dec 31,Weekly News,Dec 31
Wookies are,NA,NA
. recreationally,NA,NA
higher rate than,NA,NA
ever before.,NA,NA
======,NA,NA") %>%
  as_tibble() %>% 
  mutate(publisher = ifelse(publisher=="NA", NA, publisher),
         date = ifelse(date=="NA", NA, date))
articles

output

# A tibble: 52 x 3
   content           publisher  date  
   <chr>             <chr>      <chr> 
 1 ======            NA         NA    
 2 Daily News Dec 27 Daily News Dec 27
 3 Wookies are       NA         NA    
 4 . recreationally  NA         NA    
 5 using drugs at a  NA         NA    
 6 higher rate than  NA         NA    
 7 using drugs at a  NA         NA    
 8 higher rate than  NA         NA    
 9 using drugs at a  NA         NA    
10 higher rate than  NA         NA    
# ... with 42 more rows

I hope this matches the format of your data. As I read it, it contains five articles.
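
One detail about this setup, offered as an aside: `read.table` treats the string NA as missing by default (its `na.strings` argument defaults to "NA"), so the `ifelse` calls in the `mutate` above act as a safety net rather than doing the conversion themselves. A minimal sketch with a two-row hypothetical input (`tiny` is an illustrative name):

```r
# Two data rows in the same comma-separated layout as the articles text
tiny <- read.table(
  header = TRUE, sep = ",",
  text = "content,publisher,date\n======,NA,NA\nDaily News Dec 27,Daily News,Dec 27"
)

# The literal NA fields are already real missing values after parsing
print(is.na(tiny$publisher))
# [1]  TRUE FALSE
```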

Now let's add a conversion function and a simple mutate step.

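# Collapse one article (delimiter, header, body lines, delimiter) into one row
fConvert = function(data) tibble(
  publisher = data$publisher[2],   # row 2 is the header line
  date = data$date[2],
  # rows 3..(n-1) are the body; rows 1 and n are the '======' delimiters
  content = data %>% slice(3:(nrow(.)-1)) %>% 
    pull(content) %>% paste(collapse = " ")
)

articles %>% mutate(
  # count header rows, then shift up by one so the opening '======' joins its article
  idArticle = ifelse(!is.na(publisher),1, 0) %>% 
    cumsum() %>% lead(default=.[length(.)]) 
) %>% group_by(idArticle) %>% 
  nest() %>% 
  group_modify(~fConvert(.x$data[[1]]))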

output

# A tibble: 5 x 4
# Groups:   idArticle [5]
  idArticle publisher   date   content                                                                                            
      <dbl> <chr>       <chr>  <chr>                                                                                              
1         1 Daily News  Dec 27 Wookies are . recreationally using drugs at a higher rate than using drugs at a higher rate than u~
2         2 Daily News  Dec 28 Wookies are . recreationally using drugs at a higher rate than ever before. ever before. ever befo~
3         3 Daily News  Dec 30 Wookies are . recreationally using drugs at a higher rate than ever before. ever before.           
4         4 Daily Bugle Dec 31 Wookies are . recreationally using drugs at a higher rate than ever before.                        
5         5 Weekly News Dec 31 Wookies are . recreationally higher rate than ever before.

As you can see, this extracts all five articles despite their different lengths and joins each article's lines into a single content value. I hope that is what you meant.
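
The grouping in this answer depends on the `cumsum()` followed by `lead()` step, which may be easier to follow on a small example. The sketch below reproduces the idea in base R on a made-up `publisher` vector (the names `ids` and `shifted` are illustrative). Shifting the running count up by one row is what pulls each opening '======' row into the article it introduces.

```r
# Hypothetical publisher column for two articles, each laid out as:
# delimiter, header, body line, delimiter
publisher <- c(NA, "Daily News", NA, NA, NA, "Daily Bugle", NA, NA)

# Running count of header rows (the only rows that carry a publisher)
ids <- cumsum(!is.na(publisher))
print(ids)
# [1] 0 1 1 1 1 2 2 2

# Base R stand-in for dplyr::lead(): shift up by one, repeat the last value
shifted <- c(ids[-1], ids[length(ids)])
print(shifted)
# [1] 1 1 1 1 2 2 2 2
```

With the shifted ids, each group holds exactly one article in the order delimiter, header, body, delimiter, which is the layout `fConvert` assumes when it reads row 2 and slices from row 3 to the second-to-last row.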