How can I delete irregular chunks of words in R?

Time:12-02

Here is a reproducible example.

df2 <- data.frame(Num = c(1,2,3), Comment = c('nick       comment12021.12.01      nickn comment2222021.12.02       nickname333       commennnnt222021.12.01', 'nick       comment12021.12.01      nickn comment2222021.12.02       nickname333       commeeeent222021.12.01','nick       comment12021.12.01      nickn      comment2222021.12.02       nickname3333333       comment22021.12.01') )
Num           Comment
----------------------------------------------------------------------------
1      Tom    comment1~   Jay     comment2     Yun    comment 3 ~
2      Tim    comment1~   Cristal     comment2~      Lomio    comment3~
3      Tracer  comment1~   Teemo   comment2~      Irelia   comment3~
--------------------------------------------------------------------------

I have a data frame with 2 columns and many rows. The comments come from crawling a website. Because the site is very dynamic, I had no choice but to collect nicknames and comments from multiple people at once, so each row holds several nickname/comment pairs.

I want to remove the nicknames from these irregular chunks of text and create a word cloud from the comments alone, but I can't think of a way to delete only the nicknames. Nicknames and comments vary in length, so the methods I know don't work.

CodePudding user response:

If you have a fixed separator, such as the exactly seven spaces you mentioned in your comments (" {7}" as a regular expression), you can do the following:

dd <- data.frame(
  id = 1:3,
  comment = c(
    "Tom       comment1~       Jay       comment2~       Yun       comment3~",
    "Tim       comment1~       Cristal       comment2~       Lomio       comment3~",
    "Tracer       comment1~       Teemo       comment2~       Irelia       comment3~"
  )
)


extract_comments <- function(comments) {
  lapply(
    comments, 
    function(x) {
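      # Split on runs of exactly seven spaces; fields alternate nickname, comment
      sp <- strsplit(x, " {7}")[[1]]
      sp <- trimws(sp)
      # Odd positions hold nicknames; each comment sits right after its nickname
      ppl <- seq(1, length(sp), by = 2)
      data.frame(
        ex_person = sp[ppl],
        ex_comment = sp[ppl + 1]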
      )
    }
  )
}

dd$extracted <- extract_comments(dd$comment)

tidyr::unnest(dd, extracted)
#> # A tibble: 9 x 4
#>      id comment                             ex_person ex_comment
#>   <int> <chr>                               <chr>     <chr>     
#> 1     1 Tom       comment1~       Jay     ~ Tom       comment1~ 
#> 2     1 Tom       comment1~       Jay     ~ Jay       comment2~ 
#> 3     1 Tom       comment1~       Jay     ~ Yun       comment3~ 
#> 4     2 Tim       comment1~       Cristal ~ Tim       comment1~ 
#> 5     2 Tim       comment1~       Cristal ~ Cristal   comment2~ 
#> 6     2 Tim       comment1~       Cristal ~ Lomio     comment3~ 
#> 7     3 Tracer       comment1~       Teemo~ Tracer    comment1~ 
#> 8     3 Tracer       comment1~       Teemo~ Teemo     comment2~ 
#> 9     3 Tracer       comment1~       Teemo~ Irelia    comment3~ 
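
Since the goal in the question is a word cloud built from the comments alone, here is a minimal base R sketch that keeps only the comment fields. The helper name `comments_only` and the looser separator " {2,}" (two or more spaces) are assumptions for illustration, not part of the answer above. It also assumes the fields strictly alternate nickname, comment, and that neither contains two consecutive spaces. Where only a single space separates a nickname from its comment, as in part of the question's df2, this split would not work.

```r
# Hypothetical helper: return only the comment fields from each string.
# Assumes fields are separated by two or more spaces and alternate
# nickname, comment, nickname, comment, ...
comments_only <- function(comments, sep = " {2,}") {
  unlist(lapply(comments, function(x) {
    sp <- strsplit(trimws(x), sep)[[1]]
    # Nothing to extract if there is no nickname/comment pair
    if (length(sp) < 2) return(character(0))
    # Even positions hold the comments
    sp[seq(2, length(sp), by = 2)]
  }))
}

# Small made-up input with separators of varying width
txt <- c(
  "Tom       comment1~       Jay     comment2~",
  "Tracer   comment1~      Teemo       comment3~"
)

only <- comments_only(txt)

# Frequency table of the extracted comments, most frequent first
freq <- sort(table(only), decreasing = TRUE)
```

A frequency table like `freq` is the kind of input word cloud packages usually expect. For real comments you would probably split each comment into words before counting.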