How to group words into strings based on another column in R?

Time:01-31

I have a data set with Words and their Tags:

mydat <- structure(list(Word = c("Acanthosis", "nigricans", "AN", "skin", 
"condition", "hyperkeratosis", "of", "the", "skin", "AN", "obesity", 
"drug", "-", "induced", "AN", "AN", "malignant", "AN"), Tag = c("B", 
"I", "B", "B", "I", "B", "I", "I", "I", "B", "B", "B", "I", "I", 
"I", "B", "B", "I")), row.names = c(NA, -18L), class = c("data.table", 
"data.frame"))

> mydat
              Word Tag
 1:     Acanthosis   B
 2:      nigricans   I
 3:             AN   B
 4:           skin   B
 5:      condition   I
 6: hyperkeratosis   B
 7:             of   I
 8:            the   I
 9:           skin   I
10:             AN   B
11:        obesity   B
12:           drug   B
13:              -   I
14:        induced   I
15:             AN   I
16:             AN   B
17:      malignant   B
18:             AN   I

I would like to break up the column of Words into a vector of strings where each string starts with a B tag. The I tag signifies that it's still the same string. For example, given

      Acanthosis   B
       nigricans   I
              AN   B
            skin   B
       condition   I
              AN   B
         obesity   B
            drug   B
             ...   ...

Here, "Acanthosis nigricans", "AN", "skin condition", "AN", and "obesity" are strings because each starts with a word tagged B. If a string is more than one word long, it includes all the following words tagged I, up to the next B tag in the list.

Altogether, the desired output is:

> mystrings
[1] "Acanthosis nigricans"       "AN"                         "skin condition"            
[4] "hyperkeratosis of the skin" "AN"                         "obesity"                   
[7] "drug-induced AN"            "AN"                         "malignant AN" 

Is there a way to do this in R? One thought is to loop over each row and check the tags. However, this would be very inefficient if the dataset has many rows.

Answer:

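Create a grouping column by applying cumsum to the logical vector Tag == "B", then paste the words within each group. The gsub call removes the spaces around the hyphen so that "drug - induced" becomes "drug-induced":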

library(data.table)
out <- mydat[, gsub("\\s+-\\s+", "-", paste(Word, collapse = " ")),
  .(grp = cumsum(Tag == "B"))]

-output

> out$V1
[1] "Acanthosis nigricans"       "AN"                         "skin condition"             "hyperkeratosis of the skin" "AN"                        
[6] "obesity"                    "drug-induced AN"            "AN"                         "malignant AN"         
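
For anyone not using data.table, here is a minimal base R sketch of the same cumsum idea. It is only an illustration, not the answerer's code: the vector names words, tags and grp are made up for this example, and it assumes the first row is tagged B (a leading I would simply end up in a group numbered 0).

```r
# Same words and tags as in the question, as plain vectors
# (the names `words`, `tags`, and `grp` are illustrative only)
words <- c("Acanthosis", "nigricans", "AN", "skin", "condition",
           "hyperkeratosis", "of", "the", "skin", "AN", "obesity",
           "drug", "-", "induced", "AN", "AN", "malignant", "AN")
tags <- c("B", "I", "B", "B", "I", "B", "I", "I", "I", "B", "B",
          "B", "I", "I", "I", "B", "B", "I")

# tags == "B" is TRUE at the start of every string, so the running
# sum increases by one at each B and stays constant across the I rows
grp <- cumsum(tags == "B")

# Split the words by group id and paste each group into one string
mystrings <- unname(vapply(split(words, grp), paste, character(1),
                           collapse = " "))

# Close up the spaces around a standalone hyphen
mystrings <- gsub("\\s+-\\s+", "-", mystrings)
mystrings
```

Both versions are vectorised, so no explicit loop over rows is needed.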