I have a data set with Words and their Tags:
library(data.table)
# Rebuilt without the non-reproducible .internal.selfref pointer from dput()
mydat <- data.table(
  Word = c("Acanthosis", "nigricans", "AN", "skin",
    "condition", "hyperkeratosis", "of", "the", "skin", "AN", "obesity",
    "drug", "-", "induced", "AN", "AN", "malignant", "AN"),
  Tag = c("B", "I", "B", "B", "I", "B", "I", "I", "I", "B", "B", "B",
    "I", "I", "I", "B", "B", "I")
)
> mydat
Word Tag
1: Acanthosis B
2: nigricans I
3: AN B
4: skin B
5: condition I
6: hyperkeratosis B
7: of I
8: the I
9: skin I
10: AN B
11: obesity B
12: drug B
13: - I
14: induced I
15: AN I
16: AN B
17: malignant B
18: AN I
I would like to break up the column of Words into a vector of strings where each string starts with a B tag. The I tag signifies that it's still the same string. For example, given
Acanthosis B
nigricans I
AN B
skin B
condition I
AN B
obesity B
drug B
... ...
Acanthosis nigricans, AN, skin condition, AN, and obesity are strings because they each start with a word with a B tag. If the string is more than one word long, then I'll include all words with I tags until I reach the next B tag in the list.
Altogether, the desired output is:
> mystrings
[1] "Acanthosis nigricans" "AN" "skin condition"
[4] "hyperkeratosis of the skin" "AN" "obesity"
[7] "drug-induced AN" "AN" "malignant AN"
Is there a way to do this in R? One thought is to loop over each row and check the tags. However, this would be very inefficient if the dataset has many rows.
Answer:
Create a grouping column with cumsum on the logical vector Tag == "B", then paste the words within each group and remove the spaces around the hyphen:
library(data.table)
out <- mydat[, gsub("\\s+-\\s+", "-", paste(Word, collapse = " ")),
    .(grp = cumsum(Tag == "B"))]
Output:
> out$V1
[1] "Acanthosis nigricans" "AN" "skin condition" "hyperkeratosis of the skin" "AN"
[6] "obesity" "drug-induced AN" "AN" "malignant AN"
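To see why the cumsum grouping works, it may help to look at the intermediate group ids. The following is a minimal base R sketch of the same idea, assuming the data are held in two plain vectors; using tapply here is a substitute for the data.table grouping above, not part of the original answer.

```r
# Same sample data as in the question, as plain vectors.
Word <- c("Acanthosis", "nigricans", "AN", "skin", "condition",
          "hyperkeratosis", "of", "the", "skin", "AN", "obesity",
          "drug", "-", "induced", "AN", "AN", "malignant", "AN")
Tag  <- c("B", "I", "B", "B", "I", "B", "I", "I", "I", "B", "B",
          "B", "I", "I", "I", "B", "B", "I")

# Every "B" starts a new string, so the running count of "B"s is a group id:
# 1 1 2 3 3 4 4 4 4 5 6 7 7 7 7 8 9 9
grp <- cumsum(Tag == "B")

# Paste the words within each group; as.vector() drops the names and dim
# that tapply() attaches to its result.
mystrings <- as.vector(tapply(Word, grp, paste, collapse = " "))

# Close up the spaces around the standalone hyphen ("drug - induced").
mystrings <- gsub("\\s+-\\s+", "-", mystrings)

print(mystrings)
```

Because cumsum, tapply, and gsub are all vectorised, this avoids the explicit row-by-row loop mentioned in the question.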