I'm working with a lot of text files I have loaded into R, and I'm trying to replace every instance of the closing tag </SPEAKER> with a string found earlier in the same text file.
Example:
"<BOB> Lots of text here </SPEAKER> <HARRY> More text here by a different speaker </SPEAKER>"
I'd like to replace every instance of "</SPEAKER>" with the name from the opening tag that precedes it, say "<BOB>" or "<HARRY>", based on the NAME that has been found earlier, so I'd get this at the end:
"<BOB> Lots of text here </BOB> <HARRY> More text here by a different speaker </HARRY>"
I was thinking of looping through the vector text but as I only have limited experience with R, I wouldn't know how to tackle this.
If anyone has any suggestions for how to do this, possibly even outside of R using Notepad or another text/tag editor, I'd greatly appreciate any help.
Thanks!
CodePudding user response:
Match <, then word characters (capturing them in capture group 1), then >, then the shortest string (capturing it in capture group 2) up to </SPEAKER>. Then replace that match with <, capture group 1, >, capture group 2, and </ followed by capture group 1 and >.
This gives
x <- "<BOB> Lots of text here </SPEAKER> <HARRY> More text here by a different speaker </SPEAKER>"
gsub("<(\\w+)>(.*?)</SPEAKER>", "<\\1>\\2</\\1>", x)
## [1] "<BOB> Lots of text here </BOB> <HARRY> More text here by a different speaker </HARRY>"
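Since the question mentions many texts loaded into R, here is a minimal sketch of applying the same replacement to a whole character vector at once. gsub is vectorized, so no explicit loop is needed. The helper name close_speaker_tags and the sample texts are made up for illustration, and the perl = TRUE / (?s) variant is an assumption on my part, added so that a speaker turn spanning several lines is still matched.

```r
# Hypothetical helper (name is not from the original answer).
close_speaker_tags <- function(txt) {
  # perl = TRUE switches to PCRE; (?s) lets "." also match newlines,
  # so the text between <NAME> and </SPEAKER> may span several lines.
  gsub("(?s)<(\\w+)>(.*?)</SPEAKER>", "<\\1>\\2</\\1>", txt, perl = TRUE)
}

# Made-up sample input: one element per loaded text.
texts <- c(
  "<BOB> Lots of text here </SPEAKER> <HARRY> More text </SPEAKER>",
  "<ANNA> First line\nsecond line </SPEAKER>"
)

fixed <- close_speaker_tags(texts)
print(fixed)
```

If each file was read with readLines, it may be easier to collapse its lines into a single string first (for example with paste(lines, collapse = "\n")), so that an opening tag and its </SPEAKER> end up in the same element.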