Replacing strings in vector: Every instance replaced by previous found instance

Time:02-14

I'm working with a lot of text files I have loaded into R and I'm trying to replace every instance (or tag) of </SPEAKER> with a certain string found earlier in the text file.

Example: "<BOB> Lots of text here </SPEAKER> <HARRY> More text here by a different speaker </SPEAKER>"

I'd like to replace every instance of "</SPEAKER>" with a closing tag built from the speaker name found earlier in the text, such as "<BOB>" or "<HARRY>", so that I get this at the end:

"<BOB> Lots of text here </BOB> <HARRY> More text here by a different speaker </HARRY>"

I was thinking of looping through the text vector, but as I only have limited experience with R, I don't know how to tackle this.

If anyone has any suggestions for how to do this, possibly even outside of R using Notepad or another text/tag editor, I'd greatly appreciate any help.

Thanks!

CodePudding user response:

Match

  • <,
  • word characters (capturing them in capture group 1),
  • >,
  • the shortest string (capturing it in capture group 2) until
  • </SPEAKER>

and then replace that with the

  • <,
  • capture group 1,
  • >,
  • capture group 2 and
  • </ followed by
  • capture group 1 and
  • >

This gives

x <- "<BOB> Lots of text here </SPEAKER> <HARRY> More text here by a different speaker </SPEAKER>"

gsub("<(\\w+)>(.*?)</SPEAKER>", "<\\1>\\2</\\1>", x)
## [1] "<BOB> Lots of text here </BOB> <HARRY> More text here by a different speaker </HARRY>"
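
Regarding the idea of looping through the vector: gsub() is vectorized over a character vector, so the same call should work on many texts at once without an explicit loop. Below is a minimal sketch, assuming each element holds complete speaker turns. The `texts` and `file_lines` vectors and their contents are invented for illustration. If a turn spans several lines (for example after readLines()), one option is to join the lines first and use perl = TRUE with the (?s) flag so that . also matches newlines.

```r
# Hypothetical input: one element per document (contents invented for illustration)
texts <- c(
  "<BOB> Lots of text here </SPEAKER>",
  "<HARRY> More text here </SPEAKER> <BOB> And a reply </SPEAKER>"
)

# gsub() is vectorized over its input, so no explicit loop is needed
fixed <- gsub("<(\\w+)>(.*?)</SPEAKER>", "<\\1>\\2</\\1>", texts)

# Hypothetical lines of one file, where a speaker turn spans a line break
file_lines <- c("<BOB> Lots of text", "continued here </SPEAKER>")

# Join the lines into one string so a match can cross the line break;
# with perl = TRUE, the (?s) flag makes . match newlines as well
joined <- paste(file_lines, collapse = "\n")
fixed_joined <- gsub("(?s)<(\\w+)>(.*?)</SPEAKER>", "<\\1>\\2</\\1>",
                     joined, perl = TRUE)
```

Note that this approach assumes every opening tag consists only of word characters (letters, digits, underscore); names with spaces or hyphens would need a wider character class.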