extracting string between characters in a dataframe in r

Time:11-13

Hi,

I have to extract everything that's between "|" in a dataframe.

I don't think reproducible data is needed, but here are the first rows of the dataframe as an example:

Accession                      FASTA                                                                                                                                                                                               
  <chr>                          <chr>                                                                                                                                                                                               
1 tr|A0A1G4NSV4|A0A1G4NSV4_9FLOR MLNIRPDEISNIIRQQIEKYDQKVQVANVGTVLQVGDGIARVYGLDDVMAGELLEFEDKTIGVALNLESDNVGVVLMGNGRDILEGSSVRATGKIAQIPVGEKFLGRVVNPLAEPIDGKGEINTSDNRLIESSAPGIIGRQSVCEPLQTGITAIDSMIPIGRGQRELIIGDRQTGKTAVALDTIINQKGQDVICV~
2 tr|A0A1C9CHB7|A0A1C9CHB7_PALPL MGNTKVSRRFRAMSELVQDKNYNYTEAIELLRRSSSAKFVETAEAHIVLGLDPKYADQQLRSTVILPKGTGKLAKVAVITKGEKITEALSAGADLVGAEDVIEQILQGNIDFDKLIATPDIMPLIAKLGRVLGPRGLMPSPKAGTVTIDVGQAVQEFKLGKLEYRLDKTGIVHIPFGKVNFSKEDLAANLLAIKE~
3 tr|A0A1C9CHD7|A0A1C9CHD7_PALPL MPHFTLKVLWLENNIAIAIDQIVGKGTSPLTSYFFWPRNDAWEHLKSELESKPWILEIDRINLLNQATEVINYWQEEGKNNSITKAQLKFPDFLFSGSH                                                                                                 
4 tr|A0A6C0W2A1|A0A6C0W2A1_PALDE MALYNKKLSPIKKTEVLDYKDIDLLRKFITEQGKILPRRSTGLTSKQQKKLTKAIKQARILALLPFLNKD                                                                                                                              
5 tr|R7QB42|R7QB42_CHOCR         MAFISFPSTFIGTNVKAASFSRRSRSAVRTTPIASAVPRNANLKKLQAGYLFPEIGRRRRAYLEQNPGADIISLGVGDTTMPIPEHICSGLVGGASKLGTEEGYSGYGAEQGMGPLREKIAQVLYKGTVKSDEVFVSDGAKCDISRLQQVFGATATVAVQDPSYPVYVDTSVMMGQTGLYDESKGQFEGIQYMQC~
6 tr|A0A3G1I907|A0A3G1I907_9FLOR MIKKGDVVKITRKESYWYQENGTVIKVESEIKYPVLVRFEKEAYNGVNSNNFAEDEVVVLK                                                                                                                                       

How do I do that?

CodePudding user response:

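Assuming each row has an | in it. Note that strsplit() treats its split argument as a regular expression by default, where a bare "|" means alternation and splits the string into single characters, so the pipe has to be matched literally with fixed = TRUE. sapply() returns a character vector rather than a list:

sapply(strsplit(df$Accession, "|", fixed = TRUE), "[[", 2)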
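A small sketch of the same split-based idea on a toy data frame, built from the first three accessions in the question. The name `df` follows the answer above; the new column `id` is an assumption made for illustration. The pipe is matched literally with `fixed = TRUE`, and `vapply()` is used so the result is a plain character vector that can be stored as a column.

```r
# Toy data: the first three accessions from the question
df <- data.frame(
  Accession = c("tr|A0A1G4NSV4|A0A1G4NSV4_9FLOR",
                "tr|A0A1C9CHB7|A0A1C9CHB7_PALPL",
                "tr|A0A1C9CHD7|A0A1C9CHD7_PALPL"),
  stringsAsFactors = FALSE
)

# Split on a literal "|" (fixed = TRUE turns off regex interpretation)
parts <- strsplit(df$Accession, "|", fixed = TRUE)

# Take the second piece of each split; p[2] is NA if a row has no "|"
df$id <- vapply(parts, function(p) p[2], character(1))

df$id
# [1] "A0A1G4NSV4" "A0A1C9CHB7" "A0A1C9CHD7"
```

Using `p[2]` instead of `"[["` means a row without any | yields `NA` rather than an error.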

CodePudding user response:

This might also help you. I only used a single string assuming you know how to apply the code on your data set:

  • (?<=\\|) is a positive look-behind: the desired string must be preceded by a literal |
  • (?=\\|) is a positive look-ahead: the desired string must be followed by a literal |. Neither of these two | characters is included in the match.
  • [^|]* matches any character other than a literal |, zero or more times.
vec <- c("tr|A0A1G4NSV4|A0A1G4NSV4_9FLOR")

regmatches(vec, regexpr("(?<=\\|)[^|]*(?=\\|)", vec, perl = TRUE))
[1] "A0A1G4NSV4"
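To apply the same pattern to a whole column, one caveat is worth keeping in mind: `regmatches()` silently drops elements that have no match, so its result can be shorter than the input. A minimal sketch that keeps the lengths aligned is shown here; the vector `accession` stands in for `df$Accession`, and its third element is a made-up value with no pipes, added only to show the no-match case.

```r
# Two accessions from the question plus one hypothetical entry without pipes
accession <- c("tr|A0A1G4NSV4|A0A1G4NSV4_9FLOR",
               "tr|R7QB42|R7QB42_CHOCR",
               "no_pipes_here")

pattern <- "(?<=\\|)[^|]*(?=\\|)"

# regexpr() returns -1 for elements where the pattern does not match
m <- regexpr(pattern, accession, perl = TRUE)

# Pre-fill with NA, then write matches only into the positions that matched
id <- rep(NA_character_, length(accession))
id[m != -1] <- regmatches(accession, m)

id
# [1] "A0A1G4NSV4" "R7QB42"     NA
```

The resulting vector has one entry per input row, so it can be assigned directly to a new data frame column.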