Hi,
I have to extract everything that's between the "|" characters in a column of a dataframe.
I don't think reproducible data is needed, but here are the first rows of the dataframe as an example:
Accession FASTA
<chr> <chr>
1 tr|A0A1G4NSV4|A0A1G4NSV4_9FLOR MLNIRPDEISNIIRQQIEKYDQKVQVANVGTVLQVGDGIARVYGLDDVMAGELLEFEDKTIGVALNLESDNVGVVLMGNGRDILEGSSVRATGKIAQIPVGEKFLGRVVNPLAEPIDGKGEINTSDNRLIESSAPGIIGRQSVCEPLQTGITAIDSMIPIGRGQRELIIGDRQTGKTAVALDTIINQKGQDVICV~
2 tr|A0A1C9CHB7|A0A1C9CHB7_PALPL MGNTKVSRRFRAMSELVQDKNYNYTEAIELLRRSSSAKFVETAEAHIVLGLDPKYADQQLRSTVILPKGTGKLAKVAVITKGEKITEALSAGADLVGAEDVIEQILQGNIDFDKLIATPDIMPLIAKLGRVLGPRGLMPSPKAGTVTIDVGQAVQEFKLGKLEYRLDKTGIVHIPFGKVNFSKEDLAANLLAIKE~
3 tr|A0A1C9CHD7|A0A1C9CHD7_PALPL MPHFTLKVLWLENNIAIAIDQIVGKGTSPLTSYFFWPRNDAWEHLKSELESKPWILEIDRINLLNQATEVINYWQEEGKNNSITKAQLKFPDFLFSGSH
4 tr|A0A6C0W2A1|A0A6C0W2A1_PALDE MALYNKKLSPIKKTEVLDYKDIDLLRKFITEQGKILPRRSTGLTSKQQKKLTKAIKQARILALLPFLNKD
5 tr|R7QB42|R7QB42_CHOCR MAFISFPSTFIGTNVKAASFSRRSRSAVRTTPIASAVPRNANLKKLQAGYLFPEIGRRRRAYLEQNPGADIISLGVGDTTMPIPEHICSGLVGGASKLGTEEGYSGYGAEQGMGPLREKIAQVLYKGTVKSDEVFVSDGAKCDISRLQQVFGATATVAVQDPSYPVYVDTSVMMGQTGLYDESKGQFEGIQYMQC~
6 tr|A0A3G1I907|A0A3G1I907_9FLOR MIKKGDVVKITRKESYWYQENGTVIKVESEIKYPVLVRFEKEAYNGVNSNNFAEDEVVVLK
How do I do that?
CodePudding user response:
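Assuming each row has a | in it. Note that strsplit() treats the separator as a regular expression by default, where | means alternation, so it has to be matched literally with fixed = TRUE:
sapply(strsplit(df$Accession, "|", fixed = TRUE), "[[", 2)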
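As a minimal sketch of the splitting approach: strsplit() interprets its separator as a regular expression unless told otherwise, so a bare "|" needs fixed = TRUE (or "\\|") to split on the literal character. The two-row data frame below is a hypothetical stand-in for the real data, and vapply() is used here only so the result is a plain character vector instead of a list.

```r
# Hypothetical two-row data frame standing in for the question's data
df <- data.frame(
  Accession = c("tr|A0A1G4NSV4|A0A1G4NSV4_9FLOR", "tr|R7QB42|R7QB42_CHOCR"),
  stringsAsFactors = FALSE
)

# fixed = TRUE makes "|" a literal separator rather than regex alternation
parts <- strsplit(df$Accession, "|", fixed = TRUE)

# Take the second piece of every row; rows with fewer than two pieces yield NA
ids <- vapply(parts, `[`, character(1), 2)
ids
# [1] "A0A1G4NSV4" "R7QB42"
```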
CodePudding user response:
This might also help you. I only used a single string, assuming you know how to apply the code to your data set:
(?<=\\|) is a positive look-behind, meaning the desired string must be preceded by a literal |
(?=\\|) is a positive look-ahead, meaning the desired string must be followed by a literal |
Neither of these two | characters is included in the match. Between them, [^|]* matches any character other than a literal |, zero or more times.
vec <- c("tr|A0A1G4NSV4|A0A1G4NSV4_9FLOR")
regmatches(vec, regexpr("(?<=\\|)[^|]*(?=\\|)", vec, perl = TRUE))
[1] "A0A1G4NSV4"
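To apply the same pattern to a whole column, one thing to watch is that regmatches() silently drops elements that did not match, so the result can be shorter than the column. The sketch below keeps rows aligned by filling only the matching positions; the data frame, the "no_pipes_here" row, and the id column name are all hypothetical and only there for illustration.

```r
# Hypothetical data frame mirroring the question's Accession column,
# plus one row without any "|" to show how non-matches are handled
df <- data.frame(
  Accession = c("tr|A0A1G4NSV4|A0A1G4NSV4_9FLOR",
                "tr|R7QB42|R7QB42_CHOCR",
                "no_pipes_here"),
  stringsAsFactors = FALSE
)

pattern <- "(?<=\\|)[^|]*(?=\\|)"

# regexpr() returns -1 for elements where the pattern was not found
m <- regexpr(pattern, df$Accession, perl = TRUE)

# Start with NA everywhere, then fill in only the rows that matched
df$id <- NA_character_
df$id[m != -1] <- regmatches(df$Accession, m)
df$id
# [1] "A0A1G4NSV4" "R7QB42"     NA
```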