I have a column in a dataframe that lists DNA sequences, and I would like to do the following two things. Below is an example of the data set:
import pandas as pd
d = [['ampC','tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc'], ['yifL','acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat'],['glyW','tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg']]
df = pd.DataFrame(d, columns = ['gene','Sequence'])
gene | Sequence |
---|---|
ampC | tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc |
yifL | acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat |
glyW | tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg |
- Extract the capital letter and everything before it. With
str.extract(r"(.*?)[A-Z]", expand=True)
I can get everything before the capital letter but I need help figuring out how to get the capital letter as well.
Example of what I'm trying to get for ampC: tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcA
- Extract the 16th letter before the capital letter.
Example of what I'm trying to get for the following 3 genes:
gene | letter |
---|---|
ampC | c |
yifL | g |
glyW | t |
[c, g, t]
Thank you in advance for all your help. Sorry if a question like this was asked before; I couldn't find a solution anywhere.
CodePudding user response:
You may try:
df["SubSequence"] = df["Sequence"].str.extract(r'^(.*?[A-Z])')
df["letter"] = df["Sequence"].str.extract(r'^[acgt]*([acgt])[acgt]{15}[A-Z]')
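For anyone who wants to see why these two patterns work, here is a small self-contained sketch that runs them on the question's sample data. The only change from the lines above is `expand=False`, which I added so that `str.extract` returns a Series rather than a one-column DataFrame; the comments are my own reading of the patterns.

```python
import pandas as pd

# Sample data copied from the question.
d = [
    ["ampC", "tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc"],
    ["yifL", "acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat"],
    ["glyW", "tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg"],
]
df = pd.DataFrame(d, columns=["gene", "Sequence"])

# ^(.*?[A-Z]): lazily match from the start of the string up to and
# including the first capital letter, and capture all of it.
df["SubSequence"] = df["Sequence"].str.extract(r"^(.*?[A-Z])", expand=False)

# ([acgt]) captures a single lowercase base; [acgt]{15} then requires exactly
# 15 more lowercase bases before the capital, so the captured base is the
# 16th letter before the capital.
df["letter"] = df["Sequence"].str.extract(
    r"^[acgt]*([acgt])[acgt]{15}[A-Z]", expand=False
)

print(df[["gene", "letter"]])
```

Note that the second pattern assumes everything before the capital is one of a, c, g, t; other characters (for example n) would need to be added to the character class.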
CodePudding user response:
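Your regular expression is almost what you need. Just move [A-Z] inside the capture group so the capital letter is returned as well. Then the 16th letter before the capital is the character at index -17 of that substring (index -1 is the capital itself). Try with: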
df["substring"] = df["Sequence"].str.extract(r"(.*?[A-Z])")[0]
df["letter"] = df["substring"].str[-17]
>>> df[["gene", "letter"]]
gene letter
0 ampC c
1 yifL g
2 glyW t
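One thing worth checking on real data is what happens when fewer than 16 letters come before the capital. The sketch below uses two made-up sequences (not from the question) to compare the indexing approach with the regex approach from the other answer. As far as I know both give a missing value for the short row rather than raising an error, but please verify on your own pandas version.

```python
import pandas as pd

# Hypothetical sequences for illustration only: the first has 20 bases
# before the capital, the second has only 4.
s = pd.Series(["acgtacgtacgtacgtacgtAcc", "acgtAcc"])

# Approach 1: take everything up to the capital, then index from the end.
prefix = s.str.extract(r"(.*?[A-Z])", expand=False)
by_index = prefix.str[-17]

# Approach 2: capture the 16th base before the capital directly.
by_regex = s.str.extract(r"^[acgt]*([acgt])[acgt]{15}[A-Z]", expand=False)

print(pd.DataFrame({"by_index": by_index, "by_regex": by_regex}))
```

If missing values are not wanted downstream, something like `dropna()` or `fillna("")` on the result column would be one way to handle them.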