How do I extract the letter n positions before a specific pattern in a data frame in Python?

I have a column in a dataframe that lists DNA sequences, and I would like to do the following two things. Below is an example of the data set:

import pandas as pd
d = [['ampC','tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc'], ['yifL','acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat'],['glyW','tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg']]
df = pd.DataFrame(d, columns = ['gene','Sequence'])

gene	Sequence
ampC	tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc
yifL	acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat
glyW	tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg

1. Extract the capital letter and everything before it. With str.extract(r"(.*?)[A-Z]", expand=True) I can get everything before the capital letter, but I need help figuring out how to get the capital letter as well.

Example of what I'm trying to get for ampC: tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcA

2. Extract the 16th letter before the capital letter.

Example of what I'm trying to get for the following 3 genes:

gene	letter
ampC	c
yifL	g
glyW	t

[c, g, t]

Thank you in advance for all your help. Sorry if a question like this has been asked before; I couldn't find a solution anywhere.

CodePudding user response：

You may try:

df["SubSequence"] = df["Sequence"].str.extract(r'^(.*?[A-Z])')
df["letter"] = df["Sequence"].str.extract(r'^[acgt]*([acgt])[acgt]{15}[A-Z]')
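For reference, here is a self-contained sketch that runs these two patterns on the sample data from the question. The expand=False argument is an addition made here (an assumption, not part of the answer) so that each call returns a Series instead of a one-column DataFrame.

```python
import pandas as pd

# Sample data copied from the question.
d = [
    ["ampC", "tacggtctggctgctatcctgacagttgtcacgctgattggtgtcgttacaatctaacgcAtcgccaatgtaaatccggcccgcc"],
    ["yifL", "acttcataaagagtcgctaaacgcttgcttttacgtcttctcctgcgatgatagaaagcaGaaagcgatgaactttacaggcaat"],
    ["glyW", "tcaaaagtggtgaaaaatatcgttgactcatcgcgccaggtaagtagaatgcaacgcatcGaacggcggcactgattgccagacg"],
]
df = pd.DataFrame(d, columns=["gene", "Sequence"])

# Lazy match from the start of the string up to and including the first capital letter.
df["SubSequence"] = df["Sequence"].str.extract(r"^(.*?[A-Z])", expand=False)

# Capture one lowercase base that is followed by exactly 15 more bases and then
# the capital letter, i.e. the base 16 positions before the capital.
df["letter"] = df["Sequence"].str.extract(
    r"^[acgt]*([acgt])[acgt]{15}[A-Z]", expand=False
)

print(df[["gene", "letter"]])
```

The second pattern assumes everything before the capital letter is a lowercase a, c, g, or t; a sequence containing any other character there would not match.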

CodePudding user response：

Your regular expression is almost what you need. Just move the capital-letter class [A-Z] inside the capturing group. Try with:

df["substring"] = df["Sequence"].str.extract(r"(.*?[A-Z])")[0]
# The capital is the last character of the substring (index -1), so the 16th letter before it is at index -17.
df["letter"] = df["substring"].str[-17]

>>> df[["gene", "letter"]]
   gene letter
0  ampC      c
1  yifL      g
2  glyW      t
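
Since the title asks about a letter n positions before the pattern, the slicing idea can be parameterized. This is only a minimal sketch: letter_before_capital and n are hypothetical names that do not appear in either answer, the two short sequences are made up for illustration, and the function assumes every sequence contains a capital letter with at least n characters before it.

```python
import pandas as pd


def letter_before_capital(sequences: pd.Series, n: int) -> pd.Series:
    """Return the character n positions before the first capital letter."""
    # Prefix up to and including the first capital letter.
    prefix = sequences.str.extract(r"^(.*?[A-Z])", expand=False)
    # The capital sits at index -1, so the character n positions earlier is at -(n + 1).
    return prefix.str[-(n + 1)]


# Made-up short sequences, used only to keep the example readable.
seqs = pd.Series(["aaccggttAcc", "gattacaGtt"])
print(letter_before_capital(seqs, 3).tolist())  # ['g', 'a']
```

With the data frame from the question, letter_before_capital(df["Sequence"], 16) would correspond to the str[-17] lookup used above.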