I want to remove the first letter if it is a C and the last letter if it is an F or a W. But when I use:
df1['trimmed_seq'] = df1['seq'].str.strip("CFW")
Input:
seq
0 CASSAQGTGDRGYTF
1 CASSLVATGNTGELFF
2 CASSKGTVSGLSG
3 CALKVGADTQYF
4 CASSLWASGRGGTGELFF
5 CASSLLGWEQLDEQFF
6 CASSSGTGVYGYTF
7 CASSPLEWEGVTEAFF
8 CASSFWSSGRGGTDTQYF
9 CASSAGQGASDEQFF
Output:
seq
0 ASSAQGTGDRGYT
1 ASSLVATGNTGEL
2 ASSKGTVSGLSG
3 ALKVGADTQY
4 ASSLWASGRGGTGEL
5 ASSLLGWEQLDEQ
6 ASSSGTGVYGYT
7 ASSPLEWEGVTEA
8 ASSFWSSGRGGTDTQY
9 ASSAGQGASDEQ
The problem is that, for example, in row 1 both F's at the end are removed, and if a sequence ended with CFW, all three letters would be removed.
So my question is: can this be solved somehow using the same str.strip function?
CodePudding user response:
This is not possible with strip, because it has no notion of a maximum number of characters to remove. Instead, I would use replace with a regex that removes an optional prefix and an optional suffix:
df['seq'].str.replace(r'^C?(.*?)[FW]?$', r'\1', regex=True)
It gives as expected:
0 ASSAQGTGDRGYT
1 ASSLVATGNTGELF
2 ASSKGTVSGLSG
3 ALKVGADTQY
4 ASSLWASGRGGTGELF
5 ASSLLGWEQLDEQF
6 ASSSGTGVYGYT
7 ASSPLEWEGVTEAF
8 ASSFWSSGRGGTDTQY
9 ASSAGQGASDEQF
Name: seq, dtype: object
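To see the difference side by side, here is a minimal sketch that runs both calls on three of the sequences from the question. It assumes a recent pandas release, where `regex=True` has to be passed explicitly to `str.replace`; the variable names `stripped` and `trimmed` are only illustrative.

```python
import pandas as pd

# Three of the sample sequences from the question
df = pd.DataFrame({"seq": ["CASSAQGTGDRGYTF", "CASSLVATGNTGELFF", "CASSKGTVSGLSG"]})

# str.strip treats "CFW" as a set of characters and removes every
# leading and trailing occurrence of any of them
stripped = df["seq"].str.strip("CFW")

# The regex removes at most one leading C and at most one trailing F or W;
# the lazy group (.*?) keeps everything in between
trimmed = df["seq"].str.replace(r"^C?(.*?)[FW]?$", r"\1", regex=True)

print(stripped.tolist())
print(trimmed.tolist())
```

With strip, row 1 loses both trailing F's; with the regex, it loses only the last one.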
CodePudding user response:
You can use loc to select the matching rows and .str to slice the strings:
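# Drop the first character where the sequence starts with C
mask = (df.seq.str[0] == 'C')
df.loc[mask, "seq"] = df.loc[mask, "seq"].str[1:]
# Drop the last character where the sequence ends with F or W
mask = (df.seq.str[-1] == 'F') | (df.seq.str[-1] == 'W')
df.loc[mask, "seq"] = df.loc[mask, "seq"].str[:-1]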
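The mask approach can also be written so that it leaves the original column alone and stores the result in a new one, as the question does with `trimmed_seq`. This is only a sketch under a few assumptions: the `"CFW"` row is a made-up edge case based on the one mentioned in the question, and `isin` is swapped in for the two `==` comparisons.

```python
import pandas as pd

# Two sequences from the question plus a hypothetical "CFW" edge case
df = pd.DataFrame({"seq": ["CASSLVATGNTGELFF", "CASSKGTVSGLSG", "CFW"]})

# Work on a copy so the original "seq" column stays untouched
trimmed = df["seq"].copy()

# Remove exactly one leading C where present
starts_with_c = trimmed.str[0] == "C"
trimmed.loc[starts_with_c] = trimmed.loc[starts_with_c].str[1:]

# Remove exactly one trailing F or W where present
ends_with_fw = trimmed.str[-1].isin(["F", "W"])
trimmed.loc[ends_with_fw] = trimmed.loc[ends_with_fw].str[:-1]

df["trimmed_seq"] = trimmed
print(df)
```

Because each mask is applied once, a sequence such as `"CFW"` loses only its first and last letter and ends up as `"F"`.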