Remove the first letter (if C) and the last letter (if F or W)

I want to remove the first letter if it is a C and the last letter if it is an F or a W. This is what I tried:

df1['trimmed_seq'] = df1['seq'].str.strip("CFW")

Input:

    seq
0   CASSAQGTGDRGYTF
1   CASSLVATGNTGELFF
2   CASSKGTVSGLSG
3   CALKVGADTQYF
4   CASSLWASGRGGTGELFF
5   CASSLLGWEQLDEQFF
6   CASSSGTGVYGYTF
7   CASSPLEWEGVTEAFF
8   CASSFWSSGRGGTDTQYF
9   CASSAGQGASDEQFF

Output:

    seq
0   ASSAQGTGDRGYT
1   ASSLVATGNTGEL
2   ASSKGTVSGLSG
3   ALKVGADTQY
4   ASSLWASGRGGTGEL
5   ASSLLGWEQLDEQ
6   ASSSGTGVYGYT
7   ASSPLEWEGVTEA
8   ASSFWSSGRGGTDTQY
9   ASSAGQGASDEQ

The problem is that, for example, in row 1 both F's at the end are removed, and if a sequence ended with CFW, all three letters would be removed.

So my question is: can this be solved somehow using the same str.strip function?
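
For reference, the behaviour can be reproduced with plain Python strings, since the pandas .str.strip method follows the same semantics: the argument is treated as a set of characters, and every leading and trailing character in that set is removed. A minimal sketch (the second string is a made-up example, not taken from the data above):

```python
# str.strip treats "CFW" as a set of characters, not as a prefix or suffix,
# and keeps removing from both ends until it hits a character outside the set.
seq = "CASSLVATGNTGELFF"
stripped = seq.strip("CFW")
print(stripped)  # ASSLVATGNTGEL (both trailing F's are gone)

# Hypothetical sequence: the leading W is stripped as well,
# although only a C should be removed at the start.
made_up = "WCASSF"
stripped_made_up = made_up.strip("CFW")
print(stripped_made_up)  # ASS
```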

CodePudding user response:

This is not possible with strip, because it has no notion of a maximum number of characters to remove. Instead, I would use replace with a regex that removes an optional prefix and an optional suffix:

df['seq'].str.replace(r'^C?(.*?)[FW]?$', r'\1', regex=True)

This gives the expected result:

0       ASSAQGTGDRGYT
1      ASSLVATGNTGELF
2        ASSKGTVSGLSG
3          ALKVGADTQY
4    ASSLWASGRGGTGELF
5      ASSLLGWEQLDEQF
6        ASSSGTGVYGYT
7      ASSPLEWEGVTEAF
8    ASSFWSSGRGGTDTQY
9       ASSAGQGASDEQF
Name: seq, dtype: object
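
As a self-contained sketch of the regex approach, the snippet below rebuilds the first four rows of the question's data and applies the pattern with regex=True spelled out, since recent pandas versions no longer treat the pattern as a regex by default. The variable names (seqs, trimmed, trimmed_alt) are only for illustration. The second pattern is an alternative formulation that should give the same result for single-line strings like these: it deletes a leading C or a single trailing F or W instead of capturing the middle.

```python
import pandas as pd

seqs = pd.Series(
    ["CASSAQGTGDRGYTF", "CASSLVATGNTGELFF", "CASSKGTVSGLSG", "CALKVGADTQYF"],
    name="seq",
)

# Optional leading C, lazily captured middle, optional single trailing F or W.
trimmed = seqs.str.replace(r"^C?(.*?)[FW]?$", r"\1", regex=True)

# Alternative: delete a leading C or one trailing F/W directly.
trimmed_alt = seqs.str.replace(r"^C|[FW]$", "", regex=True)

print(trimmed.tolist())
print(trimmed_alt.tolist())
```

Only one trailing letter is removed in either case, so row 1 keeps its second-to-last F.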

CodePudding user response:

You can use .loc with a boolean mask to select the rows that need trimming, and the .str accessor to slice the strings:

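# Drop the first character where the sequence starts with C
mask = (df.seq.str[0] == 'C')
df.loc[mask, "seq"] = df.loc[mask, "seq"].str[1:]

# Drop the last character where the sequence ends with F or W
mask = (df.seq.str[-1] == 'F') | (df.seq.str[-1] == 'W')
df.loc[mask, "seq"] = df.loc[mask, "seq"].str[:-1]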
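
Note that the mask approach overwrites the seq column in place, while the question stores the result in a new trimmed_seq column. A minimal sketch of a variant that leaves seq untouched, assuming a small DataFrame built from a few of the question's rows plus one made-up sequence ("WCFW") to show that only one letter per end is affected:

```python
import pandas as pd

df = pd.DataFrame(
    {"seq": ["CASSAQGTGDRGYTF", "CASSLVATGNTGELFF", "CASSKGTVSGLSG", "WCFW"]}
)

# Work on a copy so the original column is preserved.
trimmed = df["seq"].copy()

# Drop the first character only where it is a C.
starts_with_c = trimmed.str[0] == "C"
trimmed.loc[starts_with_c] = trimmed.loc[starts_with_c].str[1:]

# Drop the last character only where it is an F or a W.
ends_with_fw = trimmed.str[-1].isin(["F", "W"])
trimmed.loc[ends_with_fw] = trimmed.loc[ends_with_fw].str[:-1]

df["trimmed_seq"] = trimmed
print(df)
```

The second mask is computed after the first trim, which matters for very short strings but makes no difference for sequences like the ones above.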