How to replace multiple complex characters in two columns in pandas?

I would like to replace certain target strings in two particular columns. Here is my data:

import pandas as pd

classes = [('2.7.2.3', 'a primary alcohol', 'an aldehyde'),
           ('2.7.1.3', 'a secondary alcohol', 'a ketone'),
           ('3.1.1.3', 'an aldehyde + NADP(+)', 'a 3-oxoacyl-[ACP] + NADPH'),
           ('3.1.1.3', '3-oxoacyl-[ACP] + NAD(+)', '2,3-dioxo-L-gulonate + NADH'),
           ('2.7.2.3', 'D-ribitol 5-phosphate + NADP(+)', 'a primary alcohol + H(+)'),
           ('1.7.99.4', '2,3-dioxo-L-gulonate + NAD(+)', 'D-ribulose 5-phosphate + NADH'),
           ('1.1.1.304', 'L-iditol + NAD(+)', ' H(+) + keto-L-sorbose + NADH'),
           ('2.7.4.3', 'H2O', 'oxidized coenzyme F420-1'),
           ('4.1.1.68', 'myo-inositol + NAD(+)', ' H(+) + NADH + a secondary alcohol')]
labels = ['Ko_EC','From', 'to']
alls = pd.DataFrame.from_records(classes, columns=labels)

I want to remove every occurrence of a few specific targets, namely S = ['H2O','NADP(+)','NADPH','NAD(+)', 'NADH', 'H(+)']. My code is:

alls['From'] = alls['From'].str.replace("+", "")
alls['to'] = alls['to'].str.replace("+", "")

S = ['H2O','NADP()','NADPH','NAD()', 'NADH', 'H()']

alls

However, it reported re.error: nothing to repeat at position 2.
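
For context, "nothing to repeat" is the message Python's re module gives when a quantifier such as + has nothing before it to repeat. That is what happens if a target like H(+) reaches a regex-based replace unescaped. A minimal sketch with plain re (no pandas), assuming the targets contain a literal +; the helper name compiles is made up for illustration, and the reported position depends on where the + sits in the pattern:

```python
import re

def compiles(pattern):
    """Return (True, None) if the pattern compiles, else (False, error message)."""
    try:
        re.compile(pattern)
        return True, None
    except re.error as exc:
        return False, str(exc)

# Unescaped: "(" opens a group, so the "+" right after it has nothing to repeat.
ok_raw, msg_raw = compiles("H(+)")
print(ok_raw, msg_raw)

# Escaped: re.escape turns the metacharacters into literal characters.
ok_escaped, _ = compiles(re.escape("H(+)"))
cleaned = re.sub(re.escape("H(+)"), "", "a primary alcohol + H(+)")
print(ok_escaped, repr(cleaned))
```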

The expected result, with every target in the S list removed, is:

       Ko_EC                           From                               to
0    2.7.2.3              a primary alcohol                      an aldehyde
1    2.7.1.3            a secondary alcohol                         a ketone
2    3.1.1.3                    an aldehyde                a 3-oxoacyl-[ACP]
3    3.1.1.3                3-oxoacyl-[ACP]             2,3-dioxo-L-gulonate 
4    2.7.2.3          D-ribitol 5-phosphate                a primary alcohol  
5   1.7.99.4           2,3-dioxo-L-gulonate           D-ribulose 5-phosphate
6  1.1.1.304                       L-iditol                   keto-L-sorbose  
7    2.7.4.3                                        oxidized coenzyme F420-1
8   4.1.1.68                   myo-inositol              a secondary alcohol

CodePudding user response：

For the strings in the list, you can split each cell on whitespace and drop the tokens that appear in S, using a small helper function with apply:

S = ['H2O','NADP()','NADPH','NAD()', 'NADH', 'H()']

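def list_remove(x):
    # split on whitespace, drop the tokens found in S, rejoin with single spaces
    return ' '.join(el for el in x.split() if el not in S)

# apply accepts the function directly, so no lambda wrapper is needed
alls['From'] = alls['From'].apply(list_remove)
alls['to'] = alls['to'].apply(list_remove)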
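
A small self-contained check of the token-filtering idea on plain strings, assuming each target stands alone as a whitespace-separated token (for example NADP() once the separators around it have been stripped). The sample strings are made up for illustration. Note that it only removes whole tokens, so a target glued to other text is left untouched:

```python
S = {'H2O', 'NADP()', 'NADPH', 'NAD()', 'NADH', 'H()'}  # a set gives fast membership tests

def list_remove(x):
    # split on any run of whitespace, drop tokens found in S, rejoin with single spaces
    return ' '.join(tok for tok in x.split() if tok not in S)

print(repr(list_remove('an aldehyde  NADP()')))
print(repr(list_remove(' H()  NADH  a secondary alcohol')))
print(repr(list_remove('H2O')))
print(repr(list_remove('xNADH')))  # not a whole token, so it is kept
```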

CodePudding user response：

import re

S = ['H2O','NADP(+)','NADPH','NAD(+)', 'NADH', 'H(+)', '+']

# escape the regex special characters in the S list,
# then join the escaped strings with | into one alternation pattern for replace
pattern = '|'.join(map(re.escape, S))

alls['From'] = alls['From'].str.replace(pattern, "", regex=True)
alls['to'] = alls['to'].str.replace(pattern, "", regex=True)

alls

    Ko_EC       From                     to
0   2.7.2.3     a primary alcohol        an aldehyde
1   2.7.1.3     a secondary alcohol      a ketone
2   3.1.1.3     an aldehyde              a 3-oxoacyl-[ACP]
3   3.1.1.3     3-oxoacyl-[ACP]          2,3-dioxo-L-gulonate
4   2.7.2.3     D-ribitol 5-phosphate    a primary alcohol
5   1.7.99.4    2,3-dioxo-L-gulonate     D-ribulose 5-phosphate
6   1.1.1.304   L-iditol                 keto-L-sorbose
7   2.7.4.3                              oxidized coenzyme F420-1
8   4.1.1.68    myo-inositol             a secondary alcohol
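
Removing targets from the middle or end of a string can leave stray spaces behind. A hedged sketch of the same escape-and-join approach with a whitespace cleanup added, assuming pandas is installed and the targets contain literal + characters; the two-row frame and the clean helper are made up for illustration:

```python
import re
import pandas as pd

df = pd.DataFrame({
    'From': ['an aldehyde + NADP(+)', 'H2O'],
    'to': ['a 3-oxoacyl-[ACP] + NADPH', ' H(+) + NADH + a secondary alcohol'],
})

S = ['H2O', 'NADP(+)', 'NADPH', 'NAD(+)', 'NADH', 'H(+)', '+']
# Longest targets first, as a precaution so a shorter alternative never wins over a longer one.
pattern = '|'.join(re.escape(s) for s in sorted(S, key=len, reverse=True))

def clean(col):
    # remove every target, collapse runs of whitespace, then trim both ends
    return (col.str.replace(pattern, '', regex=True)
               .str.replace(r'\s+', ' ', regex=True)
               .str.strip())

for name in ['From', 'to']:
    df[name] = clean(df[name])

print(df)
```

Passing regex=True explicitly keeps the behaviour the same across pandas versions, since the default for str.replace has changed over time.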