pandas: How to remove parenthesized text from strings and save the result as a list of strings

Time:11-16

I have a list:

x
['Unnamed: 0', 'TSPAN6 (ENSG00000000003)', 'TNMD (ENSG00000000005)',
       'DPM1 (ENSG00000000419)', 'SCYL3 (ENSG00000000457)',
       'C1orf112 (ENSG00000000460)', 'FGR (ENSG00000000938)',
       'CFH (ENSG00000000971)', 'FUCA2 (ENSG00000001036)',
       'GCLC (ENSG00000001084)',
       ...
       'ERCC-00157', 'ERCC-00158', 'ERCC-00160', 'ERCC-00162', 'ERCC-00163',
       'ERCC-00164', 'ERCC-00165', 'ERCC-00168', 'ERCC-00170', 'ERCC-00171'],
      dtype='object', length=52055)

I want to remove the parentheses and the text inside them from each element. I tried:

re.sub(r"\([^()]*\)", "",str(x))

'Unnamed: 0                                Unnamed: 0\nTSPAN6     TSPAN6 \nTNMD         TNMD \nDPM1         DPM1 \nSCYL3       SCYL3 \n                                      ...           \nERCC-00164                                ERCC-00164\nERCC-00165                                ERCC-00165\nERCC-00168                                ERCC-00168\nERCC-00170                                ERCC-00170\nERCC-00171                                ERCC-00171\nName: gene_type, Length: 52055, dtype: object'

But the output looks wrong. I want [TSPAN6, TNMD, DPM1, ...]

Thanks

CodePudding user response:

From your data:

data = ['TSPAN6 (ENSG00000000003)', 'TNMD (ENSG00000000005)',
        'DPM1 (ENSG00000000419)', 'SCYL3 (ENSG00000000457)',
        'C1orf112 (ENSG00000000460)', 'FGR (ENSG00000000938)',
        'CFH (ENSG00000000971)', 'FUCA2 (ENSG00000001036)',
        'GCLC (ENSG00000001084)',
        'ERCC-00157', 'ERCC-00158', 'ERCC-00160', 'ERCC-00162', 'ERCC-00163',
        'ERCC-00164', 'ERCC-00165', 'ERCC-00168', 'ERCC-00170', 'ERCC-00171']

We can use the following regex to get the expected result:

import re

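new_list = []
for elt in data:
    # Raw string avoids invalid-escape warnings; the pattern removes the
    # space and the bracketed ID that follows it, e.g. " (ENSG00000000003)"
    new_list.append(re.sub(r"[ \(\[].*?[\) \]]", "", elt))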

Output:

['TSPAN6',
 'TNMD',
 'DPM1',
 'SCYL3',
 'C1orf112',
 'FGR',
 'CFH',
 'FUCA2',
 'GCLC',
 'ERCC-00157',
 'ERCC-00158',
 'ERCC-00160',
 'ERCC-00162',
 'ERCC-00163',
 'ERCC-00164',
 'ERCC-00165',
 'ERCC-00168',
 'ERCC-00170',
 'ERCC-00171']
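
The pattern above also treats a plain space as a delimiter, so a label containing two or more spaces can lose text between them (for example, 'my gene name (ID1)' becomes 'myname'). The attempt in the question failed for a different reason: str(x) turns the whole object into its printed representation, so re.sub ran on one display string rather than on each element. A minimal sketch of a stricter variant follows, assuming the only thing to remove is a single "(...)" group at the end of each label. The names TRAILING_PARENS, strip_trailing_parens and sample are hypothetical and not from the original post.

```python
import re

# Optional whitespace, then one "(...)" group, anchored at the end of the string.
TRAILING_PARENS = re.compile(r"\s*\([^()]*\)\s*$")

def strip_trailing_parens(labels):
    """Return a list of labels with a trailing "(...)" group removed.

    `labels` can be any iterable of strings; it is processed element by element.
    """
    return [TRAILING_PARENS.sub("", str(label)) for label in labels]

# Hypothetical sample, including a label with internal spaces.
sample = ["Unnamed: 0", "TSPAN6 (ENSG00000000003)", "ERCC-00157", "my gene name (ID1)"]
cleaned = strip_trailing_parens(sample)
print(cleaned)
```

If x is a pandas Index as in the question, x.str.replace(r"\s*\([^()]*\)\s*$", "", regex=True).tolist() should give the same result without an explicit loop.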