I have a list:
x
['Unnamed: 0', 'TSPAN6 (ENSG00000000003)', 'TNMD (ENSG00000000005)',
'DPM1 (ENSG00000000419)', 'SCYL3 (ENSG00000000457)',
'C1orf112 (ENSG00000000460)', 'FGR (ENSG00000000938)',
'CFH (ENSG00000000971)', 'FUCA2 (ENSG00000001036)',
'GCLC (ENSG00000001084)',
...
'ERCC-00157', 'ERCC-00158', 'ERCC-00160', 'ERCC-00162', 'ERCC-00163',
'ERCC-00164', 'ERCC-00165', 'ERCC-00168', 'ERCC-00170', 'ERCC-00171'],
dtype='object', length=52055)
I want to remove the parentheses and the IDs inside them from each element. I tried:
re.sub(r"\([^()]*\)", "",str(x))
'Unnamed: 0 Unnamed: 0\nTSPAN6 TSPAN6 \nTNMD TNMD \nDPM1 DPM1 \nSCYL3 SCYL3 \n ... \nERCC-00164 ERCC-00164\nERCC-00165 ERCC-00165\nERCC-00168 ERCC-00168\nERCC-00170 ERCC-00170\nERCC-00171 ERCC-00171\nName: gene_type, Length: 52055, dtype: object'
but the output looks wrong. I want ['TSPAN6', 'TNMD', 'DPM1', ...]
Thanks
CodePudding user response:
From your data:
data = ['TSPAN6 (ENSG00000000003)', 'TNMD (ENSG00000000005)',
'DPM1 (ENSG00000000419)', 'SCYL3 (ENSG00000000457)',
'C1orf112 (ENSG00000000460)', 'FGR (ENSG00000000938)',
'CFH (ENSG00000000971)', 'FUCA2 (ENSG00000001036)',
'GCLC (ENSG00000001084)',
'ERCC-00157', 'ERCC-00158', 'ERCC-00160', 'ERCC-00162', 'ERCC-00163',
'ERCC-00164', 'ERCC-00165', 'ERCC-00168', 'ERCC-00170', 'ERCC-00171']
We can apply the following regex to each element to get the expected result:
import re
new_list = []
for elt in data:
    # Remove the space and the parenthesised ID, e.g. " (ENSG00000000003)"
    new_list.append(re.sub(r"[ \(\[].*?[\) \]]", "", elt))
Output:
['TSPAN6',
'TNMD',
'DPM1',
'SCYL3',
'C1orf112',
'FGR',
'CFH',
'FUCA2',
'GCLC',
'ERCC-00157',
'ERCC-00158',
'ERCC-00160',
'ERCC-00162',
'ERCC-00163',
'ERCC-00164',
'ERCC-00165',
'ERCC-00168',
'ERCC-00170',
'ERCC-00171']
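The attempt in the question looks odd because `str(x)` turns the whole object into its printed representation (one long string, truncated with `...`), so `re.sub` never sees the individual elements. Here is a minimal sketch of a slightly stricter variant that only removes a parenthesised suffix, assuming every ID appears in parentheses as in the sample. The names `ID_SUFFIX` and `strip_id` are hypothetical, introduced only for illustration:

```python
import re

# Optional whitespace followed by one parenthesised group,
# e.g. " (ENSG00000000003)". Assumes IDs are never nested.
ID_SUFFIX = re.compile(r"\s*\([^()]*\)")

def strip_id(name):
    """Return name without its parenthesised ID (hypothetical helper)."""
    return ID_SUFFIX.sub("", name)

sample = ['Unnamed: 0', 'TSPAN6 (ENSG00000000003)',
          'C1orf112 (ENSG00000000460)', 'ERCC-00157']

# Apply the substitution element by element, not to str(sample)
cleaned = [strip_id(name) for name in sample]
print(cleaned)
# ['Unnamed: 0', 'TSPAN6', 'C1orf112', 'ERCC-00157']
```

Because this pattern matches only on parentheses, names that contain more than one space are left untouched, whereas a pattern that opens and closes on a space could cut text out of them. If `x` is a pandas Index or Series rather than a plain list, the same pattern can probably be applied with `x.str.replace(r"\s*\([^()]*\)", "", regex=True)`.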