How do I use regex to remove substring before a pipe in pandas dataframe?

In the pandas DataFrame cna, for every value in the Hugo_Symbol column: if the value contains a pipe (|) followed by "ENSG*", remove everything up to and including the pipe.

My code:

import re
cna["Hugo_Symbol"] = [re.sub(r"^\|.*", "", str(x)) for x in cna["Hugo_Symbol"]]

Current cna dataframe

	Hugo_Symbol	TCGA_1	TCGA_2	TCGA_3
0	GENEID|ENSG12345	0.1	0.2	0.3
1	GENEA	0.4	0.5	0.6
2	ANOTHERGENEID|ENSG6789	0.7	0.8	0.9
3	GENEB	1.0	1.1	1.2

Desired output

	Hugo_Symbol	TCGA_1	TCGA_2	TCGA_3
0	ENSG12345	0.1	0.2	0.3
1	GENEA	0.4	0.5	0.6
2	ENSG6789	0.7	0.8	0.9
3	GENEB	1.0	1.1	1.2

CodePudding user response：

You can use Series.str.replace:

cna["Hugo_Symbol"] = cna["Hugo_Symbol"].str.replace(r'^[^|]*\|', '', regex=True)

Details:

^ - start of string
[^|]* - zero or more chars other than |
\| - a | char.
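
To make the breakdown above concrete, here is a small sketch that applies the pattern with Python's re module to sample values from the question (the variable names are only illustrative). It also shows why the pattern from the question, ^\|.*, leaves the values untouched: it can only match when the string itself starts with a pipe.

```python
import re

samples = ["GENEID|ENSG12345", "GENEA", "ANOTHERGENEID|ENSG6789"]

# Pattern from the question: anchored at the start, so the string must begin with "|".
question_pattern = re.compile(r"^\|.*")
# Pattern from this answer: everything up to and including the first "|".
answer_pattern = re.compile(r"^[^|]*\|")

unchanged = [question_pattern.sub("", s) for s in samples]
cleaned = [answer_pattern.sub("", s) for s in samples]

print(unchanged)  # ['GENEID|ENSG12345', 'GENEA', 'ANOTHERGENEID|ENSG6789']
print(cleaned)    # ['ENSG12345', 'GENEA', 'ENSG6789']
```

Values without a pipe, such as GENEA, do not match at all and are returned as they are.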

Pandas test:

import pandas as pd
cna = pd.DataFrame({'Hugo_Symbol':['GENEID|ENSG12345', 'GENEA'], 'TCGA_1':[0.1, 0.4]})
cna["Hugo_Symbol"].str.replace(r'^[^|]*\|', '', regex=True)
0    ENSG12345
1        GENEA
Name: Hugo_Symbol, dtype: object
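
The question only asks to strip the prefix when the pipe is followed by "ENSG", whereas the pattern above strips up to the first pipe in any case. If that distinction matters for your data, one possible variant adds a lookahead. This is only a sketch: the "FOO|BAR" row is made up for illustration, and the pattern is passed as a compiled regex, which str.replace also accepts.

```python
import re

import pandas as pd

cna = pd.DataFrame({
    "Hugo_Symbol": ["GENEID|ENSG12345", "GENEA", "ANOTHERGENEID|ENSG6789", "FOO|BAR"],
    "TCGA_1": [0.1, 0.4, 0.7, 1.0],
})

# (?=ENSG) is a lookahead: the pipe must be followed by "ENSG",
# but "ENSG" itself is not consumed, so it stays in the result.
ensg_prefix = re.compile(r"^[^|]*\|(?=ENSG)")

result = cna["Hugo_Symbol"].str.replace(ensg_prefix, "", regex=True)
print(result.tolist())
# ['ENSG12345', 'GENEA', 'ENSG6789', 'FOO|BAR']
```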

NOTE on regex=True:

According to the pandas 1.2.0 release notes:

The default value of regex for Series.str.replace() will change from True to False in a future release. In addition, single character regular expressions will not be treated as literal strings when regex=True is set (GH24804).
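
Passing regex explicitly therefore keeps the behaviour the same regardless of the installed pandas version (as far as I know, the default did become False in pandas 2.0). A minimal sketch, using a made-up Series s, of what happens when the same pattern is treated as a literal string instead of a regular expression:

```python
import pandas as pd

s = pd.Series(["GENEID|ENSG12345", "GENEA"])
pattern = r"^[^|]*\|"

# regex=True: the pattern is interpreted as a regular expression.
as_regex = s.str.replace(pattern, "", regex=True)
# regex=False: the pattern is searched for as literal text, which never occurs here.
as_literal = s.str.replace(pattern, "", regex=False)

print(as_regex.tolist())    # ['ENSG12345', 'GENEA']
print(as_literal.tolist())  # ['GENEID|ENSG12345', 'GENEA']
```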

CodePudding user response：

You can use a simple regex with str.replace:

cna['Hugo_Symbol'] = cna['Hugo_Symbol'].str.replace(r'^(.*\|)', '', regex=True)

Output:

  Hugo_Symbol  TCGA_1  TCGA_2  TCGA_3
0   ENSG12345     0.1     0.2     0.3
1       GENEA     0.4     0.5     0.6
2    ENSG6789     0.7     0.8     0.9
3       GENEB     1.0     1.1     1.2
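
One thing to be aware of with this pattern: .* is greedy, so ^(.*\|) removes everything up to the last pipe in the string, whereas a pattern like ^[^|]*\| stops at the first one. For the data in the question both give the same result. A small sketch with a made-up value containing two pipes shows the difference:

```python
import re

value = "GENEID|ENSG12345|extra"  # hypothetical value with two pipes

# Greedy .* backtracks only as far as the last "|".
up_to_last = re.sub(r"^(.*\|)", "", value)
# [^|]* cannot cross a "|", so the match ends at the first one.
up_to_first = re.sub(r"^[^|]*\|", "", value)

print(up_to_last)   # extra
print(up_to_first)  # ENSG12345|extra
```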