How do I use regex to remove substring before a pipe in pandas dataframe?

Time:05-14

In the pandas DataFrame cna, for every value in the Hugo_Symbol column that contains a pipe (|) followed by "ENSG*", I want to remove everything up to and including the pipe.

My code:

import re
cna["Hugo_Symbol"] = [re.sub(r"^\|.*", "", str(x)) for x in cna["Hugo_Symbol"]]

Current cna dataframe

              Hugo_Symbol  TCGA_1  TCGA_2  TCGA_3
0        GENEID|ENSG12345     0.1     0.2     0.3
1                   GENEA     0.4     0.5     0.6
2  ANOTHERGENEID|ENSG6789     0.7     0.8     0.9
3                   GENEB     1.0     1.1     1.2

Desired output

  Hugo_Symbol  TCGA_1  TCGA_2  TCGA_3
0   ENSG12345     0.1     0.2     0.3
1       GENEA     0.4     0.5     0.6
2    ENSG6789     0.7     0.8     0.9
3       GENEB     1.0     1.1     1.2

CodePudding user response:

You can use Series.str.replace with a regex that matches everything up to and including the first pipe:

cna["Hugo_Symbol"] = cna["Hugo_Symbol"].str.replace(r'^[^|]*\|', '', regex=True)

Details:

  • ^ - start of string
  • [^|]* - zero or more chars other than |
  • \| - a | char.
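To see why the pattern in the question has no effect while this one works, here is a minimal sketch using only the standard re module on the four values from the question. The names samples, unchanged and cleaned are just illustrative. The original pattern ^\|.* can only match a string that begins with a pipe, which none of these values do.

```python
import re

samples = ["GENEID|ENSG12345", "GENEA", "ANOTHERGENEID|ENSG6789", "GENEB"]

# Pattern from the question: "^\|" requires the pipe to be the very first
# character, so nothing matches and every value comes back unchanged.
unchanged = [re.sub(r"^\|.*", "", s) for s in samples]

# Pattern from this answer: from the start, consume any non-pipe characters,
# then one literal pipe. Values without a pipe are left alone.
cleaned = [re.sub(r"^[^|]*\|", "", s) for s in samples]

print(unchanged)
print(cleaned)  # ['ENSG12345', 'GENEA', 'ENSG6789', 'GENEB']
```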

Pandas test:

import pandas as pd
cna = pd.DataFrame({'Hugo_Symbol':['GENEID|ENSG12345', 'GENEA'], 'TCGA_1':[0.1, 0.4]})
print(cna["Hugo_Symbol"].str.replace(r'^[^|]*\|', '', regex=True))

Output:

0    ENSG12345
1        GENEA
Name: Hugo_Symbol, dtype: object
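The question only asks to strip the prefix when the pipe is followed by "ENSG". If some values might contain a pipe followed by something else, one possible variant adds a lookahead. This is a sketch under that assumption; the "GENEX|OTHER" row is made up for illustration and is not in the original data.

```python
import pandas as pd

# Hypothetical data: the last row has a pipe that is NOT followed by "ENSG".
cna = pd.DataFrame({"Hugo_Symbol": ["GENEID|ENSG12345", "GENEA", "GENEX|OTHER"]})

# (?=ENSG) is a lookahead: the pipe must be followed by "ENSG",
# but "ENSG" itself is not consumed, so it stays in the result.
cna["Hugo_Symbol"] = cna["Hugo_Symbol"].str.replace(
    r"^[^|]*\|(?=ENSG)", "", regex=True
)

print(cna["Hugo_Symbol"].tolist())  # ['ENSG12345', 'GENEA', 'GENEX|OTHER']
```

With the plain pattern (no lookahead), the last row would become "OTHER" instead of staying as it is.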

NOTE on regex=True:

According to the pandas 1.2.0 release notes:

The default value of regex for Series.str.replace() will change from True to False in a future release. In addition, single character regular expressions will not be treated as literal strings when regex=True is set (GH24804).
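In practice this means it is safer to always pass regex explicitly. As far as I know the default did switch to False in pandas 2.0, so leaving it out there makes the pattern a literal string. A small sketch of the difference (the variable names are just illustrative):

```python
import pandas as pd

s = pd.Series(["GENEID|ENSG12345", "GENEA"])

# regex=True: the pattern is interpreted as a regular expression.
as_regex = s.str.replace(r"^[^|]*\|", "", regex=True)

# regex=False: the pattern is searched for as a literal substring.
# The text "^[^|]*\|" never occurs in the data, so nothing is replaced.
as_literal = s.str.replace(r"^[^|]*\|", "", regex=False)

print(as_regex.tolist())    # ['ENSG12345', 'GENEA']
print(as_literal.tolist())  # ['GENEID|ENSG12345', 'GENEA']
```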

CodePudding user response:

You can use a simple regex with str.replace:

cna['Hugo_Symbol'] = cna['Hugo_Symbol'].str.replace(r'^(.*\|)', '', regex=True)

output:

  Hugo_Symbol  TCGA_1  TCGA_2  TCGA_3
0   ENSG12345     0.1     0.2     0.3
1       GENEA     0.4     0.5     0.6
2    ENSG6789     0.7     0.8     0.9
3       GENEB     1.0     1.1     1.2
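One thing to be aware of: .* is greedy, so ^(.*\|) removes everything up to the last pipe, while ^[^|]*\| from the other answer stops at the first pipe. On the data in the question both give the same result; they only differ if a value has more than one pipe. A quick sketch with a made-up value (not from the original data):

```python
import re

value = "GENEID|ALIAS|ENSG12345"  # hypothetical value with two pipes

# Negated character class: stops at the FIRST pipe.
up_to_first = re.sub(r"^[^|]*\|", "", value)

# Greedy .* : backtracks from the end, so it stops at the LAST pipe.
up_to_last = re.sub(r"^(.*\|)", "", value)

print(up_to_first)  # ALIAS|ENSG12345
print(up_to_last)   # ENSG12345
```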
