In the cna pandas DataFrame, for every value in the Hugo_Symbol column, if there is a pipe (|) followed by "ENSG*", remove everything up to and including the pipe.
My code:
import re
cna["Hugo_Symbol"] = [re.sub(r"^\|.*", "", str(x)) for x in cna["Hugo_Symbol"]]
Current cna dataframe:
              Hugo_Symbol  TCGA_1  TCGA_2  TCGA_3
0        GENEID|ENSG12345     0.1     0.2     0.3
1                   GENEA     0.4     0.5     0.6
2  ANOTHERGENEID|ENSG6789     0.7     0.8     0.9
3                   GENEB     1.0     1.1     1.2
Desired output
  Hugo_Symbol  TCGA_1  TCGA_2  TCGA_3
0   ENSG12345     0.1     0.2     0.3
1       GENEA     0.4     0.5     0.6
2    ENSG6789     0.7     0.8     0.9
3       GENEB     1.0     1.1     1.2
CodePudding user response:
You need to use Series.str.replace:
cna["Hugo_Symbol"] = cna["Hugo_Symbol"].str.replace(r'^[^|]*\|', '', regex=True)
Details:
^ - start of string
[^|]* - zero or more chars other than |
\| - a | char.
See the regex demo.
Pandas test:
import pandas as pd
cna = pd.DataFrame({'Hugo_Symbol':['GENEID|ENSG12345', 'GENEA'], 'TCGA_1':[0.1, 0.4]})
cna["Hugo_Symbol"].str.replace(r'^[^|]*\|', '', regex=True)
0 ENSG12345
1 GENEA
Name: Hugo_Symbol, dtype: object
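The pattern above strips the prefix before any pipe, whether or not "ENSG" follows it. If the prefix should go only when the pipe is followed by "ENSG", as the question words it, one possible variant adds a lookahead. This is a minimal sketch: the third sample value and the name `cleaned` are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data; the third value has a pipe that is NOT followed by "ENSG".
cna = pd.DataFrame({
    "Hugo_Symbol": ["GENEID|ENSG12345", "GENEA", "GENEX|OTHER678"],
    "TCGA_1": [0.1, 0.4, 0.7],
})

# (?=ENSG) is a lookahead: the prefix and pipe are removed only when
# "ENSG" comes right after the pipe; the "ENSG..." text itself is kept.
cleaned = cna["Hugo_Symbol"].str.replace(r"^[^|]*\|(?=ENSG)", "", regex=True)

print(cleaned.tolist())
# ['ENSG12345', 'GENEA', 'GENEX|OTHER678']
```

Values without a pipe, or with a pipe followed by something else, are left untouched.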
NOTE on regex=True:
According to the pandas 1.2.0 release notes:
The default value of regex for Series.str.replace() will change from True to False in a future release. In addition, single character regular expressions will not be treated as literal strings when regex=True is set (GH24804).
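To see why passing regex explicitly matters, here is a small sketch comparing the two settings on the same pattern. The variable names are illustrative only.

```python
import pandas as pd

s = pd.Series(["GENEID|ENSG12345", "GENEA"])
pattern = r"^[^|]*\|"

# regex=False: the pattern is searched for as literal text, which never
# occurs in these values, so nothing is replaced.
as_literal = s.str.replace(pattern, "", regex=False)

# regex=True: the pattern is compiled as a regular expression.
as_regex = s.str.replace(pattern, "", regex=True)

print(as_literal.tolist())  # ['GENEID|ENSG12345', 'GENEA']
print(as_regex.tolist())    # ['ENSG12345', 'GENEA']
```

Setting regex=True explicitly keeps the behaviour the same regardless of which default the installed pandas version uses.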
CodePudding user response:
You can use a simple regex with str.replace:
cna['Hugo_Symbol'] = cna['Hugo_Symbol'].str.replace(r'^(.*\|)', '', regex=True)
output:
Hugo_Symbol TCGA_1 TCGA_2 TCGA_3
0 ENSG12345 0.1 0.2 0.3
1 GENEA 0.4 0.5 0.6
2 ENSG6789 0.7 0.8 0.9
3 GENEB 1.0 1.1 1.2
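One difference between the two patterns suggested in this thread: `^[^|]*\|` stops at the first pipe, while the greedy `^(.*\|)` runs to the last one. On the sample data they agree, but they diverge when a value has more than one pipe. A short sketch, using a made-up value `A|B|ENSG1` that is not in the original data:

```python
import pandas as pd

# "A|B|ENSG1" is a hypothetical value with two pipes.
s = pd.Series(["A|B|ENSG1", "GENEID|ENSG12345", "GENEA"])

# [^|]* cannot cross a pipe, so only the text up to the FIRST pipe is removed.
first_pipe = s.str.replace(r"^[^|]*\|", "", regex=True)

# .* is greedy, so everything up to the LAST pipe is removed.
last_pipe = s.str.replace(r"^(.*\|)", "", regex=True)

print(first_pipe.tolist())  # ['B|ENSG1', 'ENSG12345', 'GENEA']
print(last_pipe.tolist())   # ['ENSG1', 'ENSG12345', 'GENEA']
```

Which one is preferable depends on whether the identifier to keep is always the last pipe-separated field.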