I just want to get rid of the ".xxxxx" part (the dot and everything after it) in each value of this column:
Gene_ID |
---|
ENSG00000000003.14 |
ENSG00000000005.5 |
ERCC-00164 |
ENSG00000002586.18_PAR_Y |
ENSG00000054803.3 |
ERCC-00012 |
ENSG00000284332.1 |
This is how I want it to look:
Gene_ID |
---|
ENSG00000000003 |
ENSG00000000005 |
ERCC-00164 |
ENSG00000002586 |
ENSG00000054803 |
ERCC-00012 |
ENSG00000284332 |
This is what I have tried:
df['Gene_ID'].str.replace('.','')
but when I do that it only gets rid of the decimal point, not the characters that come after it.
Note: the actual column is much longer than what I am showing here, and all of its values have that ".xxxx" suffix.
CodePudding user response:
Use Series.str.replace with the regex (\..*)$. Here \. matches the literal dot, .* matches whatever follows it, and $ marks the end of the string:
df['Gene_ID'] = df['Gene_ID'].str.replace(r'(\..*)$', '', regex=True)
print(df)
Gene_ID
0 ENSG00000000003
1 ENSG00000000005
2 ERCC-00164
3 ENSG00000002586
4 ENSG00000054803
5 ERCC-00012
6 ENSG00000284332
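To see what the pattern does on its own, here is a minimal sketch using Python's built-in re module on a few IDs from the question. It assumes pandas applies the same regex syntax when regex=True is passed, which holds for ordinary object-dtype string columns; the names gene_ids and version_suffix are just for illustration.

```python
import re

# Sample IDs copied from the question.
gene_ids = [
    "ENSG00000000003.14",
    "ERCC-00164",
    "ENSG00000002586.18_PAR_Y",
]

# A literal dot, then everything up to the end of the string.
version_suffix = re.compile(r"\..*$")

# IDs without a dot (the ERCC ones) are left unchanged.
stripped = [version_suffix.sub("", gene_id) for gene_id in gene_ids]
print(stripped)
# ['ENSG00000000003', 'ERCC-00164', 'ENSG00000002586']
```

The raw-string prefix (r'...') keeps Python from treating \. as a string escape before the regex engine sees it.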
CodePudding user response:
Check the comment above:
Note that . is a regex metacharacter that matches any character except a line break. To match a literal . you need to escape it with a backslash or put it in a character class, i.e. inside square brackets:
df['Gene_ID'] = df['Gene_ID'].str.replace('[.].*', '', regex=True)
df
Gene_ID
0 ENSG00000000003
1 ENSG00000000005
2 ERCC-00164
3 ENSG00000002586
4 ENSG00000054803
5 ERCC-00012
6 ENSG00000284332
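The difference between an unescaped dot and a literal dot can be checked with a small sketch like the one below. It uses the standard re module rather than pandas, on the assumption that the regex behaves the same way inside str.replace with regex=True; the variable names are made up for the example.

```python
import re

gene_id = "ENSG00000054803.3"

# Unescaped: "." matches any character, so every character is removed.
everything_removed = re.sub(".", "", gene_id)

# Escaped or bracketed: only a literal dot matches, followed by the rest.
escaped = re.sub(r"\..*", "", gene_id)
bracketed = re.sub("[.].*", "", gene_id)

print(repr(everything_removed))  # ''
print(escaped, bracketed)        # ENSG00000054803 ENSG00000054803
```

Both the backslash form and the bracket form give the same result here, so the choice between them is mostly a matter of readability.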