I'm new to pandas and have tried str.split()
and str.extract()
to split column POS
of numerical values with no success. My dataframe is about 3000 lines and is structured like this (note _
and -
delimiters in subset):
CHROM | POS | REF | ALT |
---|---|---|---|
7 | 105121514 | C | T |
17 | 7359940 | C | A |
X | 76777781 | A | G |
16 | 70531965-70531965 | C | G |
6 | 26093141-26093141 | G | A |
12 | 11905465 | C | T |
4 | 103527484_103527848 | G | A |
I would like for the dataframe to look like this (i.e. retain values preceding all delimiters):
CHROM | POS | REF | ALT |
---|---|---|---|
7 | 105121514 | C | T |
17 | 7359940 | C | A |
X | 76777781 | A | G |
16 | 70531965 | C | G |
6 | 26093141 | G | A |
12 | 11905465 | C | T |
4 | 103527484 | G | A |
My attempts have either split the rows only containing a delimiter and dropping all other rows or dropping all rows containing just the delimiters. Any guidance or suggestions appreciated! Thanks!
CodePudding user response:
This is not the most popular solution, but you can try.
df.POS = df.POS.str.replace("-", " ")
df.POS = df.POS.str.replace("_", " ")
df.POS = df.POS.str.split()
df.POS = [x[0] for x in df.POS]
CodePudding user response:
A possible solution, based on the idea of replacing all characters after _
or -
(inclusive) with the empty string (''
):
df['POS'] = df['POS'].str.replace(r'[-|_]\d ', '')
Output:
CHROM POS REF ALT
0 7 105121514 C T
1 17 7359940 C A
2 X 76777781 A G
3 16 70531965 C G
4 6 26093141 G A
5 12 11905465 C T
6 4 103527484 G A
CodePudding user response:
If you are on pandas >= 1.4, you can use a regex with str.split()
. Combine this with expand=True
and then just take the first result and I think you've got what you need.
df['POS'] = df['POS'].str.split('[-_]',expand=True,regex=True)[0]