Home > OS >  pandas split values in column
pandas split values in column

Time:11-12

I'm new to pandas and have tried str.split() and str.extract() to split column POS of numerical values with no success. My dataframe is about 3000 lines and is structured like this (note _ and - delimiters in subset):

CHROM POS REF ALT
7 105121514 C T
17 7359940 C A
X 76777781 A G
16 70531965-70531965 C G
6 26093141-26093141 G A
12 11905465 C T
4 103527484_103527848 G A

I would like for the dataframe to look like this (i.e. retain values preceding all delimiters):

CHROM POS REF ALT
7 105121514 C T
17 7359940 C A
X 76777781 A G
16 70531965 C G
6 26093141 G A
12 11905465 C T
4 103527484 G A

My attempts have either split the rows only containing a delimiter and dropping all other rows or dropping all rows containing just the delimiters. Any guidance or suggestions appreciated! Thanks!

CodePudding user response:

This is not the most popular solution, but you can try.

df.POS = df.POS.str.replace("-", " ")
df.POS = df.POS.str.replace("_", " ")
df.POS = df.POS.str.split()
df.POS = [x[0] for x in df.POS]

CodePudding user response:

A possible solution, based on the idea of replacing all characters after _ or - (inclusive) with the empty string (''):

df['POS'] = df['POS'].str.replace(r'[-|_]\d ', '')

Output:

  CHROM        POS REF ALT
0     7  105121514   C   T
1    17    7359940   C   A
2     X   76777781   A   G
3    16   70531965   C   G
4     6   26093141   G   A
5    12   11905465   C   T
6     4  103527484   G   A

CodePudding user response:

If you are on pandas >= 1.4, you can use a regex with str.split(). Combine this with expand=True and then just take the first result and I think you've got what you need.

df['POS'] = df['POS'].str.split('[-_]',expand=True,regex=True)[0]
  • Related