Extracting a substring from a column in pandas

I have a pandas dataframe:

#CHROM  POS     INFO
chr1    111    AC=0;AN=33
chr1    111    AC=0;AN=100
chr1    111    AC=110;AN=51
chr2    737    AC=99;AN=10003
chr2    888    AC=100;AN=1636

I want to create a new column based on the numbers in the INFO column, that is, the number N given as AC=N. The output should look like this:

#CHROM  POS    INFO            number
chr1    111    AC=0;AN=33      0
chr1    111    AC=0;AN=100     0
chr1    111    AC=110;AN=51    110
chr2    737    AC=99;AN=10003  99
chr2    888    AC=100;AN=1636  100

Insights will be appreciated.

CodePudding user response：

Use the following code:

df['INFO'].str.split(r'[=;]').str.get(1)

Output:

0      0
1      0
2    110
3     99
4    100
Name: INFO, dtype: object
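
Note that the result above is still a Series of strings (dtype: object). To store it as the requested number column, the same split can be assigned back and cast to integers. A minimal sketch, assuming the frame is rebuilt by hand from the question's data and that AC is always the first field in INFO:

```python
import pandas as pd

# Rebuild the question's frame by hand (values copied from the question).
df = pd.DataFrame({
    "#CHROM": ["chr1", "chr1", "chr1", "chr2", "chr2"],
    "POS": [111, 111, 111, 737, 888],
    "INFO": ["AC=0;AN=33", "AC=0;AN=100", "AC=110;AN=51",
             "AC=99;AN=10003", "AC=100;AN=1636"],
})

# Split on '=' or ';' and take element 1, i.e. the value right after the first '='.
# This only gives the AC value while AC is the first field in INFO.
df["number"] = df["INFO"].str.split(r"[=;]").str.get(1).astype(int)
```

After this, df["number"] holds integers rather than strings, so it can be used directly in comparisons and arithmetic.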

CodePudding user response：

If you know that AC is the first field, you can do

df['number'] = df.INFO.str.split(';').str[0].str.split('=').str[1].astype('int')

Or, with a regex, independently of position:

df['number'] = df.INFO.str.findall(r'AC=(\d+)').str[0].astype(int)

EDIT: rather than findall, we can use extract:

df['number'] = df.INFO.str.extract(r'AC=(\d+)').astype(int)

This spares the .str[0] part, and so it is the fastest so far (402 μs per run, vs 556 μs for my previous one and 563 μs for Kim's version).
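
As a self-contained check of the regex approach, here is a sketch on made-up rows where AC is not the first field, or is missing entirely. These rows are assumptions for illustration and are not taken from the question. Passing expand=False makes str.extract return a Series instead of a one-column DataFrame, and the nullable Int64 dtype keeps rows without AC as missing values, where a plain astype(int) would raise:

```python
import pandas as pd

# Hypothetical INFO values: AC first, AC last, and AC absent.
df = pd.DataFrame({"INFO": ["AC=110;AN=51", "AN=10003;AC=99", "AN=1636"]})

# (\d+) captures the run of digits that follows "AC=".
ac = df["INFO"].str.extract(r"AC=(\d+)", expand=False)

# Rows without a match come back as missing; convert to numbers first,
# then to the nullable Int64 dtype so they stay <NA> instead of raising.
df["number"] = pd.to_numeric(ac).astype("Int64")
```

If every row is known to contain AC, the plain .astype(int) from the answer is enough.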