I have a pandas dataframe:
#CHROM POS INFO
chr1 111 AC=0;AN=33
chr1 111 AC=0;AN=100
chr1 111 AC=110;AN=51
chr2 737 AC=99;AN=10003
chr2 888 AC=100;AN=1636
I want to create a new column based on the numbers in the INFO column, specifically the value given as AC=N. The output should look like this:
#CHROM POS INFO number
chr1 111 AC=0;AN=33 0
chr1 111 AC=0;AN=100 0
chr1 111 AC=110;AN=51 110
chr2 737 AC=99;AN=10003 99
chr2 888 AC=100;AN=1636 100
Any insights would be appreciated.
CodePudding user response:
Use the following code (it assumes AC is the first field in INFO):
df['INFO'].str.split(r'[=;]').str.get(1)
output:
0 0
1 0
2 110
3 99
4 100
Name: INFO, dtype: object
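Note that the result above has dtype object, i.e. the values are still strings. As a minimal sketch, assuming the sample frame from the question is rebuilt by hand and that AC is always the first field, the same split can be assigned to a new integer column like this:

```python
import pandas as pd

# Sample data copied from the question
df = pd.DataFrame({
    "#CHROM": ["chr1", "chr1", "chr1", "chr2", "chr2"],
    "POS": [111, 111, 111, 737, 888],
    "INFO": ["AC=0;AN=33", "AC=0;AN=100", "AC=110;AN=51",
             "AC=99;AN=10003", "AC=100;AN=1636"],
})

# Splitting on '=' or ';' gives e.g. ['AC', '0', 'AN', '33'];
# element 1 is the AC value only because AC comes first in INFO.
df["number"] = df["INFO"].str.split(r"[=;]").str.get(1).astype(int)

print(df)
```

If AC can appear at another position in INFO, this picks up the wrong value, so a key-based pattern is safer in that case.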
CodePudding user response:
If you know that AC is the first field, you can do
df['number'] = df.INFO.str.split(';').str[0].str.split('=').str[1].astype('int')
Or, with a regex, independently of position
df['number'] = df.INFO.str.findall(r'AC=(\d+)').str[0].astype(int)
EDIT: rather than findall, we can use extract
df['number'] = df.INFO.str.extract(r'AC=(\d+)').astype(int)
This spares the .str[0] part, and so it is the fastest so far (402 μs per run, vs 556 μs for my previous one and 563 μs for Kim's version).
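One caveat with the regex versions: .astype(int) raises an error if any row has no AC field, and a bare AC= pattern would also match inside a longer key. A hedged sketch of a more defensive variant follows; the sample values (including the MAC key) are hypothetical and not from the question, and the anchored pattern is my own assumption about how the INFO keys are delimited:

```python
import pandas as pd

# Hypothetical INFO values: AC not first, a look-alike key, and AC missing
info = pd.Series(["AN=33;AC=7", "MAC=9;AC=110;AN=51", "AN=10"])

# (?:^|;) requires AC to start the string or follow a ';',
# so 'MAC=9' is not mistaken for AC. expand=False returns a Series.
ac = info.str.extract(r"(?:^|;)AC=(\d+)", expand=False)

# Rows without AC become missing values; the nullable Int64 dtype
# keeps the others as integers instead of floats.
number = pd.to_numeric(ac).astype("Int64")

print(number)
```

The nullable Int64 dtype is a design choice here: a plain int column cannot hold missing values, so without it the rows lacking AC would have to be dropped or filled first.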