I have a pandas data frame in which one of the columns looks like this.
INFO
SVTYPE=CNV;END=401233
SVTYPE=CNV;END=401233;CSQT=1|BHAT12|ESNT12345|
SVTYPE=CNV;END=401233;CSQT=1|JHV87|ESNT12345|,1|HJJUB2|ESNT12345|
SVTYPE=CNV;END=401233;CSQT=1|GFTREF|ESNT12345|,1|321lkj|ESNT12345|,1|16-YHGT|ESNT12345|...
The information I want to extract in new columns is gene|ESNT12345
. For the same example should be
gene1 gene2 gene3
Na Na Na
BHAT12|ESNT12345 Na Na
JHV87|ESNT12345 HJJUB2|ESNT12345 Na
GFTREF|ESNT12345 321lkj|ESNT12345 16-YHGT|ESNT12345
How can I do this working with pandas? I have been trying with .apply(lambda x:x.split("|")
. But as I don't know the number of gene_name|ESNT12345
my dataset has and also this will be used in an application that will take thousands of different data frames, I am looking for a way of dynamically creating the necessary columns.
How can I do this?
CodePudding user response:
IIUC, you could use a regex and str.extractall
.
joining to the original data:
new_df = df.join(
df['INFO']
.str.extractall(r'(\w \|ESNT\d )')[0]
.unstack(level='match')
.add_prefix('gene_')
)
output:
INFO gene_0 gene_1 gene_2
0 SVTYPE=CNV;END=401233 NaN NaN NaN
1 SVTYPE=CNV;END=401233;CSQT=1|BHAT12|ESNT12345| BHAT12|ESNT12345 NaN NaN
2 SVTYPE=CNV;END=401233;CSQT=1|JHV87|ESNT12345|,1|HJJUB2|ESNT12345| JHV87|ESNT12345 HJJUB2|ESNT12345 NaN
3 SVTYPE=CNV;END=401233;CSQT=1|GFTREF|ESNT12345|,1|321lkj|ESNT12345|,1|16-YHGT|ESNT12345|... GFTREF|ESNT12345 321lkj|ESNT12345 YHGT|ESNT12345
without joining to the original data:
new_df = (df['INFO']
.str.extractall(r'(\w \|ESNT\d )')[0]
.unstack(level='match')
.add_prefix('gene_')
.reindex(df.index)
)
output:
match gene_0 gene_1 gene_2
0 NaN NaN NaN
1 BHAT12|ESNT12345 NaN NaN
2 JHV87|ESNT12345 HJJUB2|ESNT12345 NaN
3 GFTREF|ESNT12345 321lkj|ESNT12345 YHGT|ESNT12345
regex hack to have gene1, gene2…
If you really want to have the genes counter to start with 1, you could use this small regex hack (match the beginning of the string as match 0
and drop it):
new_df = (df['INFO']
.str.extractall(r'(^|\w \|ESNT\d )')[0]
.unstack(level='match')
.iloc[:, 1:]
.add_prefix('gene')
.reindex(df.index)
)
output:
match gene1 gene2 gene3
0 NaN NaN NaN
1 BHAT12|ESNT12345 NaN NaN
2 JHV87|ESNT12345 HJJUB2|ESNT12345 NaN
3 GFTREF|ESNT12345 321lkj|ESNT12345 YHGT|ESNT12345