Extracting Info From A Column that contains irregular structure of ";" and "|" s-CodePudding

I have a pandas data frame in which one of the columns looks like this.

INFO
SVTYPE=CNV;END=401233
SVTYPE=CNV;END=401233;CSQT=1|BHAT12|ESNT12345|
SVTYPE=CNV;END=401233;CSQT=1|JHV87|ESNT12345|,1|HJJUB2|ESNT12345|
SVTYPE=CNV;END=401233;CSQT=1|GFTREF|ESNT12345|,1|321lkj|ESNT12345|,1|16-YHGT|ESNT12345|...

The information I want to extract in new columns is gene|ESNT12345 . For the same example should be

gene1               gene2               gene3
Na                  Na                  Na
BHAT12|ESNT12345 Na                  Na
JHV87|ESNT12345 HJJUB2|ESNT12345 Na
GFTREF|ESNT12345 321lkj|ESNT12345 16-YHGT|ESNT12345

How can I do this working with pandas? I have been trying with .apply(lambda x:x.split("|"). But as I don't know the number of gene_name|ESNT12345 my dataset has and also this will be used in an application that will take thousands of different data frames, I am looking for a way of dynamically creating the necessary columns.

How can I do this?

CodePudding user response：

IIUC, you could use a regex and str.extractall.

joining to the original data:

new_df = df.join(
 df['INFO']
 .str.extractall(r'(\w \|ESNT\d )')[0]
 .unstack(level='match')
 .add_prefix('gene_')
)

output:

                                                                                         INFO            gene_0            gene_1          gene_2
0                                                                       SVTYPE=CNV;END=401233               NaN               NaN             NaN
1                                              SVTYPE=CNV;END=401233;CSQT=1|BHAT12|ESNT12345|  BHAT12|ESNT12345               NaN             NaN
2                           SVTYPE=CNV;END=401233;CSQT=1|JHV87|ESNT12345|,1|HJJUB2|ESNT12345|   JHV87|ESNT12345  HJJUB2|ESNT12345             NaN
3  SVTYPE=CNV;END=401233;CSQT=1|GFTREF|ESNT12345|,1|321lkj|ESNT12345|,1|16-YHGT|ESNT12345|...  GFTREF|ESNT12345  321lkj|ESNT12345  YHGT|ESNT12345

without joining to the original data:

new_df = (df['INFO']
 .str.extractall(r'(\w \|ESNT\d )')[0]
 .unstack(level='match')
 .add_prefix('gene_')
 .reindex(df.index)
)

output:

match            gene_0            gene_1          gene_2
0                   NaN               NaN             NaN
1      BHAT12|ESNT12345               NaN             NaN
2       JHV87|ESNT12345  HJJUB2|ESNT12345             NaN
3      GFTREF|ESNT12345  321lkj|ESNT12345  YHGT|ESNT12345

regex hack to have gene1, gene2…

If you really want to have the genes counter to start with 1, you could use this small regex hack (match the beginning of the string as match 0 and drop it):

new_df = (df['INFO']
 .str.extractall(r'(^|\w \|ESNT\d )')[0]
 .unstack(level='match')
 .iloc[:, 1:]
 .add_prefix('gene')
 .reindex(df.index)
)

output:

match             gene1             gene2           gene3
0                   NaN               NaN             NaN
1      BHAT12|ESNT12345               NaN             NaN
2       JHV87|ESNT12345  HJJUB2|ESNT12345             NaN
3      GFTREF|ESNT12345  321lkj|ESNT12345  YHGT|ESNT12345