Generating new columns from row values in Python

Time:12-21

I have the following pandas DataFrame (HC_subset_umls):

    term            code            source  term_normlz     CUI         CODE        SAB     TTY     STR
0   B-cell lymphoma meddra:10003899 meddra  b-cell lymphoma C0079731    MTHU019696  OMIM    PTCS    b-cell lymphoma
1   B-cell lymphoma meddra:10003899 meddra  b-cell lymphoma C0079731    10003899    MDR     PT  b-cell lymphoma
2   Astrocytoma     meddra:10003571 meddra  astrocytoma     C0004114    10003571    MDR     PT  astrocytoma
3   Astrocytoma     meddra:10003571 meddra  astrocytoma     C0004114    D001254     MSH     MH  astrocytoma

I would like to group the rows that share a CUI and generate new columns for each SAB value.

The desired output is:

    term            code            source  term_normlz     CUI         OMIM_CODE   OMIM_TTY  OMIM_STR         MDR_CODE  MDR_TTY  MDR_STR          MSH_CODE  MSH_TTY  MSH_STR
0   B-cell lymphoma meddra:10003899 meddra  b-cell lymphoma C0079731    MTHU019696  PTCS      b-cell lymphoma  10003899  PT       b-cell lymphoma  NA        NA       NA
2   Astrocytoma     meddra:10003571 meddra  astrocytoma     C0004114    NA          NA        NA               10003571  PT       astrocytoma      D001254   MH       astrocytoma

I am using the following lines of code:

HC_subset_umls['OMIM_CODE'] = (
    HC_subset_umls['CUI']
    .map(
        HC_subset_umls
        .groupby('CUI')
        .apply(lambda x: x.loc[x['SAB'].isin(['OMIM']), 'CODE'].values[0])
    )
)


HC_subset_umls['OMIM_STR'] = (
    HC_subset_umls['CUI']
    .map(
        HC_subset_umls
        .groupby('CUI')
        .apply(lambda x: x.loc[x['SAB'].isin(['OMIM']), 'STR'].values[0])
    )
)

HC_subset_umls['OMIM_TTY'] = (
    HC_subset_umls['CUI']
    .map(
        HC_subset_umls
        .groupby('CUI')
        .apply(lambda x: x.loc[x['SAB'].isin(['OMIM']), 'TTY'].values[0])
    )
)

HC_subset_umls = HC_subset_umls[~(HC_subset_umls['SAB'].isin(['OMIM']))]

I then repeat this for the other 'SAB' values, such as 'MDR'. However, I get the following error:

IndexError: index 0 is out of bounds for axis 0 with size 0

Any help is highly appreciated.
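
A note on the traceback: the lambda runs once per CUI group, and for a group with no OMIM row (C0004114 in the sample) the boolean mask selects nothing, so `.values[0]` indexes an empty array. The sketch below shows one way to guard that lookup on a cut-down copy of the sample frame. The helper name `first_code_for` is hypothetical and not part of the original post.

```python
import numpy as np
import pandas as pd

# Cut-down copy of the sample frame from the question.
df = pd.DataFrame({
    "CUI": ["C0079731", "C0079731", "C0004114", "C0004114"],
    "CODE": ["MTHU019696", "10003899", "10003571", "D001254"],
    "SAB": ["OMIM", "MDR", "MDR", "MSH"],
})


def first_code_for(group, sab):
    """Return the first CODE for `sab` in this group, or NaN if there is none."""
    codes = group.loc[group["SAB"] == sab, "CODE"]
    # An empty selection is what made `.values[0]` raise IndexError.
    return codes.iloc[0] if not codes.empty else np.nan


# Build a CUI -> CODE lookup by iterating over the groups explicitly.
omim_by_cui = {cui: first_code_for(g, "OMIM") for cui, g in df.groupby("CUI")}
df["OMIM_CODE"] = df["CUI"].map(omim_by_cui)
print(df)
```

This only removes the error. It still needs one block per SAB and per column, which is why a reshape is usually the shorter route.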

User response:

Try using groupby and unstack, then flatten the MultiIndex column headers:

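# df is the question's HC_subset_umls DataFrame
df_out = (df.groupby(['term', 'code', 'source', 'term_normlz', 'CUI', 'SAB'])
            .first()                   # one row per (key columns, SAB)
            .unstack()                 # move SAB into the column headers
            .swaplevel(0, 1, axis=1))  # put SAB ahead of CODE/TTY/STR
# Flatten the two-level headers into names such as 'MDR_CODE'
df_out.columns = df_out.columns.map('_'.join)
df_out.reset_index()  # assign the result if you want to keep the flat frame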

Output:

    term             code  source      term_normlz       CUI  MDR_CODE MSH_CODE   OMIM_CODE MDR_TTY MSH_TTY OMIM_TTY          MDR_STR      MSH_STR         OMIM_STR
0      Astrocytoma  meddra:10003571  meddra      astrocytoma  C0004114  10003571  D001254         NaN      PT      MH      NaN      astrocytoma  astrocytoma              NaN
1  B-cell lymphoma  meddra:10003899  meddra  b-cell lymphoma  C0079731  10003899      NaN  MTHU019696      PT     NaN     PTCS  b-cell lymphoma          NaN  b-cell lymphoma
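
For anyone who wants to run this end to end, here is a self-contained sketch that rebuilds the four sample rows from the question and applies the same groupby/unstack steps. The names `keys` and `wide` and the extra `sort_index` call are my additions, not part of the answer above. The sort is only there to keep each source's columns next to each other, as in the desired output.

```python
import pandas as pd

# The four sample rows from the question.
df = pd.DataFrame({
    "term": ["B-cell lymphoma", "B-cell lymphoma", "Astrocytoma", "Astrocytoma"],
    "code": ["meddra:10003899", "meddra:10003899",
             "meddra:10003571", "meddra:10003571"],
    "source": ["meddra"] * 4,
    "term_normlz": ["b-cell lymphoma", "b-cell lymphoma",
                    "astrocytoma", "astrocytoma"],
    "CUI": ["C0079731", "C0079731", "C0004114", "C0004114"],
    "CODE": ["MTHU019696", "10003899", "10003571", "D001254"],
    "SAB": ["OMIM", "MDR", "MDR", "MSH"],
    "TTY": ["PTCS", "PT", "PT", "MH"],
    "STR": ["b-cell lymphoma", "b-cell lymphoma", "astrocytoma", "astrocytoma"],
})

# Columns that identify one output row.
keys = ["term", "code", "source", "term_normlz", "CUI"]

wide = (
    df.groupby(keys + ["SAB"])
      .first()                      # one row per (keys, SAB)
      .unstack("SAB")               # SAB values become a column level
      .swaplevel(0, 1, axis=1)      # SAB first, then CODE/TTY/STR
      .sort_index(axis=1, level=0)  # group each source's columns together
)
wide.columns = wide.columns.map("_".join)  # e.g. ('MDR', 'CODE') -> 'MDR_CODE'
wide = wide.reset_index()
print(wide)
```

Sources that are missing for a CUI come out as NaN, so no per-source guard is needed.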