Home > Net >  dataframe to presence/absence dataframe with 2 columns as comma separated strings
dataframe to presence/absence dataframe with 2 columns as comma separated strings

Time:10-28

I have a dataframe with 3 columns, the first one (annotations) is what I want to measure presence absence on and the latter two columns (categories, CUI_desc) contain comma separated lists of factors that I would like to become columns for a presence/absence dataframe

Currently, the data looks like this:

annotations                        categories                                   CUI_desc
heroine                            ['heroic', 'opioid', 'substance_abuse']      ['C0011892___heroin']
heroin                             ['heroic', 'opioid', 'substance_abuse']      ['C0011892___heroin']
he smoked two packs a day          ['opioid', 'substance_abuse']                ['C0439234___year', 'C0748223___QUIT', 'C0028040___nicotine']

And I would like it to look like this:

annotations               heroic   opioid   substance_abuse   COO1892___heroin   CO439234___year   CO748223___QUIT   COO22840___nicotine
heroine                   1        1        1                 1                  0                 0                 0
heroin                    1        1        1                 1                  0                 0                 0
he smoked two packs a day 0        1        1                 0                  1                 1                 1

I used this line of code from a similar question:

from collections import Counter
test = pd.DataFrame({k:Counter(v) for k, v in master.items()}).T.fillna(0).astype(int)

But got an undesired output:

                  heroine         heroin         he smoked two packs a day
annotations       1               1              1
categories        0               0              0
CUI_desc          0               0              0

It seems to be counting how many times a certain annotations shows up in my dataframe. This is likely because the above block of code is for a dictionary and not a dataframe.

CodePudding user response:

Edit: OP clarified that each cell is a string so we need to convert it into a list first before calling explode.

Assuming the index is unique:

from ast import literal_eval

categories = pd.get_dummies(master['categories'].apply(literal_eval).explode()).groupby(level=0).sum()
cui_desc = pd.get_dummies(master['CUI_desc'].apply(literal_eval).explode()).groupby(level=0).sum()
pd.concat([master['annotations'], categories, cui_desc], axis=1)

Output:

              annotations  heroic  opioid  substance_abuse  C0011892___heroin  C0028040___nicotine  C0439234___year  C0748223___QUIT
                  heroine       1       1                1                  1                    0                0                0
                   heroin       1       1                1                  1                    0                0                0
he smoked two packs a day       0       1                1                  0                    1                1                1

CodePudding user response:

Here is another approach using Series.value_counts

import ast 

def row_value_counts(row):
    return row.apply(ast.literal_eval).explode().value_counts()

test = (
    df.set_index("annotations")
      .apply(row_value_counts, axis=1)
      .fillna(0)
      .astype(int)
      .reset_index()
)   
  • Related