I have a dataframe with 3 columns, the first one (annotations) is what I want to measure presence absence on and the latter two columns (categories, CUI_desc) contain comma separated lists of factors that I would like to become columns for a presence/absence dataframe
Currently, the data looks like this:
annotations categories CUI_desc
heroine ['heroic', 'opioid', 'substance_abuse'] ['C0011892___heroin']
heroin ['heroic', 'opioid', 'substance_abuse'] ['C0011892___heroin']
he smoked two packs a day ['opioid', 'substance_abuse'] ['C0439234___year', 'C0748223___QUIT', 'C0028040___nicotine']
And I would like it to look like this:
annotations heroic opioid substance_abuse COO1892___heroin CO439234___year CO748223___QUIT COO22840___nicotine
heroine 1 1 1 1 0 0 0
heroin 1 1 1 1 0 0 0
he smoked two packs a day 0 1 1 0 1 1 1
I used this line of code from a similar question:
from collections import Counter
test = pd.DataFrame({k:Counter(v) for k, v in master.items()}).T.fillna(0).astype(int)
But got an undesired output:
heroine heroin he smoked two packs a day
annotations 1 1 1
categories 0 0 0
CUI_desc 0 0 0
It seems to be counting how many times a certain annotations shows up in my dataframe. This is likely because the above block of code is for a dictionary and not a dataframe.
CodePudding user response:
Edit: OP clarified that each cell is a string so we need to convert it into a list first before calling explode
.
Assuming the index is unique:
from ast import literal_eval
categories = pd.get_dummies(master['categories'].apply(literal_eval).explode()).groupby(level=0).sum()
cui_desc = pd.get_dummies(master['CUI_desc'].apply(literal_eval).explode()).groupby(level=0).sum()
pd.concat([master['annotations'], categories, cui_desc], axis=1)
Output:
annotations heroic opioid substance_abuse C0011892___heroin C0028040___nicotine C0439234___year C0748223___QUIT
heroine 1 1 1 1 0 0 0
heroin 1 1 1 1 0 0 0
he smoked two packs a day 0 1 1 0 1 1 1
CodePudding user response:
Here is another approach using Series.value_counts
import ast
def row_value_counts(row):
return row.apply(ast.literal_eval).explode().value_counts()
test = (
df.set_index("annotations")
.apply(row_value_counts, axis=1)
.fillna(0)
.astype(int)
.reset_index()
)