I have the following dictionary:
d = {
    "anna": ["apple", "strawberry", "banana"],
    "bob": ["strawberry", "banana", "peach"],
    "chris": ["apple", "banana", "peach", "mango"]
}
and I want to convert it into the following pandas.DataFrame:
       apple  banana  mango  peach  strawberry
anna       1       1      0      0           1
bob        0       1      0      1           1
chris      1       1      1      1           0
This is not very complicated to implement in plain Python (see below), but I was wondering whether pandas already has something that does it automatically, or whether the implementation below can be optimized.
Thanks in advance!
Current Python implementation:
import numpy as np
import pandas as pd
d = {
    "anna": ["apple", "strawberry", "banana"],
    "bob": ["strawberry", "banana", "peach"],
    "chris": ["apple", "banana", "peach", "mango"]
}
# Wrap d.values() in a list: np.hstack expects a sequence of arrays, not a dict view
fruits = sorted(set(np.hstack(list(d.values()))))
df = pd.DataFrame(columns=fruits)
for client, client_fruits in d.items():
    # One 0/1 flag per known fruit for this client
    s = pd.Series({
        fruit: fruit in client_fruits for fruit in fruits
    }).astype(int)
    # Append the client as a new row
    df = pd.concat([df, pd.DataFrame({client: s}).T])
print(df)
Answer:
One option, using str.get_dummies:
out = pd.Series({k: '|'.join(v) for k,v in d.items()}).str.get_dummies()
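Note that this joins each list into a single string and splits it again on "|", the default separator of str.get_dummies, so it only works as long as no item contains that character. If that could happen, passing another separator through the sep parameter should be enough. A minimal sketch, where the d2 data and the ";" separator are made up for illustration and ";" is assumed never to occur in an item:

```python
import pandas as pd

# Hypothetical data: one item name contains the default "|" separator
d2 = {"anna": ["apple", "a|b"], "bob": ["a|b"]}

# Assumption: ";" never occurs inside an item name
sep = ";"

# Join with the custom separator and tell get_dummies to split on it
out2 = pd.Series({k: sep.join(v) for k, v in d2.items()}).str.get_dummies(sep=sep)
print(out2)
```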
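Or with DataFrame.from_dict and pandas.get_dummies (dtype=int keeps the result as 0/1, since recent pandas versions return booleans by default):
out = (pd.get_dummies(pd.DataFrame.from_dict(d, orient='index').stack(), dtype=int)
         .groupby(level=0).max()
      )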
Or with a crosstab:
out = pd.crosstab(*zip(*((k, v) for k, vals in d.items() for v in vals))).clip(upper=1)
Output:
       apple  banana  mango  peach  strawberry
anna       1       1      0      0           1
bob        0       1      0      1           1
chris      1       1      1      1           0
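If you want to convince yourself that the three options agree, something like the following self-contained check should work. It is only a sketch: the names a, b, c, pairs and same are mine, the crosstab inputs are unpacked into two named Series for readability (which changes the axis names, not the values), and get_dummies is called with dtype=int so that every result holds integers.

```python
import pandas as pd

d = {
    "anna": ["apple", "strawberry", "banana"],
    "bob": ["strawberry", "banana", "peach"],
    "chris": ["apple", "banana", "peach", "mango"],
}

# Option 1: join each list into one string, then split it into indicator columns
a = pd.Series({k: "|".join(v) for k, v in d.items()}).str.get_dummies()

# Option 2: one row per (client, fruit), dummy-encode, then collapse per client
b = (
    pd.get_dummies(pd.DataFrame.from_dict(d, orient="index").stack(), dtype=int)
    .groupby(level=0)
    .max()
)

# Option 3: count (client, fruit) pairs and cap the counts at 1
pairs = [(k, v) for k, vals in d.items() for v in vals]
c = pd.crosstab(
    pd.Series([k for k, _ in pairs], name="client"),
    pd.Series([v for _, v in pairs], name="fruit"),
).clip(upper=1)

# Compare labels and values, ignoring dtype and axis names
same = all(
    list(r.index) == list(a.index)
    and list(r.columns) == list(a.columns)
    and (r.astype(int).to_numpy() == a.astype(int).to_numpy()).all()
    for r in (b, c)
)
print(same)
```

The comparison goes through astype(int) on purpose, because the dtype returned by each method can differ between pandas versions.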