I am using pd.get_dummies on this example dataframe below- and it's working properly but I want to see if anyone has an idea of how to alter the results. I'll describe below:
Original DF
ID type
AA23 A
AB24 B
DJ44 B
KD33 C
KD33 A
BK89 B
BL92 B
BL92 C
IO89 A
df after applying: pd.get_dummies(df, columns = ['type'],prefix = 'type')
ID type_A type_B type_C
AA23 1 0 0
AB24 0 1 0
DJ44 0 1 0
KD33 0 0 1
KD33 1 0 0
BK89 0 1 0
BL92 0 1 0
BL92 0 0 1
IO89 0 0 0
What I'm looking to for is similar but, for cases where there are 2 or more ID's (i.e. KD33 or BL92), I want just one line per ID and the associated type columns marked with 1. For example with ID = KD33, I want one line where 'type_A' and 'type_C' have 1.
ID type_A type_B type_C
AA23 1 0 0
AB24 0 1 0
DJ44 0 1 0
KD33 1 0 1
BK89 0 1 0
BL92 0 1 1
IO89 0 0 0
CodePudding user response:
One option is to just do the whole thing with a .groupby()
:
In [36]: df.groupby(["ID", "type"]).agg(lambda x: 1).unstack().fillna(0).astype(int).add_prefix("type_")
Out[36]:
type type_A type_B type_C
ID
AA23 1 0 0
AB24 0 1 0
BK89 0 1 0
BL92 0 1 1
DJ44 0 1 0
IO89 1 0 0
KD33 1 0 1
You can also just tack the .groupby
on to the end of the get_dummies
version:
In [37]: pd.get_dummies(df, columns = ['type'],prefix = 'type').groupby("ID").sum()
Out[37]:
type_A type_B type_C
ID
AA23 1 0 0
AB24 0 1 0
BK89 0 1 0
BL92 0 1 1
DJ44 0 1 0
IO89 1 0 0
KD33 1 0 1
On this small example, the first version is slightly faster but needs more massaging to get the format the same:
In [48]: %timeit df.groupby(["ID", "type"]).agg(lambda x: 1).unstack().fillna(0).astype(int).add_prefix("type_")
1.3 ms ± 8.61 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
In [49]: %timeit pd.get_dummies(df, columns = ['type'],prefix = 'type').groupby("ID").sum()
1.66 ms ± 1.48 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
CodePudding user response:
You can use groupby.size
with unstack
:
print (df.groupby(["ID", "type"]).size().unstack(fill_value=0))
type A B C
ID
AA23 1 0 0
AB24 0 1 0
BK89 0 1 0
BL92 0 1 1
DJ44 0 1 0
IO89 1 0 0
KD33 1 0 1