I have the following df:
id step1 step2 step3 step4 .... stepn-1, stepn, event
1 a b c null null null 1
2 b d f null null null 0
3 a d g h l m 1
Where the id is a session, the steps represent a certain path, and event is whether something specific happened
I want to create a feature store where we take all the possible steps (a, b, c, ... all the way to some arbitrary number) and make them the columns. Then I want the x-column to remain the id
and it just fill a 1 or zero if that session hit that step in the column. The result is below:
id a b c d e f g ... n event
1 1 1 1 0 0 0 0 0 1
2 0 1 0 0 0 1 0 0 0
3 1 0 0 1 0 0 1 1 1
I have a unique list of all the possible steps which I assume will be used to construct the new table. But after that I am struggling thinking how to create this.
CodePudding user response:
What you are looking for is often used in machine learning, and is called one-hot encoding.
There is a pandas function specifically designed for this purpose, called pd.get_dummies()
.
step_cols = [c for c in df.columns if c.startswith('step')]
other_cols = [c for c in df.columns if not c.startswith('step')]
new_df = pd.get_dummies(df[step_cols], prefix='', prefix_sep='')
new_df[other_cols] = df[other_cols]
Output:
>>> new_df
a b b d c f g h l m id event
0 1 0 1 0 1 0 0 0 0 0 1 1
1 0 1 0 1 0 1 0 0 0 0 2 0
2 1 0 0 1 0 0 1 1 1 1 3 1
CodePudding user response:
Probably not the most elegant way:
step_cols = [col for col in df.columns if col.startswith("step")]
values = pd.Series(sorted(set(df[step_cols].melt().value.dropna())))
df1 = pd.DataFrame(
(values.isin(row).to_list() for row in zip(*(df[col] for col in step_cols))),
columns=values
).astype(int)
df = pd.concat([df.id, df1, df.event], axis=1)
Result for
df =
id step1 step2 step3 step4 event
0 1 a b c NaN 1
1 2 b d f NaN 0
2 3 a d g h 1
is
id a b c d f g h event
0 1 1 1 1 0 0 0 0 1
1 2 0 1 0 1 1 0 0 0
2 3 1 0 0 1 0 1 1 1