Home > Back-end >  Create a feature table in Python from a df
Create a feature table in Python from a df

Time:11-17

I have the following df:

 id    step1 step2 step3 step4 .... stepn-1, stepn, event 
  1       a     b    c     null         null  null    1
  2       b     d    f     null         null  null    0
  3       a     d    g      h             l    m      1      

Where the id is a session, the steps represent a certain path, and event is whether something specific happened

I want to create a feature store where we take all the possible steps (a, b, c, ... all the way to some arbitrary number) and make them the columns. Then I want the x-column to remain the id and it just fill a 1 or zero if that session hit that step in the column. The result is below:

id  a  b  c  d  e  f  g ... n event
 1  1  1  1  0  0  0  0     0   1
 2  0  1  0  0  0  1  0     0   0
 3  1  0  0  1  0  0  1     1   1

I have a unique list of all the possible steps which I assume will be used to construct the new table. But after that I am struggling thinking how to create this.

CodePudding user response:

What you are looking for is often used in machine learning, and is called one-hot encoding.

There is a pandas function specifically designed for this purpose, called pd.get_dummies().

step_cols = [c for c in df.columns if c.startswith('step')]
other_cols = [c for c in df.columns if not c.startswith('step')]

new_df = pd.get_dummies(df[step_cols], prefix='', prefix_sep='')
new_df[other_cols] = df[other_cols]

Output:

>>> new_df
   a  b  b  d  c  f  g  h  l  m  id  event
0  1  0  1  0  1  0  0  0  0  0   1      1
1  0  1  0  1  0  1  0  0  0  0   2      0
2  1  0  0  1  0  0  1  1  1  1   3      1

CodePudding user response:

Probably not the most elegant way:

step_cols = [col for col in df.columns if col.startswith("step")]
values = pd.Series(sorted(set(df[step_cols].melt().value.dropna())))
df1 = pd.DataFrame(
    (values.isin(row).to_list() for row in zip(*(df[col] for col in step_cols))),
    columns=values
).astype(int)
df = pd.concat([df.id, df1, df.event], axis=1)

Result for

df =
   id step1 step2 step3 step4  event
0   1     a     b     c   NaN      1
1   2     b     d     f   NaN      0
2   3     a     d     g     h      1

is

   id  a  b  c  d  f  g  h  event
0   1  1  1  1  0  0  0  0      1
1   2  0  1  0  1  1  0  0      0
2   3  1  0  0  1  0  1  1      1
  • Related