Home > Blockchain >  Create dummy column and input value from other column
Create dummy column and input value from other column

Time:01-16

I have data containing a list of topics (topics 1-5; and 0 meaning no topic is assigned) and their value. I want to create a new column for each topic and fill the column with the value. Here's what the table looks like...

reviewId   topic   value
      01       2      -4
      02       2       9
      03       0      -7
      04       5      -1
      05       1      38

What should I do to create a table looking like this?

reviewId topic value t1 t2 t3 t4 t5
01 2 -4 0 -4 0 0 0
02 2 9 0 9 0 0 0
03 0 -7 0 0 0 0 0
04 5 -1 0 0 0 0 -1
05 1 38 38 0 0 0 0

Here's reproducible data to work on.

raw_df = pd.DataFrame({
    'reviewId': ['01', '02', '03', '04', '05'],
    'topic': [2, 2, 0, 5, 1],
    'value': ['-4', '9', '-7', '-1', '38']})

Here's a link to a similar problem with this one.

CodePudding user response:

df.join(pd.get_dummies(df["topic"])
          .reindex(columns=range(1, 5 1), fill_value=0)
          .mul(df["value"], axis="rows")
          .add_prefix("t"))
  • get "1-hot" representation of the "topic" column
  • some topics may be missing, e.g., 3, 4 in the sample data, so reindex to guarantee that
    • and put zeros to those topics values as they were missing
  • now we have a 1/0 dataframe of shape (len(df), len(topics))
    • we multiply this with the "value" column row-wise
    • 1/0's will act as a "selector" of values
  • add "t" from left to the column names for "t1" ... "t5"

and join it with the original frame to get

  reviewId  topic  value  t1  t2  t3  t4  t5
0       01      2     -4   0  -4   0   0   0
1       02      2      9   0   9   0   0   0
2       03      0     -7   0   0   0   0   0
3       04      5     -1   0   0   0   0  -1
4       05      1     38  38   0   0   0   0

CodePudding user response:

So A way I figured it out(Might not be the most optimal way is )

First you get all the topics in your original df.

new_df = pd.get_dummies(raw_df['topic'])

The output will be

   0  1  2  5
0  0  0  1  0
1  0  0  1  0
2  1  0  0  0
3  0  0  0  1
4  0  1  0  0

Where 0 1 2 5 are your t0, t1, t2, t5 based on unique values in topic.

Now you can set all values to 0 initially.


> Blockquote

new_df.loc[:] = 0

Then you can combine both dfs. So

new_df2 = pd.concat([raw_df, new_df], axis=1)

This will concatenate based on columns.

Output will be

  reviewId  topic value  0  1  2  5
0       01      2    -4  0  0  0  0
1       02      2     9  0  0  0  0
2       03      0    -7  0  0  0  0
3       04      5    -1  0  0  0  0
4       05      1    38  0  0  0  0

Now you can simply loop and update the values.

for i, x in enumerate(new_df2['topic'].values):
     print(i, x)
     new_df2.loc[i, int(x)] = new_df2.loc[i, "value"]

So the complete code will be

In [63]: new_df = pd.get_dummies(raw_df['topic'])
    ...: new_df.loc[:] = 0
    ...: new_df2 = pd.concat([raw_df, new_df], axis=1)
    ...: for i, x in enumerate(new_df2['topic'].values):
    ...:      new_df2.loc[i, int(x)] = new_df2.loc[i, "value"]

Hope it helps a bit

  • Related