I have data containing a list of topics (topics 1-5; and 0 meaning no topic is assigned) and their value. I want to create a new column for each topic and fill the column with the value. Here's what the table looks like...
reviewId topic value
01 2 -4
02 2 9
03 0 -7
04 5 -1
05 1 38
What should I do to create a table looking like this?
reviewId | topic | value | t1 | t2 | t3 | t4 | t5 |
---|---|---|---|---|---|---|---|
01 | 2 | -4 | 0 | -4 | 0 | 0 | 0 |
02 | 2 | 9 | 0 | 9 | 0 | 0 | 0 |
03 | 0 | -7 | 0 | 0 | 0 | 0 | 0 |
04 | 5 | -1 | 0 | 0 | 0 | 0 | -1 |
05 | 1 | 38 | 38 | 0 | 0 | 0 | 0 |
Here's reproducible data to work on.
raw_df = pd.DataFrame({
'reviewId': ['01', '02', '03', '04', '05'],
'topic': [2, 2, 0, 5, 1],
'value': ['-4', '9', '-7', '-1', '38']})
Here's a link to a similar problem with this one.
CodePudding user response:
df.join(pd.get_dummies(df["topic"])
.reindex(columns=range(1, 5 1), fill_value=0)
.mul(df["value"], axis="rows")
.add_prefix("t"))
- get "1-hot" representation of the "topic" column
- some topics may be missing, e.g., 3, 4 in the sample data, so reindex to guarantee that
- and put zeros to those topics values as they were missing
- now we have a 1/0 dataframe of shape (len(df), len(topics))
- we multiply this with the "value" column row-wise
- 1/0's will act as a "selector" of values
- add "t" from left to the column names for "t1" ... "t5"
and join it with the original frame to get
reviewId topic value t1 t2 t3 t4 t5
0 01 2 -4 0 -4 0 0 0
1 02 2 9 0 9 0 0 0
2 03 0 -7 0 0 0 0 0
3 04 5 -1 0 0 0 0 -1
4 05 1 38 38 0 0 0 0
CodePudding user response:
So A way I figured it out(Might not be the most optimal way is )
First you get all the topics in your original df.
new_df = pd.get_dummies(raw_df['topic'])
The output will be
0 1 2 5
0 0 0 1 0
1 0 0 1 0
2 1 0 0 0
3 0 0 0 1
4 0 1 0 0
Where 0 1 2 5 are your t0, t1, t2, t5 based on unique values in topic.
Now you can set all values to 0 initially.
> Blockquote
new_df.loc[:] = 0
Then you can combine both dfs. So
new_df2 = pd.concat([raw_df, new_df], axis=1)
This will concatenate based on columns.
Output will be
reviewId topic value 0 1 2 5
0 01 2 -4 0 0 0 0
1 02 2 9 0 0 0 0
2 03 0 -7 0 0 0 0
3 04 5 -1 0 0 0 0
4 05 1 38 0 0 0 0
Now you can simply loop and update the values.
for i, x in enumerate(new_df2['topic'].values):
print(i, x)
new_df2.loc[i, int(x)] = new_df2.loc[i, "value"]
So the complete code will be
In [63]: new_df = pd.get_dummies(raw_df['topic'])
...: new_df.loc[:] = 0
...: new_df2 = pd.concat([raw_df, new_df], axis=1)
...: for i, x in enumerate(new_df2['topic'].values):
...: new_df2.loc[i, int(x)] = new_df2.loc[i, "value"]
Hope it helps a bit