I have a DataFrame of online reviews. Each review has been assigned a topic (1-5, with 0 meaning no topic is assigned) and a label (positive or negative). I want to create a dummy variable for each combination of topic and label. This is what my data looks like:
reviewId | topic | label |
---|---|---|
01 | 2 | negative |
02 | 2 | positive |
03 | 0 | negative |
04 | 5 | negative |
05 | 1 | positive |
What should I do to make my data look like this? (1 meaning assigned, 0 meaning not assigned)
reviewId | topic | label | T1pos | T1neg | T2pos | T2neg | T3pos | T3neg | T4pos | T4neg | T5pos | T5neg |
---|---|---|---|---|---|---|---|---|---|---|---|---|
01 | 2 | negative | 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 |
02 | 2 | positive | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
03 | 0 | negative | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
04 | 5 | negative | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
05 | 1 | positive | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
CodePudding user response:
You can create your own encoding by mapping each (topic, label) pair to a power of two and then reading off its binary representation:
import numpy as np
import pandas as pd
# I used 'p' as 'pos' and 'n' as 'neg' to save space
MAX_TOPIC = df['topic'].max()
mi = pd.MultiIndex.from_product([range(1, MAX_TOPIC + 1), ['p', 'n']])
mi = [f'T{t}{l}' for t, l in mi]
# Bit position is topic*2, plus 1 if the label is negative
# >> 2 to remove T0n and T0p
num = np.array(2**(df['topic']*2 + df['label'].eq('negative'))) >> 2
# Test each bit of num against the masks 1, 2, 4, ...
hot = ((num[:, None] & (1 << np.arange(MAX_TOPIC*2))) > 0).astype(int)
out = pd.concat([df, pd.DataFrame(hot, columns=mi, index=df.index)], axis=1)
Output:
>>> out
   reviewId  topic     label  T1p  T1n  T2p  T2n  T3p  T3n  T4p  T4n  T5p  T5n
0         1      2  negative    0    0    0    1    0    0    0    0    0    0
1         2      2  positive    0    0    1    0    0    0    0    0    0    0
2         3      0  negative    0    0    0    0    0    0    0    0    0    0
3         4      5  negative    0    0    0    0    0    0    0    0    0    1
4         5      1  positive    1    0    0    0    0    0    0    0    0    0
>>> num
array([  8,   4,   0, 512,   1])
The binary representation trick comes from the question "Convert integer to binary array with suitable padding".
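To see the masking step in isolation, here is a minimal sketch. The helper name `bits` is hypothetical (it is not part of the answer above); it only wraps the same `&` and `1 << np.arange(...)` expression so it can be tried on a few numbers.

```python
import numpy as np

# Hypothetical helper; it mirrors the masking expression used in the answer.
def bits(values, width):
    """Return a (len(values), width) array of 0/1, least significant bit first."""
    values = np.asarray(values)
    masks = 1 << np.arange(width)  # [1, 2, 4, 8, ...]
    # Broadcasting compares every value against every mask.
    return ((values[:, None] & masks) > 0).astype(int)

# 8 == 0b1000, so only position 3 is set (the fourth column, T2n in the answer).
print(bits([8, 4, 0], width=4))
```

Each row has at most one bit set because every review contributes exactly one power of two, and topic 0 is shifted away to 0.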
CodePudding user response:
Someone can probably come up with a more elegant solution, but this works:
import numpy as np
import pandas as pd
# recreate your DataFrame:
df = pd.DataFrame({
    'reviewid': ['01', '02', '03', '04', '05'],
    'topic': [2, 2, 0, 5, 1],
    'label': ['neg', 'pos', 'neg', 'neg', 'pos']})
# Add dummy columns initialized to 0:
dummies = [
    f'T{t}{lab}' for t in sorted(df.topic.unique()) if t != 0
    for lab in sorted(df.label.unique())]
dummy_df = pd.DataFrame(
    np.zeros((len(df), len(dummies)), dtype=int),
    columns=dummies,
    index=df.index)
df = pd.concat([df, dummy_df], axis=1)
# Fill in the dummy columns (iterate over index labels so that
# df.loc also works when the index is not 0..n-1)
for i, t, lab in zip(df.index, df.topic, df.label):
    if t != 0:
        df.loc[i, f'T{t}{lab}'] = 1
df # view result
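For comparison, a loop-free variant can be sketched with `pd.get_dummies`. This is a different technique from the loop above, not the answerer's method, and it assumes the column names from the question (`reviewId`, `topic`, `label`), labels spelled `positive`/`negative`, and topics fixed at 1-5:

```python
import pandas as pd

df = pd.DataFrame({
    'reviewId': ['01', '02', '03', '04', '05'],
    'topic': [2, 2, 0, 5, 1],
    'label': ['negative', 'positive', 'negative', 'negative', 'positive'],
})

# Combined key such as 'T2neg'; rows with topic 0 get NaN instead of a key.
key = ('T' + df['topic'].astype(str) + df['label'].str[:3]).where(df['topic'] != 0)

# List all ten expected columns so that topics absent from the data
# (3 and 4 here) still get a column of zeros. The 1-5 range is an assumption.
columns = [f'T{t}{lab}' for t in range(1, 6) for lab in ('pos', 'neg')]

# get_dummies skips NaN keys by default, so topic-0 rows stay all zero.
dummies = pd.get_dummies(key, dtype=int).reindex(columns=columns, fill_value=0)
out = pd.concat([df, dummies], axis=1)
print(out)
```

Unlike the loop, this also creates the T3 and T4 columns shown in the desired output, because the column list is written out rather than derived from the topics present in the data.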