Create dummy variable and fill based on condition

Time:01-05

I have a dataframe consisting of online reviews. I have assigned a topic (1-5, with 0 meaning no topic is assigned) and a label (positive or negative) to each review. I want to create a dummy variable for each combination of topic and label. This is what my data looks like:

reviewId topic label
01 2 negative
02 2 positive
03 0 negative
04 5 negative
05 1 positive

What should I do to make my data look like this? (1 meaning assigned, 0 meaning not assigned)

reviewId topic label T1pos T1neg T2pos T2neg T3pos T3neg T4pos T4neg T5pos T5neg
01 2 negative 0 0 0 1 0 0 0 0 0 0
02 2 positive 0 0 1 0 0 0 0 0 0 0
03 0 negative 0 0 0 0 0 0 0 0 0 0
04 5 negative 0 0 0 0 0 0 0 0 0 1
05 1 positive 1 0 0 0 0 0 0 0 0 0

CodePudding user response:

You can create your own encoding by converting the two columns to a power of two and then taking its binary representation:

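import numpy as np
import pandas as pd

# I used 'p' as 'pos' and 'n' as 'neg' to save space
MAX_TOPIC = df['topic'].max()
mi = pd.MultiIndex.from_product([range(1, MAX_TOPIC + 1), ['p', 'n']])
mi = [f'T{t}{l}' for t, l in mi]

# Each row sets one bit: bit index = topic*2 + (1 if negative else 0)
# >> 2 to remove T0n and T0p
num = np.array(2**(df['topic']*2 + df['label'].eq('negative'))) >> 2
# Test every bit position to expand each number into its row of 0/1 flags
hot = (((num[:, None] & (1 << np.arange(MAX_TOPIC*2)))) > 0).astype(int)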

out = pd.concat([df, pd.DataFrame(hot, columns=mi, index=df.index)], axis=1)

Output:

>>> out
   reviewId  topic     label  T1p  T1n  T2p  T2n  T3p  T3n  T4p  T4n  T5p  T5n
0         1      2  negative    0    0    0    1    0    0    0    0    0    0
1         2      2  positive    0    0    1    0    0    0    0    0    0    0
2         3      0  negative    0    0    0    0    0    0    0    0    0    0
3         4      5  negative    0    0    0    0    0    0    0    0    0    1
4         5      1  positive    1    0    0    0    0    0    0    0    0    0

>>> num
array([  8,   4,   0, 512,   1])

The binary representation comes from the question "Convert integer to binary array with suitable padding".
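
To make the bit arithmetic easier to follow, here is a minimal sketch that applies the same encoding to a single (topic, label) pair. The helper names `encode` and `to_bits` are made up for illustration, and `MAX_TOPIC = 5` is assumed from the question rather than computed from the data.

```python
import numpy as np

MAX_TOPIC = 5  # assumption: topics run from 1 to 5, as in the question

def encode(topic, label):
    # Two bits per topic: the positive bit first, the negative bit second.
    # Shifting right by 2 drops the two bits that topic 0 would occupy.
    return (1 << (topic * 2 + (label == 'negative'))) >> 2

def to_bits(num):
    # Test each of the MAX_TOPIC*2 bit positions, lowest bit first.
    return ((num & (1 << np.arange(MAX_TOPIC * 2))) > 0).astype(int)

num = encode(2, 'negative')   # 2**5 >> 2 == 8
bits = to_bits(num)           # only index 3 (the T2n column) is set
print(num, bits)
```

A review with topic 0 encodes to 0, so every bit stays off, which is why such rows end up with all-zero dummy columns.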

CodePudding user response:

Someone can probably come up with a more elegant solution, but this works:

import numpy as np
import pandas as pd

# recreate your DataFrame:
df = pd.DataFrame({
    'reviewid': ['01', '02', '03', '04', '05'],
    'topic': [2, 2, 0, 5, 1],
    'label': ['neg', 'pos', 'neg', 'neg', 'pos']})

# Add dummy columns initialized to 0:
dummies = [
    f'T{t}{lab}' for t in sorted(df.topic.unique()) if t != 0 
    for lab in sorted(df.label.unique())]
dummy_df = pd.DataFrame(
    np.zeros((len(df), len(dummies)), dtype=int),
    columns=dummies,
    index=df.index)
df = pd.concat([df, dummy_df], axis=1)

# Fill in the dummy columns (iterate over the actual index labels,
# so this also works when the index is not 0, 1, 2, ...)
for i, t, lab in zip(df.index, df.topic, df.label):
    if t != 0:
        df.loc[i, f'T{t}{lab}'] = 1

df  # view result
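
As a possible alternative to the explicit loop, here is a sketch that builds the column name for each row and lets `pd.get_dummies` do the filling. It assumes the topics are 1 to 5 and the labels are spelled 'positive' and 'negative', as in the question; the names `key`, `all_cols`, and `out` are chosen for illustration only. Unlike the loop above, it also emits columns for topics that never occur in the sample (3 and 4 here).

```python
import pandas as pd

df = pd.DataFrame({
    'reviewId': ['01', '02', '03', '04', '05'],
    'topic': [2, 2, 0, 5, 1],
    'label': ['negative', 'positive', 'negative', 'negative', 'positive']})

# Every column the result should have, in the order the question asks for.
all_cols = [f'T{t}{lab}' for t in range(1, 6) for lab in ('pos', 'neg')]

# Per-row column name, e.g. 'T2neg'; topic 0 rows produce 'T0neg' / 'T0pos'.
key = 'T' + df['topic'].astype(str) + df['label'].str[:3]

# One-hot encode, then reindex: this drops the T0 columns and adds
# all-zero columns for combinations that never occur.
dummies = pd.get_dummies(key, dtype=int).reindex(columns=all_cols, fill_value=0)

out = pd.concat([df, dummies], axis=1)
print(out)
```

Because `reindex` fixes the column set up front, the output shape does not depend on which topics happen to appear in the data.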