Home > Blockchain >  How do you create a dummy variable based on the presence of a string by date, but on a different row
How do you create a dummy variable based on the presence of a string by date, but on a different row

Time:08-13

For context, I am trying to measure using regression whether the presence of a competitor ad in a given week affects the metrics of an ad. I am not sure how to consolidate the weeks , or assign Boolean values (1 or 0) based on the presence of a brand during a week, but on a different row.

import pandas as pd

df = pd.DataFrame({'week': ['2019-11-11', '2019-11-11', '2019-11-18', '2019-11-25', '2019-11-11', '2019-11-18', '2019-11-11'],
                   'brand':['X', 'X-2', 'X', 'X', 'Y', 'Y', 'Z'],
                   'score': [.34, .25, .54, .23, .22, .34, .44]}) 

Desired result:

df = pd.DataFrame({'week': ['2019-11-11', '2019-11-11', '2019-11-18', '2019-11-25', '2019-11-11', '2019-11-18', '2019-11-11'],
                   'brand':['X', 'X-2', 'X', 'X', 'Y', 'Y', 'Z'],
                   'score': [.34, .25, .54, .23, .22, .34, .44],
                   'presence_dummy_Y': [1, 1, 1, 0, 1, 1, 1],
                  'presence_dummy_Z': [1, 1, 0, 0, 1, 0, 1]}) 

CodePudding user response:

You can get_dummies, filter the competitors with loc/filter, get 1 if there is at least a 1 in the week with GroupBy.max:

df.join(pd
 .get_dummies(df['brand'])  # transform to dummies
 .filter(regex='^(?!X)')  # keep only brands not starting with X
 .groupby(pd.to_datetime(df['week']).dt.to_period('W')) # groupby week
 .transform('max')      # 1 if at least a 1
 .add_prefix('dummy_')  # rename columns
)

output:

         week brand  score  dummy_Y  dummy_Z
0  2019-11-11     X   0.34        1        1
1  2019-11-11   X-2   0.25        1        1
2  2019-11-18     X   0.54        1        0
3  2019-11-25     X   0.23        0        0
4  2019-11-11     Y   0.22        1        1
5  2019-11-18     Y   0.34        1        0
6  2019-11-11     Z   0.44        1        1

CodePudding user response:

Let's try

out = df.join(df['brand'].str.get_dummies()
              .groupby(df['week']).transform('any').astype(int)
              .pipe(lambda df: df.filter(regex='Y|Z'))
              .add_prefix('presence_dummy_'))
print(out)

         week brand  score  presence_dummy_Y  presence_dummy_Z
0  2019-11-11     X   0.34                 1                 1
1  2019-11-11   X-2   0.25                 1                 1
2  2019-11-18     X   0.54                 1                 0
3  2019-11-25     X   0.23                 0                 0
4  2019-11-11     Y   0.22                 1                 1
5  2019-11-18     Y   0.34                 1                 0
6  2019-11-11     Z   0.44                 1                 1
  • Related