Home > Enterprise >  Create a dummy column based on a different column
Create a dummy column based on a different column

Time:08-19

I have panel data and want to create a column "active trader" for each ID for each period, if the ID has at least traded once per quarter consecutively

current df

ID  date   trading   
A   2020Q1 4         
A   2020Q2 5         
A   2020Q3 0         
A   2020Q4 2         
A   2021Q1 1         
B   2019Q1 0         
B   2019Q2 1         
B   2019Q3 2         
C   2021Q1 3         
C   2021Q2 3         
C   2021Q3 4         
C   2021Q4 0         
...

desired

ID  date   trading   active
A   2020Q1 4         1
A   2020Q2 5         1
A   2020Q3 0         0
A   2020Q4 2         0
A   2021Q1 1         0
B   2019Q1 0         0
B   2019Q2 1         0
B   2019Q3 2         0
C   2021Q1 3         1
C   2021Q2 3         1
C   2021Q3 4         1
C   2021Q4 0         0
...

CodePudding user response:

You could try as follows:

import pandas as pd
import numpy as np

data  = {'ID': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
         'date': ['2020Q1','2020Q2','2020Q3','2020Q4','2021Q1','2019Q1','2019Q2','2019Q3','2021Q1','2021Q2','2021Q3','2021Q4'],
         'trading': [4, 5, 0, 2, 1, 0, 1, 2, 3, 3, 4, 0],
         'active': [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]}
df = pd.DataFrame(data)

df_desired = df.copy()
df_desired.drop('active', inplace=True, axis=1)

df_desired['active'] = df_desired.groupby(['ID'])['trading'].cummin().gt(0).astype(int)

# there's a difference in dtype (int64 -> np.int32)
df['active'] = df_desired['active'].astype(np.int32)

# check if result matches desired output:
df.equals(df_desired) # True

Explanation. df.cummin can be used to return cumulative minimum for the traders within each group:

print(df_desired.groupby(['ID'])['trading'].cummin())

0     4
1     4
2     0
3     0
4     0
5     0
6     0
7     0
8     3
9     3
10    3
11    0
Name: trading, dtype: int64

So, this is a quick way to fill down everything with 0, as soon as we hit the first one. Next, we simply check larger than 0, and convert the resulting pd.Series with True/False to 1/0 using .astype(int). So, the final result becomes:

print(df_desired.groupby(['ID'])['trading'].cummin().gt(0).astype(int))

0     1
1     1
2     0
3     0
4     0
5     0
6     0
7     0
8     1
9     1
10    1
11    0
Name: trading, dtype: int32

CodePudding user response:

If I understood correctly, the periods are consecutive. Then this works:

import pandas as pd
dataframe = pd.DataFrame({"ID": ["A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
                          "date": ["2020Q1", "2020Q2", "2020Q3", "2020Q4", "2021Q1", "2019Q1", "2019Q2", "2019Q3", "2021Q1", "2021Q2", "2021Q3", "2021Q4"],
                          "trading": [4, 5, 0, 2, 1, 0, 1, 2, 3, 3, 4, 0]})

dataframe = dataframe.sort_values(["ID", "date"])

active = []
for index, row in dataframe.iterrows():
    # 0 means no active trading
    active_trading = 0
    id_dataframe = dataframe.loc[dataframe.ID==row.ID]
    if all(id_dataframe.loc[id_dataframe.index <= index].trading > 0):
        # 1 means active trading
        active_trading = 1
    active.append(active_trading)
dataframe["active"] = active
dataframe

Gives


ID  date    trading active
A   2020Q1  4       1
A   2020Q2  5       1
A   2020Q3  0       0
A   2020Q4  2       0
A   2021Q1  1       0
B   2019Q1  0       0
B   2019Q2  1       0
B   2019Q3  2       0
C   2021Q1  3       1
C   2021Q2  3       1
C   2021Q3  4       1
C   2021Q4  0       0
  • Related