I have panel data and want to create a column "active trader" for each ID for each period, if the ID has at least traded once per quarter consecutively
current df
ID date trading
A 2020Q1 4
A 2020Q2 5
A 2020Q3 0
A 2020Q4 2
A 2021Q1 1
B 2019Q1 0
B 2019Q2 1
B 2019Q3 2
C 2021Q1 3
C 2021Q2 3
C 2021Q3 4
C 2021Q4 0
...
desired
ID date trading active
A 2020Q1 4 1
A 2020Q2 5 1
A 2020Q3 0 0
A 2020Q4 2 0
A 2021Q1 1 0
B 2019Q1 0 0
B 2019Q2 1 0
B 2019Q3 2 0
C 2021Q1 3 1
C 2021Q2 3 1
C 2021Q3 4 1
C 2021Q4 0 0
...
CodePudding user response:
You could try as follows:
import pandas as pd
import numpy as np
data = {'ID': ['A', 'A', 'A', 'A', 'A', 'B', 'B', 'B', 'C', 'C', 'C', 'C'],
'date': ['2020Q1','2020Q2','2020Q3','2020Q4','2021Q1','2019Q1','2019Q2','2019Q3','2021Q1','2021Q2','2021Q3','2021Q4'],
'trading': [4, 5, 0, 2, 1, 0, 1, 2, 3, 3, 4, 0],
'active': [1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0]}
df = pd.DataFrame(data)
df_desired = df.copy()
df_desired.drop('active', inplace=True, axis=1)
df_desired['active'] = df_desired.groupby(['ID'])['trading'].cummin().gt(0).astype(int)
# there's a difference in dtype (int64 -> np.int32)
df['active'] = df_desired['active'].astype(np.int32)
# check if result matches desired output:
df.equals(df_desired) # True
Explanation. df.cummin
can be used to return cumulative minimum for the traders within each group:
print(df_desired.groupby(['ID'])['trading'].cummin())
0 4
1 4
2 0
3 0
4 0
5 0
6 0
7 0
8 3
9 3
10 3
11 0
Name: trading, dtype: int64
So, this is a quick way to fill down everything with 0
, as soon as we hit the first one. Next, we simply check larger than 0
, and convert the resulting pd.Series
with True/False
to 1/0
using .astype(int)
. So, the final result becomes:
print(df_desired.groupby(['ID'])['trading'].cummin().gt(0).astype(int))
0 1
1 1
2 0
3 0
4 0
5 0
6 0
7 0
8 1
9 1
10 1
11 0
Name: trading, dtype: int32
CodePudding user response:
If I understood correctly, the periods are consecutive. Then this works:
import pandas as pd
dataframe = pd.DataFrame({"ID": ["A", "A", "A", "A", "A", "B", "B", "B", "C", "C", "C", "C"],
"date": ["2020Q1", "2020Q2", "2020Q3", "2020Q4", "2021Q1", "2019Q1", "2019Q2", "2019Q3", "2021Q1", "2021Q2", "2021Q3", "2021Q4"],
"trading": [4, 5, 0, 2, 1, 0, 1, 2, 3, 3, 4, 0]})
dataframe = dataframe.sort_values(["ID", "date"])
active = []
for index, row in dataframe.iterrows():
# 0 means no active trading
active_trading = 0
id_dataframe = dataframe.loc[dataframe.ID==row.ID]
if all(id_dataframe.loc[id_dataframe.index <= index].trading > 0):
# 1 means active trading
active_trading = 1
active.append(active_trading)
dataframe["active"] = active
dataframe
Gives
ID date trading active
A 2020Q1 4 1
A 2020Q2 5 1
A 2020Q3 0 0
A 2020Q4 2 0
A 2021Q1 1 0
B 2019Q1 0 0
B 2019Q2 1 0
B 2019Q3 2 0
C 2021Q1 3 1
C 2021Q2 3 1
C 2021Q3 4 1
C 2021Q4 0 0