Imagine that I have a dataframe that looks like this:
import pandas as pd
df1 = pd.DataFrame(columns=['asset', 'eligible', "date"])
df1['asset'] = ['003', '001', '002', '001', '002', '001', '003', '001', '003', '004', '005', '006']
df1['eligible'] = [0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1]
df1['date'] = ['1.1.2016', '31.12.2015', '1.1.2016', '31.12.2015', '1.1.2016', '31.12.2015', '1.1.2016', '1.1.2016', '31.12.2015', '31.12.2015', '1.1.2016', '31.12.2015']
df1["date"] = pd.to_datetime(df1["date"])
df1 = df1.sort_values(by= ["asset", "date"])
print(df1)
I would like to filter by asset group all observations that take the value 1 in 2015 and the value 0 in 2016. How would you do this? In this case, I want to keep observations for the asset groups 001 and 003 and drop asset groups 002 (because asset group 002 has 0 in 2015 and 2016) and asset groups 004, 005 and 006 (because I only have observations for one of the two years). If there are duplicates, I want to keep them such as the 003 in 2016. The filtered dataframe should look like this:
df_final = pd.DataFrame(columns=['asset', 'eligible', "date"])
df_final['asset'] = ['001', '001', '001', '001', '003', '003', '003']
df_final['eligible'] = [1, 1, 0, 0, 1, 0, 0]
df_final['date'] = ['31.12.2015', '31.12.2015', '1.1.2016', '1.1.2016', '31.12.2015', '31.12.2015']
print(df_final)
CodePudding user response:
Test rows with match both conditions with numpy.intersect1d
and filter by isin
:
m1 = df1['eligible'].eq(0) & df1['date'].dt.year.eq(2016)
m2 = df1['eligible'].eq(1) & df1['date'].dt.year.eq(2015)
df1 = df1[m1 | m2]
g = np.intersect1d(df1.loc[m1,'asset'], df1.loc[m2,'asset'])
df = df1[df1['asset'].isin(g)]
print(df)
asset eligible date
1 001 1 2015-12-31
3 001 1 2015-12-31
7 001 0 2016-01-01
8 003 1 2015-12-31
0 003 0 2016-01-01
6 003 0 2016-01-01