Selecting observations by group based on conditions in Python-CodePudding

Imagine that I have a dataframe that looks like this:

import pandas as pd

df1 = pd.DataFrame(columns=['asset', 'eligible', "date"])
df1['asset'] = ['003', '001', '002', '001', '002', '001', '003', '001', '003', '004', '005', '006']
df1['eligible'] = [0, 1, 0, 1, 0, 0, 0, 0, 1, 0, 0, 1]
df1['date'] = ['1.1.2016', '31.12.2015', '1.1.2016', '31.12.2015', '1.1.2016', '31.12.2015', '1.1.2016', '1.1.2016', '31.12.2015', '31.12.2015', '1.1.2016', '31.12.2015']

df1["date"] = pd.to_datetime(df1["date"])

df1 = df1.sort_values(by= ["asset", "date"])

print(df1)

I would like to filter by asset group all observations that take the value 1 in 2015 and the value 0 in 2016. How would you do this? In this case, I want to keep observations for the asset groups 001 and 003 and drop asset groups 002 (because asset group 002 has 0 in 2015 and 2016) and asset groups 004, 005 and 006 (because I only have observations for one of the two years). If there are duplicates, I want to keep them such as the 003 in 2016. The filtered dataframe should look like this:


df_final = pd.DataFrame(columns=['asset', 'eligible', "date"])
df_final['asset'] = ['001', '001', '001', '001', '003', '003', '003']
df_final['eligible'] = [1, 1, 0, 0, 1, 0, 0]
df_final['date'] = ['31.12.2015', '31.12.2015', '1.1.2016', '1.1.2016', '31.12.2015', '31.12.2015']

print(df_final)

CodePudding user response：

Test rows with match both conditions with numpy.intersect1d and filter by isin:

m1 = df1['eligible'].eq(0) & df1['date'].dt.year.eq(2016)
m2 = df1['eligible'].eq(1) & df1['date'].dt.year.eq(2015)

df1 = df1[m1 | m2]

g = np.intersect1d(df1.loc[m1,'asset'], df1.loc[m2,'asset'])
df = df1[df1['asset'].isin(g)]
print(df)
  asset  eligible       date
1   001         1 2015-12-31
3   001         1 2015-12-31
7   001         0 2016-01-01
8   003         1 2015-12-31
0   003         0 2016-01-01
6   003         0 2016-01-01