Home > Mobile >  Get data that occur in all specified years of column in Pandas
Get data that occur in all specified years of column in Pandas

Time:02-24

I need only those company name with year and estimate grp size those are present in all three year eg 2019, 2020, 2021

year    Company/Account Name    EstimatedGroupSize
0   2019    Unknown             19550
1   2019    Mayo Clinic         7754
2   2019    Deloitte            6432
3   2019    Rizona State        5582
4   2019    Intel Corporation   4595
5   2020    Deloitte            4063
6   2020    Unknown             3490
7   2021    Unknown             3484
8   2020    Intel Corporation   3460
9   2021    Intel Corporation   3433
10  2021    Deloitte            3250

So my output should be

year    Company/Account Name    EstimatedGroupSize
0   2019    Unknown             19550
2   2019    Deloitte            6432
4   2019    Intel Corporation   4595
5   2020    Deloitte            4063
6   2020    Unknown             3490
7   2021    Unknown             3484
8   2020    Intel Corporation   3460
9   2021    Intel Corporation   3433
10  2021    Deloitte            3250

CodePudding user response:

Here is solution for filter year with Company/Account Name if present at least one row and filter original DataFrame by inner merge:

#if need filter ony some years first
df = df[df['year'].isin([2019, 2020, 2021])]

df1 = pd.crosstab(df['year'], df['Company/Account Name'])

df = df.merge(df1.loc[:, df1.gt(0).all()].stack().index.to_frame(index=False))
print (df)
   year Company/Account Name  EstimatedGroupSize
0  2019              Unknown               19550
1  2019             Deloitte                6432
2  2019    Intel Corporation                4595
3  2020             Deloitte                4063
4  2020              Unknown                3490
5  2021              Unknown                3484
6  2020    Intel Corporation                3460
7  2021    Intel Corporation                3433
8  2021             Deloitte                3250

CodePudding user response:

IIUC,

years = [2019, 2020, 2021]
new_df = \
df.loc[pd.get_dummies(df['year'])
         .groupby(df['Company/Account Name'])[years]
         .transform('sum')
         .gt(0)
         .all(axis=1)]
print(new_df)

    year Company/Account Name  EstimatedGroupSize
0   2019              Unknown               19550
2   2019             Deloitte                6432
4   2019    Intel-Corporation                4595
5   2020             Deloitte                4063
6   2020              Unknown                3490
7   2021              Unknown                3484
8   2020    Intel-Corporation                3460
9   2021    Intel-Corporation                3433
10  2021             Deloitte                3250

Or:

years = [2019, 2020, 2021]
new_df = \
df.groupby('Company/Account Name')\
  .filter(lambda x: np.isin(years, x['year']).all())
  • Related