Pandas filtering based on minimum data occurrences across multiple columns-CodePudding

I have a dataframe like this

country     data_fingerprint   organization     
US          111                Tesco         
UK          222                IBM          
US          111                Yahoo           
PY          333                Tesco
US          111                Boeing   
CN          333                TCS  
NE          458                Yahoo
UK          678                Tesco

I want those data_fingerprint for where the organisation and country with top 2 counts exists

So if see in organization top 2 occurrences are for Tesco,Yahoo and for country we have US,UK .

So based on that the output of data_fingerprint should be having

data_fingerprint
111
678

What i have tried for organization to exist in my complete dataframe is this

# First find top 2 occurances of organization
nd = df['organization'].value_counts().groupby(level=0, group_keys=False).head(2)
# Then checking if the organization exist in the complete dataframe and filtering those rows
new = df["organization"].isin(nd)

But i am not getting any data here.Once i get data for this I can do it along with country Can someone please help to get me the output.I have less data so using Pandas

CodePudding user response：

Annotated code

# find top 2 most occurring country and organization
i1 = df['country'].value_counts().index[:2]
i2 = df['organization'].value_counts().index[:2]

# Create boolean mask to select the rows having top 2 country and org.
mask = df['country'].isin(i1) & df['organization'].isin(i2)

# filter the rows using the mask and drop dupes in data_fingerprint
df.loc[mask, ['data_fingerprint']].drop_duplicates()

Result

   data_fingerprint
0               111
7               678

CodePudding user response：

here is one way to do it

df[
    df['organization'].isin(df['organization'].value_counts().head(2).index) &
    df['country'].isin(df['country'].value_counts().head(2).index)
]['data_fingerprint'].unique()

array([111, 678], dtype=int64)

CodePudding user response：

You can do

# First find top 2 occurances of organization
nd = df['organization'].value_counts().head(2).index
# Then checking if the organization exist in the complete dataframe and filtering those rows
new = df["organization"].isin(nd)

Output - Only Tesco and Yahoo left

df[new]

    country data_fingerprint    organization
0        US              111           Tesco
2        US              111           Yahoo
3        PY              333           Tesco
6        NE              458           Yahoo
7        UK              678           Tesco

You can do the same for country