random sampling of many groups within dataframe IF group size is greater than a value


I'm a newbie. Consider the following toy example: for any LOC_ID that has more than 3 unique SUB_IDs, I take a random sample of 3 unique SUB_IDs and piece a new dataframe together with pd.concat:

[input dataframe: test_df]

import numpy as np
import pandas as pd

df_concat = pd.DataFrame(columns=['LOC_ID', 'SUB_ID'])

for loc in test_df['LOC_ID'].unique():
    sub_ids = test_df[test_df['LOC_ID'] == loc]['SUB_ID'].unique()
    if len(sub_ids) > 3:
        # More than 3 unique SUB_IDs: keep only 3 randomly chosen ones
        np.random.seed(622)
        sub_ids_max = np.random.choice(sub_ids, size=3, replace=False)
        df_sample = test_df[test_df['SUB_ID'].isin(sub_ids_max)]
    else:
        # 3 or fewer unique SUB_IDs: keep them all
        df_sample = test_df[test_df['SUB_ID'].isin(sub_ids)]
    df_concat = pd.concat([df_concat, df_sample], ignore_index=True)

[output dataframe: df_concat]

The following is the real case, where I also have a year component spanning 6 years. Each LOC_ID within locids_sub has more than 10 unique SUB_IDs in at least one year. Per LOC_ID, per year, I need to ensure that there are no more than 10 unique SUB_IDs:

max_subid = 10
years = df_bb['Year'].unique()

# Count unique SUB_IDs per (LOC_ID, Year); keep LOC_IDs that exceed the cap in at least one year
count_df = df_bb.groupby(['LOC_ID', 'Year']).nunique().reset_index()
locids_sub = count_df[count_df['SUB_ID'] > max_subid]['LOC_ID'].unique()

# Subset df_bb by locids_sub
df_bb_sub = df_bb[df_bb['LOC_ID'].isin(locids_sub)][['Year', 'LOC_ID', 'SUB_ID']]

df_concat = pd.DataFrame(columns=df_bb.columns)  # Initializing df

for year in years:
    for loc in locids_sub:
        sub_ids = df_bb_sub[(df_bb_sub['Year'] == year) & (df_bb_sub['LOC_ID'] == loc)]['SUB_ID'].unique()
        if len(sub_ids) > max_subid:
            np.random.seed(year + int(loc[-2:]))  # seed derived from year and LOC_ID
            sub_ids_max = np.random.choice(sub_ids, size=max_subid, replace=False)
            df_sample = df_bb[df_bb['SUB_ID'].isin(sub_ids_max)]
        else:
            df_sample = df_bb[df_bb['SUB_ID'].isin(sub_ids)]
        df_concat = pd.concat([df_concat, df_sample], ignore_index=True)

With 6 years, 1460 LOC_IDs in locids_sub, and 1828201 rows in df_bb, this takes 30 minutes to run.
Please let me know how to make this more efficient.

CodePudding user response:

IIUC, you can use:

out = (
    df.drop_duplicates(['LOC_ID', 'SUB_ID'])          # one row per (LOC_ID, SUB_ID) pair
      .assign(count=lambda x: x.groupby('LOC_ID').transform('nunique'))  # unique SUB_IDs per LOC_ID
      .loc[lambda x: x.pop('count') >= 3]             # keep LOC_IDs with at least 3 (pop drops the helper column)
      .groupby('LOC_ID').sample(n=3)                  # sample 3 unique SUB_IDs per remaining LOC_ID
)
print(out)

# Output
  LOC_ID  SUB_ID
5      a     903
0      a     222
3      a     888
8      b     314
7      b     721
6      b     231
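For the real case (grouping by LOC_ID and Year with a cap of 10), a possible adaptation of the same idea is sketched below. This is my own extension rather than tested code from the original answer; it assumes the question's df_bb with columns Year, LOC_ID and SUB_ID, and random_state is only there for reproducibility, like the seed in the question:

import pandas as pd

max_subid = 10

# One row per (LOC_ID, Year, SUB_ID) combination
pairs = df_bb.drop_duplicates(['LOC_ID', 'Year', 'SUB_ID'])[['LOC_ID', 'Year', 'SUB_ID']]

# Number of unique SUB_IDs in each (LOC_ID, Year) group
counts = pairs.groupby(['LOC_ID', 'Year'])['SUB_ID'].transform('size')

# Cap the large groups at max_subid sampled SUB_IDs, keep the small groups whole
sampled = (pairs[counts > max_subid]
           .groupby(['LOC_ID', 'Year'])
           .sample(n=max_subid, random_state=622))
selected = pd.concat([sampled, pairs[counts <= max_subid]])

# Pull back every original row whose (LOC_ID, Year, SUB_ID) survived the cap
out = df_bb.merge(selected, on=['LOC_ID', 'Year', 'SUB_ID'])

Note that the loop in the question filters df_bb by SUB_ID alone, so a sampled SUB_ID pulls in matching rows from every year and LOC_ID; the merge above keeps rows only within their own (LOC_ID, Year) group, and it replaces the 6 x 1460 boolean filters over the full frame with a handful of vectorized operations.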

CodePudding user response:

You can use groupby.apply:

(df
 .groupby('LOC_ID', group_keys=False)
 .apply(lambda g: g if len(g2 := g.drop_duplicates('SUB_ID')) < 3  # fewer than 3 unique SUB_IDs: keep the group as is
                  else g2.sample(n=3))                             # otherwise sample 3 unique SUB_IDs
 )

Output:

  LOC_ID  SUB_ID
5      a     903
4      a     918
3      a     888
8      b     314
7      b     721
6      b     231
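The same pattern should extend to the real case by grouping on both keys; the sketch below is my own adaptation (not part of the answer above), again assuming the question's df_bb and a cap of 10 unique SUB_IDs per (LOC_ID, Year). The := assignment expression requires Python 3.8+.

max_subid = 10

out = (df_bb
       .groupby(['LOC_ID', 'Year'], group_keys=False)
       # keep small groups whole; cap large groups at max_subid deduplicated SUB_ID rows
       .apply(lambda g: g if len(g2 := g.drop_duplicates('SUB_ID')) <= max_subid
                        else g2.sample(n=max_subid, random_state=622)))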

NB. If you like this, still consider accepting @Corralien's answer. I would probably not have made an attempt without his answer.
