I'm a newbie. Consider the following: for any LOC_ID that has more than 3 unique SUB_IDs, I take a random sample of 3 unique SUB_IDs and piece together a new dataframe with pd.concat:
import numpy as np
import pandas as pd

df_concat = pd.DataFrame(columns=['LOC_ID', 'SUB_ID'])
for loc in test_df['LOC_ID'].unique():
    sub_ids = test_df[test_df['LOC_ID'] == loc]['SUB_ID'].unique()
    if len(sub_ids) > 3:
        np.random.seed(622)
        sub_ids_max = np.random.choice(sub_ids, size=3, replace=False)
        df_sample = test_df[test_df['SUB_ID'].isin(sub_ids_max)]
    else:
        df_sample = test_df[test_df['SUB_ID'].isin(sub_ids)]
    df_concat = pd.concat([df_concat, df_sample], ignore_index=True)
The following is the real case, where I also have a year component with 6 years. Each LOC_ID within locids_sub has more than 10 unique SUB_IDs in at least one year. Per LOC_ID, per year, I need to ensure that there are no more than 10 unique SUB_IDs:
max_subid = 10
years = df_bb['Year'].unique()
count_df = df_bb.groupby(['LOC_ID', 'Year']).nunique().reset_index()
locids_sub = count_df[count_df['SUB_ID'] > max_subid]['LOC_ID'].unique()
# Subset df_bb by locids_sub
df_bb_sub = df_bb[df_bb['LOC_ID'].isin(locids_sub)][['Year', 'LOC_ID', 'SUB_ID']]
df_concat = pd.DataFrame(columns=df_bb.columns)  # Initializing df
for year in years:
    for loc in locids_sub:
        sub_ids = df_bb_sub[(df_bb_sub['Year'] == year) & (df_bb_sub['LOC_ID'] == loc)]['SUB_ID'].unique()
        if len(sub_ids) > max_subid:
            np.random.seed(year + int(loc[-2:]))  # seed from year and LOC_ID suffix
            sub_ids_max = np.random.choice(sub_ids, size=max_subid, replace=False)
            df_sample = df_bb[df_bb['SUB_ID'].isin(sub_ids_max)]
        else:
            df_sample = df_bb[df_bb['SUB_ID'].isin(sub_ids)]
        df_concat = pd.concat([df_concat, df_sample], ignore_index=True)
With 6 years, 1460 LOC_IDs in locids_sub, and 1828201 rows in df_bb, this takes 30 minutes to run.
Please let me know how to make this more efficient.
CodePudding user response:
IIUC, you can use:
out = (
    df.drop_duplicates(['LOC_ID', 'SUB_ID'])
      .assign(count=lambda x: x.groupby('LOC_ID')['SUB_ID'].transform('nunique'))
      .loc[lambda x: x.pop('count') >= 3]
      .groupby('LOC_ID').sample(n=3)
)
print(out)
# Output
  LOC_ID  SUB_ID
5      a     903
0      a     222
3      a     888
8      b     314
7      b     721
6      b     231
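For the real case in the question (at most 10 unique SUB_IDs per LOC_ID per Year), the same idea extends by grouping on both keys instead of looping. A minimal sketch, assuming df_bb has the Year, LOC_ID and SUB_ID columns shown above; the final merge is only needed if you want every matching row of df_bb, and note it filters on the full (Year, LOC_ID, SUB_ID) key rather than on SUB_ID alone as the loop does:
max_subid = 10

# unique (Year, LOC_ID, SUB_ID) combinations, with the per-group count attached
pairs = (
    df_bb.drop_duplicates(['Year', 'LOC_ID', 'SUB_ID'])[['Year', 'LOC_ID', 'SUB_ID']]
         .assign(n=lambda x: x.groupby(['Year', 'LOC_ID'])['SUB_ID'].transform('nunique'))
)

# keep small groups as-is, sample max_subid SUB_IDs from the large ones
keep = pd.concat([
    pairs[pairs['n'] <= max_subid],
    pairs[pairs['n'] > max_subid]
        .groupby(['Year', 'LOC_ID'])
        .sample(n=max_subid, random_state=622),
]).drop(columns='n')

# pull back all rows of df_bb for the kept combinations
out = df_bb.merge(keep, on=['Year', 'LOC_ID', 'SUB_ID'])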
CodePudding user response:
You can use groupby.apply:
(df
 .groupby('LOC_ID', group_keys=False)
 .apply(lambda g: g if len(g2 := g.drop_duplicates('SUB_ID')) < 3
        else g2.sample(n=3))
)
Output:
  LOC_ID  SUB_ID
5      a     903
4      a     918
3      a     888
8      b     314
7      b     721
6      b     231
NB. If you like this, still consider accepting @Corralien's answer; I would probably not have made an attempt without it.
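If reproducibility matters (the question seeds NumPy inside its loops), the sampling in either answer can be pinned by passing random_state to sample. A small sketch on the apply-based version above, assuming the same toy df and reusing the question's seed of 622:
(df
 .groupby('LOC_ID', group_keys=False)
 .apply(lambda g: g if len(g2 := g.drop_duplicates('SUB_ID')) < 3
        else g2.sample(n=3, random_state=622))  # fixed seed -> same sample every run
)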