I have a dataframe which I group by the NAME column using groupby. I then want to reduce each group to a given size and concatenate the reduced groups back into a single dataframe for other processes. Currently I am doing this in a for loop, but it seems really inefficient. Is there a method in pandas to do this more efficiently?
grouped = df.groupby(['NAME'])
total = grouped.ngroups
df_final = pd.DataFrame()
target_number_rows = 10

for name, group in grouped:
    if len(group.index) > target_number_rows:
        # Keep every nth row so the result spans the whole group
        shortened = group[::int(len(group.index) / target_number_rows)]
        df_final = pd.concat([df_final, shortened], ignore_index=True)
CodePudding user response:
Group by the name and apply a sample (which takes N rows at random from each group), where N is the smaller of your desired amount and that group's size, eg:
out = df.groupby('NAME').apply(lambda g: g.sample(min(len(g), target_number_rows)))
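A minimal runnable sketch of the sampling approach; the toy data and `group_keys=False` (which keeps the output index flat) are assumptions for illustration, while `NAME` and `target_number_rows` come from the question:

```python
import pandas as pd

# Toy frame: group "a" has 5 rows, group "b" has 2 (assumed data)
df = pd.DataFrame({
    "NAME": ["a"] * 5 + ["b"] * 2,
    "VALUE": range(7),
})
target_number_rows = 3

# Randomly keep at most target_number_rows rows per group;
# min() guards groups smaller than the target.
out = (
    df.groupby("NAME", group_keys=False)
      .apply(lambda g: g.sample(min(len(g), target_number_rows)))
)
# "a" is reduced to 3 rows, "b" keeps its 2 rows
```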
Otherwise, to take the first N or last N rows of each group, eg:
out = df.groupby('NAME').head(target_number_rows)
# or...
out = df.groupby('NAME').tail(target_number_rows)
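If the goal is the exact behavior of the question's loop (keep every nth row, preserving order and spanning the whole group), the same stride can be applied per group without the manual concat loop. This is a sketch under that assumption, with toy data; the `downsample` helper is hypothetical:

```python
import pandas as pd

# Toy frame: group "a" has 20 rows, group "b" has 5 (assumed data)
df = pd.DataFrame({
    "NAME": ["a"] * 20 + ["b"] * 5,
    "VALUE": range(25),
})
target_number_rows = 10

def downsample(g: pd.DataFrame) -> pd.DataFrame:
    # Same stride logic as the question's loop: every nth row,
    # leaving groups at or below the target untouched.
    if len(g) > target_number_rows:
        return g.iloc[::len(g) // target_number_rows]
    return g

out = (
    df.groupby("NAME", group_keys=False)
      .apply(downsample)
      .reset_index(drop=True)
)
# "a" (stride 2) is cut to 10 rows; "b" is kept whole
```

Unlike the loop version, this also keeps groups that are already at or below the target size, which is usually what you want.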