Efficiently reduce the size of groups in a dataframe


I have a dataframe that I am grouping by the names in each row using the groupby function. I then want to reduce each group to a given maximum size and add the reduced groups back into a single dataframe for other processes to use. Currently I am doing this in a for loop, which seems really inefficient. Does pandas have a method to do this more efficiently?

import pandas as pd

target_number_rows = 10  # desired maximum number of rows per group

grouped = df.groupby(['NAME'])
total = grouped.ngroups

df_final = pd.DataFrame()
for name, group in grouped:
    if len(group.index) > target_number_rows:
        # Keep roughly every Nth row so the group shrinks to ~target size
        shortened = group[::int(len(group.index) / target_number_rows)]
        df_final = pd.concat([df_final, shortened], ignore_index=True)
    # Note: groups at or below the target size are skipped entirely here

CodePudding user response:

Group by the name and apply a sample, which will randomly take N rows within each group, where N is either your desired amount or the full size of that group, whichever is smaller, e.g.:

out = df.groupby('NAME').apply(lambda g: g.sample(min(len(g), target_number_rows)))
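Note that apply on a groupby prepends the group key to the result's index. If you want to keep the original flat index, a minimal variation (assuming the same df, NAME, and target_number_rows as above) is to pass group_keys=False:

out = df.groupby('NAME', group_keys=False).apply(lambda g: g.sample(min(len(g), target_number_rows)))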

Otherwise, to take the first N or last N rows of each group, e.g.:

out = df.groupby('NAME').head(target_number_rows)   # first N rows of each group
# or...
out = df.groupby('NAME').tail(target_number_rows)   # last N rows of each group
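If you instead want to preserve the original evenly spaced thinning (every Nth row per group) rather than a random or head/tail selection, here is a vectorized sketch using transform and cumcount, assuming the same df, NAME, and target_number_rows as above:

import pandas as pd

# Per-row step size: group size divided by the target, at least 1
step = (df.groupby('NAME')['NAME'].transform('size') // target_number_rows).clip(lower=1)
# Keep every step-th row within each group, mirroring group[::step]
out = df[df.groupby('NAME').cumcount() % step == 0]

This avoids the Python-level loop and the repeated pd.concat calls, which is where most of the original inefficiency comes from.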