Return a limited number of rows of a dataframe for each unique value in a column

I have a dataframe of leads, each assigned to a territory, with 600k rows in total. I want to return a dataframe with 400 records from each territory but can't find a way to do so.

Here is what I have so far with a sample dataset:

original dataset:

                              Account Name   Territory            group
366663                       THOMPSON RAY E  South Carolina175    g7
529113            SOUTHERN TRADITION REALTY  South Carolina175    g7
143584            DELANCO INSPECTION CENTER      New Jersey221    g6
17636                   ONE VISION ELECTRIC      New Jersey221    g6
561095         SIMPLEFLOORS NORTH HOLLYWOOD       Texas73         g11
306094                    TEXAS REALTY CAFE       Texas73         g11

Say I want to return 1 record from each territory in the final dataset. Desired output:

                              Account Name   Territory            group
366663                       THOMPSON RAY E  South Carolina175    g7
143584            DELANCO INSPECTION CENTER      New Jersey221    g6
561095         SIMPLEFLOORS NORTH HOLLYWOOD       Texas73         g11

I don't care which records from each territory are returned in the final result, just that there is the same number from each. (In practice I will want more than 1 record per territory, so drop_duplicates on the Territory subset wouldn't work.)

I've tried using groupby but can't figure out how to do anything but create groups of all the records in each territory. Any help appreciated. Thanks.

CodePudding user response:

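Use groupby and sample (groupby(...).sample needs pandas 1.1 or later). With n=400, every territory must have at least 400 rows; otherwise sample raises a ValueError unless you pass replace=True: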

>>> df.groupby("Territory").sample(1)
                        Account Name          Territory group
143584     DELANCO INSPECTION CENTER      New Jersey221    g6
529113     SOUTHERN TRADITION REALTY  South Carolina175    g7
561095  SIMPLEFLOORS NORTH HOLLYWOOD            Texas73   g11
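
For the case where some territories may have fewer rows than the requested number, here is a minimal sketch built on the sample rows from the question. The names N, CAP, exact and at_most are made up for illustration, and the shuffle-then-head step is a swapped-in alternative to groupby(...).sample, not part of the answer above.

```python
import pandas as pd

# Sample rows copied from the question.
df = pd.DataFrame(
    {
        "Account Name": [
            "THOMPSON RAY E",
            "SOUTHERN TRADITION REALTY",
            "DELANCO INSPECTION CENTER",
            "ONE VISION ELECTRIC",
            "SIMPLEFLOORS NORTH HOLLYWOOD",
            "TEXAS REALTY CAFE",
        ],
        "Territory": [
            "South Carolina175",
            "South Carolina175",
            "New Jersey221",
            "New Jersey221",
            "Texas73",
            "Texas73",
        ],
        "group": ["g7", "g7", "g6", "g6", "g11", "g11"],
    },
    index=[366663, 529113, 143584, 17636, 561095, 306094],
)

N = 1  # hypothetical name; this would be 400 on the real dataset

# Exactly N random rows per territory; random_state makes the draw repeatable.
exact = df.groupby("Territory").sample(n=N, random_state=0)

# At most CAP random rows per territory: shuffle all rows, then keep the
# first CAP rows of each group. Groups smaller than CAP are returned whole.
CAP = 3  # hypothetical; deliberately larger than any group in the sample data
at_most = df.sample(frac=1, random_state=0).groupby("Territory").head(CAP)

print(exact)
print(at_most["Territory"].value_counts())
```

The second form trades one extra shuffle of the whole frame for not having to check group sizes first.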

CodePudding user response:

You can also use cumcount, which numbers the rows within each group starting at 0, and keep the rows whose counter is below N:

# N = 1  (or N = 400)
>>> df[df.groupby('Territory').cumcount() < N]
                        Account Name          Territory group
366663                THOMPSON RAY E  South Carolina175    g7
143584     DELANCO INSPECTION CENTER      New Jersey221    g6
561095  SIMPLEFLOORS NORTH HOLLYWOOD            Texas73   g11
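
As a self-contained check of the cumcount filter, the sketch below rebuilds a trimmed version of the question's sample data and compares the result with groupby(...).head(N), which should select the same rows. The variable names by_cumcount and by_head are invented for the example.

```python
import pandas as pd

# Trimmed version of the question's sample data (same index and territories).
df = pd.DataFrame(
    {
        "Territory": [
            "South Carolina175",
            "South Carolina175",
            "New Jersey221",
            "New Jersey221",
            "Texas73",
            "Texas73",
        ],
        "group": ["g7", "g7", "g6", "g6", "g11", "g11"],
    },
    index=[366663, 529113, 143584, 17636, 561095, 306094],
)

N = 1  # or 400 on the full dataset

# cumcount() numbers rows within each territory 0, 1, 2, ... in their
# existing order, so "< N" keeps the first N rows of every territory.
by_cumcount = df[df.groupby("Territory").cumcount() < N]

# head(N) on the groupby keeps the first N rows per group as well,
# preserving the original row order and index.
by_head = df.groupby("Territory").head(N)

print(by_cumcount)
print(by_cumcount.equals(by_head))
```

Both forms are deterministic (first rows rather than random ones) and simply return every row of a territory that has fewer than N records.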