I have a dataframe full of leads with different territories assigned that has a total length of 600k rows. I want to return a dataframe with 400 records from each territory but can't find a way to do so.
Here is what I have so far with a sample dataset:
Original dataset:
        Account Name                  Territory          group
366663  THOMPSON RAY E                South Carolina175  g7
529113  SOUTHERN TRADITION REALTY     South Carolina175  g7
143584  DELANCO INSPECTION CENTER     New Jersey221      g6
17636   ONE VISION ELECTRIC           New Jersey221      g6
561095  SIMPLEFLOORS NORTH HOLLYWOOD  Texas73            g11
306094  TEXAS REALTY CAFE             Texas73            g11
Say I want to return one record from each territory in the final dataset. Desired output:
        Account Name                  Territory          group
366663  THOMPSON RAY E                South Carolina175  g7
143584  DELANCO INSPECTION CENTER     New Jersey221      g6
561095  SIMPLEFLOORS NORTH HOLLYWOOD  Texas73            g11
I don't care which records from each territory are returned in the final result, just that there is the same number from each (in practice I will want more than one record per territory, so drop_duplicates on subset Territory wouldn't work).
I've tried using groupby but can't figure out how to do anything but create groups of all the records in each territory. Any help appreciated. Thanks.
CodePudding user response:
Use groupby and sample:
>>> df.groupby("Territory").sample(1)
        Account Name                  Territory          group
143584  DELANCO INSPECTION CENTER     New Jersey221      g6
529113  SOUTHERN TRADITION REALTY     South Carolina175  g7
561095  SIMPLEFLOORS NORTH HOLLYWOOD  Texas73            g11
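For the real data you would pass n=400 instead of 1. Here is a minimal self-contained sketch, rebuilding the sample frame from the question by hand. The names N and sampled, and the random_state=0 seed, are my own choices for illustration and not something from the question. One thing to watch for: sample raises a ValueError if any territory has fewer than n rows, unless you pass replace=True.

```python
import pandas as pd

# Sample frame rebuilt from the question (index values are the original row labels)
df = pd.DataFrame(
    {
        "Account Name": [
            "THOMPSON RAY E",
            "SOUTHERN TRADITION REALTY",
            "DELANCO INSPECTION CENTER",
            "ONE VISION ELECTRIC",
            "SIMPLEFLOORS NORTH HOLLYWOOD",
            "TEXAS REALTY CAFE",
        ],
        "Territory": [
            "South Carolina175",
            "South Carolina175",
            "New Jersey221",
            "New Jersey221",
            "Texas73",
            "Texas73",
        ],
        "group": ["g7", "g7", "g6", "g6", "g11", "g11"],
    },
    index=[366663, 529113, 143584, 17636, 561095, 306094],
)

N = 1  # illustrative; this would be 400 on the full 600k-row frame

# Draw N random rows from every territory.
# random_state is only here to make the draw repeatable (assumed seed).
sampled = df.groupby("Territory").sample(n=N, random_state=0)

# Every territory should now appear exactly N times
print(sampled["Territory"].value_counts())
```

Which row you get per territory depends on the seed, so don't expect it to match the output above exactly.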
CodePudding user response:
You can also use cumcount (the counter starts at 0, so cumcount() < N keeps the first N rows of each territory):
# N = 1 (or N = 400)
>>> df[df.groupby('Territory').cumcount() < N]
        Account Name                  Territory          group
366663  THOMPSON RAY E                South Carolina175  g7
143584  DELANCO INSPECTION CENTER     New Jersey221      g6
561095  SIMPLEFLOORS NORTH HOLLYWOOD  Texas73            g11
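A runnable sketch of the cumcount filter on the question's sample data, in case it helps. The variable names (first_n, random_n, all_rows) and the shuffle step are my own additions, not part of the answer above. Unlike sample, this does not raise when a territory has fewer than N rows; it just keeps whatever rows that territory has. It also always keeps the first N rows in frame order, so if you want a random pick, one option is to shuffle the frame first.

```python
import pandas as pd

# Sample frame rebuilt from the question (index values are the original row labels)
df = pd.DataFrame(
    {
        "Account Name": [
            "THOMPSON RAY E",
            "SOUTHERN TRADITION REALTY",
            "DELANCO INSPECTION CENTER",
            "ONE VISION ELECTRIC",
            "SIMPLEFLOORS NORTH HOLLYWOOD",
            "TEXAS REALTY CAFE",
        ],
        "Territory": [
            "South Carolina175",
            "South Carolina175",
            "New Jersey221",
            "New Jersey221",
            "Texas73",
            "Texas73",
        ],
        "group": ["g7", "g7", "g6", "g6", "g11", "g11"],
    },
    index=[366663, 529113, 143584, 17636, 561095, 306094],
)

N = 1  # illustrative; this would be 400 on the full frame

# cumcount numbers the rows inside each territory 0, 1, 2, ...
# so "< N" keeps the first N rows of every territory, in original order.
first_n = df[df.groupby("Territory").cumcount() < N]

# Optional (my assumption about what you might want): shuffle first,
# so the N rows kept per territory are random rather than the first N.
shuffled = df.sample(frac=1, random_state=0)
random_n = shuffled[shuffled.groupby("Territory").cumcount() < N]

# With a limit larger than any territory, every row is kept and nothing raises.
all_rows = df[df.groupby("Territory").cumcount() < 400]

print(first_n)
```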