Home > Software engineering >  Faster Way to Categorize Unique Row Combinations based on a Condition in Python Pandas?
Faster Way to Categorize Unique Row Combinations based on a Condition in Python Pandas?

Time:10-25

I have a dataset that has the same layout as the one below. I want to categorize where each person went in their car for a given month.

Date Visitor Car Name Location
6-2022 Michael Chrysler Work
6-2022 Jim Subaru Home
6-2022 Jim Subaru Work
6-2022 Michael Chrysler Karate
7-2022 Michael Chrysler Work
7-2022 Jim Subaru Home
7-2022 Jim Subaru Work

The solution would look something like this:

Date Visitor Car Name Location
6-2022 Michael Chrysler Work & Karate
6-2022 Jim Subaru Work & Home
7-2022 Michael Chrysler Work
7-2022 Jim Subaru Work & Home

I HAVE a solution but it scales linearly with time as the number of rows increases. The fastest I have gotten it to run over ~32,000 rows is ~15 minutes.

CodePudding user response:

You could try as follows:

  • Use df.groupby on Date, Visitor, Car Name, and apply join to the values of Location.
  • To change the last , delimiter to an ampersand, you could use Series.replace with a regex pattern.
res = df.groupby(['Date','Visitor','Car Name'], 
                 as_index=False, sort=False)['Location'].agg(', '.join)

# changing each last `,` delimiter into an `&`
res['Location'] = res['Location'].replace(r',\s(?=[^,]*$)',' & ', regex=True)

print(res)

     Date  Visitor  Car Name       Location
0  6-2022  Michael  Chrysler  Work & Karate
1  6-2022      Jim    Subaru    Home & Work
2  7-2022  Michael  Chrysler           Work
3  7-2022      Jim    Subaru    Home & Work

If you expect duplicates per group (and want to get rid of them), you could use .agg(lambda x: ', '.join(set(x))) instead of .agg(', '.join). Mind you, this will affect the order, not sure if this is a problem. If so, there are of course also ways to get rid of duplicates in a list, while maintained order.

CodePudding user response:

Try this:

df.groupby(['Date', 'Visitor', 'Car Name'], as_index=False)['Location'].agg(' & '.join)

Output:

     Date  Visitor  Car Name       Location
0  6-2022      Jim    Subaru    Home & Work
1  6-2022  Michael  Chrysler  Work & Karate
2  7-2022      Jim    Subaru    Home & Work
3  7-2022  Michael  Chrysler           Work
  • Related