Refactor code in a pythonic way to get the most popular elements in pandas dataframe

Time:01-06

This is the dataframe:

image_file objects
0 image_1.png [car, car, car, car, car, car, car, bus, car]
1 image_2.png [traffic light, car, car, car, car, car, car, car, car, car]
2 image_3.png [car, traffic light, person, car, car, car, car]
3 image_4.png [person, person, car, car, bicycle, car, car]
4 image_5.png [car, car, car, car, car, person, car, car, car]

Each entry in the objects column is a list of the objects detected in the image, so an object that appears several times in the image is repeated in the list.

I was able to count the most frequent objects across the images that contain 3 or fewer distinct objects with this code:

from collections import Counter

result = []

# Iterate through the rows of the dataframe
for i, row in df.iterrows():
    # Count the frequency of each object in the image
    frequencies = Counter(row['objects'])
    # Sort the frequencies from most to least common
    sorted_frequencies = sorted(frequencies.items(),
                                key=lambda x: x[1],
                                reverse=True)

    # Check whether the image contains 3 or fewer distinct objects
    if len(sorted_frequencies) <= 3:
        # If so, append each distinct object once to the result list
        result.extend([obj for obj, _ in sorted_frequencies])

# Count how many qualifying images each object appears in
frequency_3_most_pop = dict(Counter(result))

My concern is that iterrows is not the best way to iterate over a dataframe, and I would like to refactor the code to avoid it. Any help would be appreciated.

CodePudding user response:

Assuming you have lists in df['objects'], you can simplify your code:

from collections import Counter

frequency_3_most_pop = dict(Counter(x for l in df['objects']
                                    if len(c := Counter(l)) <= 3 for x in c))

NB. This requires Python 3.8 or later because of the walrus operator (:=), introduced in PEP 572.

Output:

{'car': 5, 'bus': 1, 'traffic light': 2, 'person': 3, 'bicycle': 1}
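On Python versions older than 3.8, the walrus can probably be avoided by building the set of distinct labels first. The following is only a sketch under that assumption; the helper name count_labels and the plain list objects are hypothetical stand-ins for df['objects'], which can be passed in the same way because iterating over the column yields the lists one by one.

```python
from collections import Counter


def count_labels(rows, max_distinct=3):
    """Count, per label, the number of rows that contain it,
    considering only rows with at most `max_distinct` distinct labels.

    `count_labels` and `max_distinct` are illustrative names, not part
    of the original answer.
    """
    # One set of distinct labels per row (lazy, nothing is stored)
    distinct_per_row = (set(labels) for labels in rows)
    return dict(Counter(label
                        for distinct in distinct_per_row
                        if len(distinct) <= max_distinct
                        for label in distinct))


# Same data as the question; with a dataframe, pass df['objects'] instead
objects = [
    ['car', 'car', 'car', 'car', 'car', 'car', 'car', 'bus', 'car'],
    ['traffic light', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car'],
    ['car', 'traffic light', 'person', 'car', 'car', 'car', 'car'],
    ['person', 'person', 'car', 'car', 'bicycle', 'car', 'car'],
    ['car', 'car', 'car', 'car', 'car', 'person', 'car', 'car', 'car'],
]

frequency_3_most_pop = count_labels(objects)
```

The counts are the same as with the Counter-based generator, but because sets are unordered, the key order of the resulting dict may differ.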

Timing

Measured on 6k rows:

# original approach
346 ms ± 49.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

# Counter generator (this approach)
11.5 ms ± 1.01 ms per loop (mean ± std. dev. of 7 runs, 100 loops each)
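A comparison along these lines could be reproduced roughly as follows. This is a sketch, not the exact benchmark used above: the 6k-row frame is assumed to be the 5-row sample repeated 1200 times, the function names loop_version and generator_version are made up for illustration, and the loop is a condensed variant of the original code without the sorting step. Absolute timings will depend on the machine and the pandas version.

```python
import timeit
from collections import Counter

import pandas as pd

# The 5 sample rows from the question, written compactly
sample = [
    ['car'] * 7 + ['bus', 'car'],
    ['traffic light'] + ['car'] * 9,
    ['car', 'traffic light', 'person'] + ['car'] * 4,
    ['person', 'person', 'car', 'car', 'bicycle', 'car', 'car'],
    ['car'] * 5 + ['person'] + ['car'] * 3,
]

# Assumption: 6k rows obtained by repeating the sample 1200 times
df = pd.DataFrame({'objects': sample * 1200})


def loop_version(df):
    """iterrows-based approach (condensed from the question)."""
    result = []
    for _, row in df.iterrows():
        frequencies = Counter(row['objects'])
        if len(frequencies) <= 3:
            result.extend(frequencies)  # extends with the distinct labels
    return dict(Counter(result))


def generator_version(df):
    """Counter generator approach (requires Python 3.8+)."""
    return dict(Counter(x for l in df['objects']
                        if len(c := Counter(l)) <= 3 for x in c))


# Both approaches should agree before their speed is compared
assert loop_version(df) == generator_version(df)

runs = 3
loop_seconds = timeit.timeit(lambda: loop_version(df), number=runs) / runs
gen_seconds = timeit.timeit(lambda: generator_version(df), number=runs) / runs
print(f"iterrows loop:     {loop_seconds * 1000:.1f} ms per run")
print(f"Counter generator: {gen_seconds * 1000:.1f} ms per run")
```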

Used input:

import pandas as pd

df = pd.DataFrame({'image_file': ['image_1.png', 'image_2.png', 'image_3.png', 'image_4.png', 'image_5.png'],
                   'objects': [['car', 'car', 'car', 'car', 'car', 'car', 'car', 'bus', 'car'],
                               ['traffic light', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car'],
                               ['car', 'traffic light', 'person', 'car', 'car', 'car', 'car'],
                               ['person', 'person', 'car', 'car', 'bicycle', 'car', 'car'],
                               ['car', 'car', 'car', 'car', 'car', 'person', 'car', 'car', 'car']],
                  })
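For comparison, the same counts can also be obtained with pandas methods only, using map, explode and value_counts. This is an alternative sketch rather than the approach timed above, and it has not been benchmarked here, so it should not be assumed to be faster than the Counter generator. The variable names distinct and few_distinct are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'image_file': ['image_1.png', 'image_2.png', 'image_3.png',
                   'image_4.png', 'image_5.png'],
    'objects': [
        ['car', 'car', 'car', 'car', 'car', 'car', 'car', 'bus', 'car'],
        ['traffic light', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car', 'car'],
        ['car', 'traffic light', 'person', 'car', 'car', 'car', 'car'],
        ['person', 'person', 'car', 'car', 'bicycle', 'car', 'car'],
        ['car', 'car', 'car', 'car', 'car', 'person', 'car', 'car', 'car'],
    ],
})

# Reduce each list to its set of distinct labels
distinct = df['objects'].map(set)

# Keep only the images with 3 or fewer distinct labels
few_distinct = distinct[distinct.map(len) <= 3]

# One row per (image, label) pair, then count the images per label
frequency_3_most_pop = few_distinct.explode().value_counts().to_dict()
```

value_counts sorts by count in descending order, so the keys come out in a different order than in the Counter-based result, while the counts themselves are the same.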