Summarizing a pandas DataFrame by group using a custom function results in wrong output

Time:07-24

I have a pandas DataFrame that I want to summarize by group, using a custom function that resolves to a boolean value.

Consider the following data. df describes 4 people, and for each person the fruits they like.

import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["danny", "danny", "danny", "monica", "monica", "monica", "fred", "fred", "sam", "sam"],
    "fruit": ["apricot", "apple", "orange", "apricot", "banana", "watermelon", "apple", "apricot", "apricot", "peach"]
})

print(df)
##      name       fruit
## 0   danny     apricot
## 1   danny       apple
## 2   danny      orange
## 3  monica     apricot
## 4  monica      banana
## 5  monica  watermelon
## 6    fred       apple
## 7    fred     apricot
## 8     sam     apricot
## 9     sam       peach

I want to summarize this table to find the people who like both apricot and apple. In other words, my desired output is the following table:

# desired output
##      name  fruit
## 0   danny   True
## 1  monica  False
## 2    fred   True
## 3     sam  False

My attempt

I first defined a function that checks whether all of the given strings exist in a target list:

def is_needle_in_haystack(needle, haystack):
  return all(x in haystack for x in needle)

Examples showing that is_needle_in_haystack() works:

is_needle_in_haystack(["zebra", "lion"], ["whale", "lion", "dog"])
# False

is_needle_in_haystack(["rabbit", "cat"], ["hamster", "cat", "monkey", "rabbit"])
# True

Then I used is_needle_in_haystack() while grouping df by name:

target_fruits = ["apricot", "apple"]

df.groupby(df["name"]).agg({"fruit": lambda x: is_needle_in_haystack(target_fruits, x)})

Why do I get the following output, which is clearly not what I expected?

##         fruit
## name         
## danny   False
## fred    False
## monica  False
## sam     False

What have I done wrong in my code?

CodePudding user response:

The problem is that haystack is a Series when the function is called inside .agg. Convert it to a set before testing membership:

def is_needle_in_haystack(needle, haystack):
    return all(x in set(haystack) for x in needle)


target_fruits = ["apricot", "apple"]
res = df.groupby(df["name"]).agg({"fruit": lambda x: is_needle_in_haystack(target_fruits, x)})
print(res)

Output

        fruit
name
danny    True
fred     True
monica  False
sam     False

The in operator on a Series checks the index labels rather than the values, so it returns False here:

"hamster" in pd.Series(["hamster", "cat", "monkey", "rabbit"])
# False
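
As a quick check of this behaviour, the sketch below compares in on the Series itself against a few ways of testing the values instead. The variable names are only illustrative and are not part of the original answer:

```python
import pandas as pd

s = pd.Series(["hamster", "cat", "monkey", "rabbit"])

# `in` on a Series looks at the index labels (0, 1, 2, 3 here), not the values
label_hit = 0 in s            # 0 is an index label
value_miss = "hamster" in s   # "hamster" is a value, not a label

# Ways to test the values instead of the labels
in_list = "hamster" in s.tolist()
in_set = "hamster" in set(s)
via_isin = bool(s.isin(["hamster"]).any())

print(label_hit, value_miss, in_list, in_set, via_isin)
```

Converting to a set, as in the fix above, also keeps each membership test cheap when the list of needles is long.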

CodePudding user response:

Alternatively, we can use some set logic to accomplish the same task:

target_fruits = {"apricot", "apple"}
res = df.groupby('name', as_index=False).agg(lambda x: target_fruits.issubset(x))

Output:

     name  fruit
0   danny   True
1    fred   True
2  monica  False
3     sam  False
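
This works because issubset() accepts any iterable, and iterating over a Series yields its values rather than its index labels. Note that groupby sorts the group keys by default, so the rows come out alphabetically rather than in the order shown in the question's desired output. If that order matters, passing sort=False is one option. A minimal sketch, assuming the same df as in the question:

```python
import pandas as pd

df = pd.DataFrame({
    "name": ["danny", "danny", "danny", "monica", "monica", "monica",
             "fred", "fred", "sam", "sam"],
    "fruit": ["apricot", "apple", "orange", "apricot", "banana", "watermelon",
              "apple", "apricot", "apricot", "peach"],
})

target_fruits = {"apricot", "apple"}

# sort=False keeps the groups in order of first appearance in df
res = (
    df.groupby("name", sort=False)["fruit"]
      .agg(lambda x: target_fruits.issubset(x))  # x is the group's Series of fruits
      .reset_index()
)
print(res)
```

With sort=False the names should appear as danny, monica, fred, sam, matching the order in which they first occur in df.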