Is there a pandas aggregate function that combines the features of 'any' and 'unique'?

I have a large dataset similar to this one:

>>> import numpy as np
>>> import pandas as pd
>>> df = pd.DataFrame(
...     {'A': ['one', 'two', 'two', 'one', 'one', 'three'],
...      'B': ['a', 'b', 'c', 'a', 'a', np.nan]})
>>> df
       A    B
0    one    a
1    two    b
2    two    c
3    one    a
4    one    a
5  three  NaN

There are two aggregation functions 'any' and 'unique':

>>> df.groupby('A')['B'].any()
A
one       True
three    False
two       True
Name: B, dtype: bool

>>> df.groupby('A')['B'].unique()
A
one         [a]
three     [nan]
two      [b, c]
Name: B, dtype: object

but I want to get the following result (or something close to it):

A
one           a
three     False
two        True

I can do this with some complex code, but I would rather find an existing function in a Python package, or the simplest way to solve the problem. I'd be grateful for any help.

CodePudding user response：

You can aggregate with Series.nunique for one column and, for a second column, collect the unique values after removing any missing values:

df1 = df.groupby('A').agg(count=('B','nunique'), 
                          uniq_without_NaNs = ('B', lambda x: x.dropna().unique()))
print (df1)
       count uniq_without_NaNs
A                             
one        1               [a]
three      0                []
two        2            [b, c]

Then build a mask that is True where count is greater than 1, and replace it with the single value from uniq_without_NaNs where count equals 1:

out = df1['count'].gt(1).mask(df1['count'].eq(1), df1['uniq_without_NaNs'].str[0])
print (out)
A
one          a
three    False
two       True
Name: count, dtype: object
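
Since the question mentions a large dataset, here is a possible variant of the same idea that avoids the Python-level lambda. It is only a sketch: it swaps the dropna().unique() step for the built-in 'first' aggregation, which returns the first non-missing value in each group, and the names stats and out are just illustrative.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': ['one', 'two', 'two', 'one', 'one', 'three'],
     'B': ['a', 'b', 'c', 'a', 'a', np.nan]})

# 'nunique' ignores NaN; 'first' returns the first non-missing value per group.
stats = df.groupby('A')['B'].agg(['nunique', 'first'])

# Start from "more than one distinct value?" (cast to object so the column
# can hold strings), then substitute the value itself where exactly one
# distinct value exists.
more_than_one = stats['nunique'].gt(1).astype(object)
out = more_than_one.mask(stats['nunique'].eq(1), stats['first'])
print(out)
```

When a group has exactly one distinct non-missing value, 'first' and the single unique value coincide, so this should give the same result as the two-step version above.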

CodePudding user response：

>>> g = df.groupby("A")["B"].agg
>>> nun = g("nunique")
>>> pd.Series(np.select([nun > 1, nun == 1],
...                     [True, g("unique").str[0]],
...                     default=False),
...           index=nun.index)

A
one          a
three    False
two       True
dtype: object

Get a handle on the group aggregator (g), then
count the number of unique values in each group (nun):
- if > 1, i.e., more than one unique value, put True
- if == 1, i.e., exactly one unique value, put that unique value
- else, i.e., no unique values (all NaN), put False

CodePudding user response：

You can combine groupby with agg and use a boolean mask to choose the correct output:

# Your code
agg = df.groupby('A')['B'].agg(['any', 'unique'])

# Boolean mask to choose between 'any' and 'unique' column
m = agg['unique'].str.len().eq(1) & agg['unique'].str[0].notna()

# Final output
out = agg['any'].mask(m, other=agg['unique'].str[0])

Output:

>>> out
A
one          a
three    False
two       True

>>> agg
         any  unique
A                   
one     True     [a]
three  False   [nan]
two     True  [b, c]

>>> m
A
one       True  # choose 'unique' column
three    False  # choose 'any' column
two      False  # choose 'any' column
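
One edge case may be worth checking with this mask: 'unique' keeps NaN, so a group that mixes a single real value with a missing one has two entries in its array and falls back to the 'any' column. The sketch below uses made-up data (df_edge is hypothetical, not from the question) to show this, together with one possible adjustment that drops missing values before measuring the length.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 'one' mixes a single real value with a missing one.
df_edge = pd.DataFrame({'A': ['one', 'one', 'two', 'two', 'three'],
                        'B': ['a', np.nan, 'b', 'c', np.nan]})

agg = df_edge.groupby('A')['B'].agg(['any', 'unique'])

# Mask as in the answer above: for 'one', unique is ['a', nan] (length 2),
# so the mask is False and the 'any' value (True) is kept.
m_raw = agg['unique'].str.len().eq(1) & agg['unique'].str[0].notna()
out_raw = agg['any'].mask(m_raw, other=agg['unique'].str[0])

# Possible adjustment: drop missing values from each array first.
uniq = agg['unique'].apply(lambda arr: [v for v in arr if pd.notna(v)])
m = uniq.str.len().eq(1)
out = agg['any'].astype(object).mask(m, other=uniq.str[0])

print(out_raw)
print(out)
```

Whether 'one' should come out as True or as 'a' here depends on how missing values are meant to be treated; the nunique-based answers ignore NaN, which corresponds to the adjusted version.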

CodePudding user response：

new_df = df.groupby('A')['B'].apply(lambda x: x.notna().any())
new_df = new_df.reset_index()
new_df.columns = ['A', 'B']

This will give you:

       A      B
0    one   True
1  three  False
2    two   True

Now, if we want to find the values, we can do:

df.groupby('A')['B'].apply(lambda x: x[x.notna()].unique()[0] if x.notna().any() else np.nan)

which gives the first non-missing value of each group (NaN when there is none):

A
one        a
three    NaN
two        b
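
The result above differs from the requested output for 'two' (first value instead of True) and 'three' (NaN instead of False). A minimal sketch of how the two snippets could be merged into a single per-group function follows; any_or_unique is a hypothetical helper name, not a pandas built-in.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {'A': ['one', 'two', 'two', 'one', 'one', 'three'],
     'B': ['a', 'b', 'c', 'a', 'a', np.nan]})


def any_or_unique(s):
    """Hypothetical helper: return the value if the group has exactly one
    distinct non-missing value, True if it has several, False if none."""
    values = s.dropna().unique()
    if len(values) == 1:
        return values[0]
    return len(values) > 1


out = df.groupby('A')['B'].agg(any_or_unique)
print(out)
```

Because the function runs once per group in Python, it is likely to be slower than the vectorized answers on a dataset with many groups, but it keeps the three rules in one readable place.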