I have a large dataset with similar data:
>>> df = pd.DataFrame(
... {'A': ['one', 'two', 'two', 'one', 'one', 'three'],
... 'B': ['a', 'b', 'c', 'a', 'a', np.nan]})
>>> df
A B
0 one a
1 two b
2 two c
3 one a
4 one a
5 three NaN
There are two aggregation functions 'any' and 'unique':
>>> df.groupby('A')['B'].any()
A
one True
three False
two True
Name: B, dtype: bool
>>> df.groupby('A')['B'].unique()
A
one [a]
three [nan]
two [b, c]
Name: B, dtype: object
but I want to get the folowing result (or something close to it):
A
one a
three False
two True
I can do it with some complex code, but it is better for me to find appropriate function in python packages or the easiest way to solve problem. I'd be grateful if you could help me with that.
CodePudding user response:
You can aggregate Series.nunique
for first column and unique values with remove possible missing values for another columns:
df1 = df.groupby('A').agg(count=('B','nunique'),
uniq_without_NaNs = ('B', lambda x: x.dropna().unique()))
print (df1)
count uniq_without_NaNs
A
one 1 [a]
three 0 []
two 2 [b, c]
Then create mask if greater column count
by 1
and replace values by uniq_without_NaNs
if equal count
with 1
:
out = df1['count'].gt(1).mask(df1['count'].eq(1), df1['uniq_without_NaNs'].str[0])
print (out)
A
one a
three False
two True
Name: count, dtype: object
CodePudding user response:
>>> g = df.groupby("A")["B"].agg
>>> nun = g("nunique")
>>> pd.Series(np.select([nun > 1, nun == 1],
[True, g("unique").str[0]],
default=False),
index=nun.index)
A
one a
three False
two True
dtype: object
- get a hold on the group aggreagator
- count number of uniques
- if > 1, i.e., more than 1 uniques, put True
- if == 1, i.e., only 1 unique, put that unique value
- else, i.e., no uniques (full NaNs), put False
CodePudding user response:
You can combine groupby
with agg
and use boolean mask to choose the correct output:
# Your code
agg = df.groupby('A')['B'].agg(['any', 'unique'])
# Boolean mask to choose between 'any' and 'unique' column
m = agg['unique'].str.len().eq(1) & agg['unique'].str[0].notna()
# Final output
out = agg['any'].mask(m, other=agg['unique'].str[0])
Output:
>>> out
A
one a
three False
two True
>>> agg
any unique
A
one True [a]
three False [nan]
two True [b, c]
>>> m
A
one True # choose 'unique' column
three False # choose 'any' column
two False # choose 'any' column
CodePudding user response:
new_df = df.groupby('A')['B'].apply(lambda x: x.notna().any())
new_df = new_df .reset_index()
new_df .columns = ['A', 'B']
this will give you:
A B
0 one True
1 three False
2 two True
now if we want to find the values we can do:
df.groupby('A')['B'].apply(lambda x: x[x.notna()].unique()[0] if x.notna().any() else np.nan)
which gives:
A
one a
three NaN
two b