Home > Enterprise >  Counting distinct, until a certain condition based on another row is met
Counting distinct, until a certain condition based on another row is met

Time:04-20

I have the following df

Original df

Step  | CampaignSource | UserId
1         Banana          Jeff            
1         Banana          John
2         Banana          Jefferson
3         Website         Nunes
4         Banana          Jeff
5         Attendance      Nunes
6         Attendance      Antonio
7         Banana          Antonio
8         Website         Joseph
9         Attendance      Joseph
9         Attendance      Joseph

Desired output

Steps | CampaignSource | CountedDistinctUserid
  1      Website           2 (Because of different userids)
  2      Banana            1
  3      Banana            1
  4      Website           1
  5      Banana            1
  6      Attendance        1
  7      Attendance        1
  8      Attendance        1
  9      Attendance        1 (but i want to have 2 here even tho they have similar user ids and because is the 9th step)

What i want to do is impose a condition where if the step column which is made by strings equals '9', i want to count the userids as non distinct, any ideas on how i could do that? I tried applying a function but i just couldnt make it work.

What i am currently doing:

df[['Steps','UserId','CampaignSource']].groupby(['Steps','CampaignSource'],as_index=False,dropna=False).nunique()

CodePudding user response:

You can group by "Step" and use a condition on the group name:

df.groupby('Step')['UserId'].apply(lambda g: g.nunique() if g.name<9 else g.count())

output:

Step
1    2
2    1
3    1
4    1
5    1
6    1
7    1
8    1
9    2
Name: UserId, dtype: int64

As DataFrame:

(df.groupby('Step', as_index=False)
   .agg(CampaignSource=('CampaignSource', 'first'),
        CountedDistinctUserid=('CampaignSource', lambda g: g.nunique() if g.name<9 else g.count())
       )
)

output:

   Step CampaignSource  CountedDistinctUserid
0     1         Banana                      2
1     2         Banana                      1
2     3        Website                      1
3     4         Banana                      1
4     5     Attendance                      1
5     6     Attendance                      1
6     7         Banana                      1
7     8        Website                      1
8     9         Banana                      2

CodePudding user response:

You can apply different functions to different groups depending if condition match.

out = (df[['Steps','UserId','CampaignSource']]
       .groupby(['Steps','CampaignSource'],as_index=False,dropna=False)
       .apply(lambda g: g.assign(CountedDistinctUserid=( [len(g)]*len(g)
                                                         if g['Steps'].eq(9).all()
                                                         else [g['UserId'].nunique()]*len(g) ))))

print(out)

    Steps     UserId CampaignSource  CountedDistinctUserid
0       1       Jeff         Banana                      2
1       1       John         Banana                      2
2       2  Jefferson         Banana                      1
3       3      Nunes        Website                      1
4       4       Jeff         Banana                      1
5       5      Nunes     Attendance                      1
6       6    Antonio     Attendance                      1
7       7    Antonio         Banana                      1
8       8     Joseph        Website                      1
9       9     Joseph     Attendance                      2
10      9     Joseph     Attendance                      2
  • Related