Home > Software engineering >  Using Pandas groupby without a numeric value
Using Pandas groupby without a numeric value

Time:10-20

I am using Pandas and have a dataframe with a date column, a unique ID column, and a dozen or so columns with numeric values. The following works well enough as a reproducible example:

df = pd.DataFrame(data = {
    'day': ['2020-10-01', '2020-10-01', '2020-10-01', '2020-10-01', '2020-10-01', '2020-10-01',
            '2020-10-02', '2020-10-02', '2020-10-02', '2020-10-02', '2020-10-02'],
    'distinct_id': ['a', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'c', 'c'],
    'value': [71, 72, 73, 74, 74, 73, 72, 71, 71, 72, 73]
})

I would like to group by day and by distinct_id only, leaving out all the other columns. I just need to know that on a given day, a given distinct_id fired an event.

# Desired output
    day         distinct_id
0   2020-10-01  a
1   2020-10-01  b
2   2020-10-01  c
3   2020-10-02  a
4   2020-10-02  b
5   2020-10-02  c

If this were SQL, I'd write something similar to:

SELECT DISTINCT day, distinct_id FROM df

In Pandas, I can't figure out how to do it. The following is the groupby object that I think I want:

df.groupby(['day', 'distinct_id'])

# Output: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ffd71bbd430>

However, all of the examples I can find for how to turn that into a dataframe involve a numeric aggregate of some kind. I could theoretically do this and then drop the resulting columns, like this...

df2 = df.groupby(['day', 'distinct_id']).count().reset_index()

# My actual example has a dozen or so columns to drop
cols = ['day', 'distinct_id']
df2.drop([x for x in df.columns if not x in cols], axis = 1)

...but that feels like a workaround. Is there a more straightforward way to achieve this?

CodePudding user response:

We can select those columns and then drop_duplicates.

df = pd.DataFrame(data = {
    'day': ['2020-10-01', '2020-10-01', '2020-10-01', '2020-10-01', '2020-10-01', '2020-10-01',
            '2020-10-02', '2020-10-02', '2020-10-02', '2020-10-02', '2020-10-02'],
    'distinct_id': ['a', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'c', 'c'],
    'value': [71, 72, 73, 74, 74, 73, 72, 71, 71, 72, 73]
})

df[['day', 'distinct_id']].drop_duplicates()
  • Related