I am using Pandas and have a dataframe with a date column, a unique ID column, and a dozen or so columns with numeric values. The following works well enough as a reproducible example:
df = pd.DataFrame(data = {
'day': ['2020-10-01', '2020-10-01', '2020-10-01', '2020-10-01', '2020-10-01', '2020-10-01',
'2020-10-02', '2020-10-02', '2020-10-02', '2020-10-02', '2020-10-02'],
'distinct_id': ['a', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'c', 'c'],
'value': [71, 72, 73, 74, 74, 73, 72, 71, 71, 72, 73]
})
I would like to group by day and by distinct_id only, leaving out all the other columns. I just need to know that on a given day, a given distinct_id fired an event.
# Desired output
day distinct_id
0 2020-10-01 a
1 2020-10-01 b
2 2020-10-01 c
3 2020-10-02 a
4 2020-10-02 b
5 2020-10-02 c
If this were SQL, I'd write something similar to:
SELECT DISTINCT day, distinct_id FROM df
In Pandas, I can't figure out how to do it. The following is the groupby object that I think I want:
df.groupby(['day', 'distinct_id'])
# Output: <pandas.core.groupby.generic.DataFrameGroupBy object at 0x7ffd71bbd430>
However, all of the examples I can find for how to turn that into a dataframe involve a numeric aggregate of some kind. I could theoretically do this and then drop the resulting columns, like this...
df2 = df.groupby(['day', 'distinct_id']).count().reset_index()
# My actual example has a dozen or so columns to drop
cols = ['day', 'distinct_id']
df2 = df2.drop([x for x in df2.columns if x not in cols], axis=1)
...but that feels like a workaround. Is there a more straightforward way to achieve this?
CodePudding user response:
We can select just those two columns and then call drop_duplicates:
df[['day', 'distinct_id']].drop_duplicates()
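One detail worth noting: drop_duplicates keeps the original row index, so the surviving rows are labeled 0, 3, 5, 6, 8, 9 rather than 0 through 5. Chaining reset_index(drop=True) renumbers them to match the desired output exactly. A runnable sketch using the example frame from the question:

```python
import pandas as pd

df = pd.DataFrame({
    'day': ['2020-10-01'] * 6 + ['2020-10-02'] * 5,
    'distinct_id': ['a', 'a', 'a', 'b', 'b', 'c', 'a', 'a', 'b', 'c', 'c'],
    'value': [71, 72, 73, 74, 74, 73, 72, 71, 71, 72, 73],
})

# Equivalent of: SELECT DISTINCT day, distinct_id FROM df
result = df[['day', 'distinct_id']].drop_duplicates().reset_index(drop=True)
print(result)
```

This keeps only the two key columns, drops repeated (day, distinct_id) pairs, and renumbers the index 0..5, matching the desired output above.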