I have a data set that contains hourly data of marketing campaigns. There are several campaigns and not all of them are active during the 24 hours of the day. My goal is to eliminate all rows of active hour campaigns where I don't have the 24 data rows of a single day.
The raw data contains a lot of information like this:
I created a dummy variable with ones to be able to count single instance of rows. This is the code I applied to be able to see the results I want to get.
tmp = df.groupby(['id','date']).count()
tmp.query('Hour' > 23)
I get the following results:
These results illustrate exactly the data that I want to keep in my data frame.
How can I eliminate the data per campaign per day that does not reach 24? The objective is not the count but the real data. Therefore ungrouped from what I present in the second picture.
I appreciate the guidance.
CodePudding user response:
Use transform
to broadcast the count over all rows of your dataframe the use loc
as replacement of query
:
out = df.loc[df.groupby(['id', 'date'])['Hour'].transform('count')
.loc[lambda x: x > 23].index]
CodePudding user response:
drop the data you don't want before you do the groupby
you can use .loc or .drop, I am unfamiliar with .query