This is the data frame
Date Name Attempt Count
0 2015-01-02 Adam Yes
1 2015-01-29 Adam Yes
2 2015-02-02 David No
3 2015-03-02 David Yes
4 2015-03-02 Hardik Yes
5 2015-03-02 David No
Now I need to count the number of attempts of a person month-wise. Want to count only if attempt value is 'yes' and insert the value in count column.
This is the expected output
Date Name Attempt Count
0 2015-01-02 Adam Yes 2
1 2015-01-29 Adam Yes 2
2 2015-02-02 David No 0
3 2015-03-02 David Yes 1
4 2015-03-02 Hardik Yes 1
5 2015-03-02 David No 1
CodePudding user response:
Convert Date
column to datetime type if not already, then assign a temporary column as month extracting the month value from the Date
column.
Then filter out rows with Yes
value for Attempt
column, and group it by month
, and Name
columns, then call size
to get count, finally call to_frame
passing count
as column name to create a dataframe out of it, and merge it with the dataframe with left
merge passing the common keys for on
parameter, then fillna(0)
, drop month column, and convert count column as integer type.
df['Date'] = pd.to_datetime(df['Date'])
df['month']=df['Date'].dt.month
(df.merge(df[df['Attempt'].eq('Yes')]
.groupby(['month', 'Name'])
.size()
.to_frame('count'),
how='left',
on=['month', 'Name']
).fillna(0).drop(columns='month').astype({'count':int})
)
Date Name Attempt count
0 2015-01-02 Adam Yes 2
1 2015-01-29 Adam Yes 2
2 2015-02-02 David No 0
3 2015-03-02 David Yes 1
4 2015-03-02 Hardik Yes 1
5 2015-03-02 David No 1
CodePudding user response:
You will need to group the date and names on a month frequency and count how many attempts were "Yes" by previously filtering them out:
res = df[df["Attempt"].eq("Yes")]
.groupby([pd.Grouper(key='date', freq='M'), 'Name'])["Attempt"]
.count()
Then you can assign the data using map
as:
my_map = dict(zip(res.index.get_level_values('Name'), res.values))
df["Counts"] = df["Name"].map(my_map)
Output:
date Name Attempt Counts
0 2015-01-02 Adam Yes 2
1 2015-01-29 Adam Yes 2
2 2015-02-02 David No 1
3 2015-03-02 David Yes 1
4 2015-03-02 Hardik Yes 1
5 2015-03-02 David No 1
(Note that David has 1 attempt on the '2015-02-02' that you don't want apparently. In order to solve that, just use simple slicing rules as:
df.loc[df["Attempt"].eq("Yes"), "Counts"] = df.loc[df["Attempt"]
.eq("Yes"), "Name"]
.map(my_map)
In this case you would get:
date Name Attempt Counts
0 2015-01-02 Adam Yes 2.0
1 2015-01-29 Adam Yes 2.0
2 2015-02-02 David No NaN
3 2015-03-02 David Yes 1.0
4 2015-03-02 Hardik Yes 1.0
5 2015-03-02 David No NaN
After seeing the nice answer by @ThePyGuy. I was interested on checking performance of both his and my suggestions (none "correcter" than the other I hope ;)) and found out his suggestion to take 4.53 ms and mine 2.39 ms. Note that these statistics are for this small data set. It might be performance will scale weirdly with dataframe size.