I have a dataset like the one below. I need to count ticket_id values with status 'closed', grouped by country and product, and I also need to derive the week start (Sunday) and week end from the ticket_time column. Input dataframe:
id status ticket_time product country
1260057 open 2021-10-03 01:20:00 Broadband Grenada
2998178 open 2021-10-06 00:00:00 Fixed Voice Bahamas
3762949 closed 2021-10-04 00:00:00 Fixed Voice St Lucia
3766608 closed 2021-10-10 00:00:00 Broadband St Lucia
3767125 closed 2021-10-03 00:00:00 TV Antigua
etc.................
Output dataframe EXAMPLE:
country product week_startdate week_enddate Value
Antigua TV 2021-10-03 00:00:00 2021-10-10 00:00:00 count(id) with status=closed
St Lucia Broadband ... ... ...
St Lucia Fixed Voice ... ... ...
For each country and each product, I should get the count of ids that have 'closed' status. Will a groupby on country and product work?
INPUT Dataset (as a dict; Timestamp is pandas.Timestamp)
{'ticket_id': [1260057,
2998178,
3762949,
3766608,
3767125,
6050009,
6050608,
6050972,
6052253,
6053697,
6053812,
6053848,
6053970,
6054363,
6054549,
6055529,
6057303,
6057364,
6058428,
6101321,
6103451,
6103688,
6105180,
693170,
1259365,
1259466,
1260843,
3762579,
3762778,
3764140,
3768850,
6050528,
6050595,
6051099,
6053704,
6054013,
6054582,
6055323,
6056096,
6056363,
6057701,
6058251,
6058323,
6102386,
6103121,
6104844,
6104935,
692914,
1260730,
2997275],
'status': ['open',
'open',
'closed',
'closed',
'closed',
'closed',
'open',
'open',
'closed',
'open',
'open',
'closed',
'open',
'open',
'open',
'open',
'open',
'open',
'open',
'closed',
'closed',
'closed',
'open',
'open',
'closed',
'closed',
'open',
'closed',
'closed',
'closed',
'open',
'closed',
'closed',
'closed',
'open',
'open',
'open',
'open',
'open',
'open',
'open',
'open',
'open',
'cancelled',
'open',
'open',
'open',
'closed',
'closed',
'open'],
'ticket_start_time': [Timestamp('2021-10-04 01:20:00'),
Timestamp('2021-10-02 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-02 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-01 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-03 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-05 00:00:00'),
Timestamp('2021-10-02 00:00:00'),
Timestamp('2021-10-02 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-04 00:00:00'),
Timestamp('2021-10-01 00:00:00')],
'product_type': ['Broadband',
'Fixed Voice',
'Fixed Voice',
'Broadband',
'TV',
'TV',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'TV',
'TV',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Fixed Voice',
'Broadband',
'Fixed Voice',
'TV',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'TV',
'Fixed Voice',
'TV',
'TV',
'TV',
'TV',
'TV',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Broadband',
'Fixed Voice',
'Fixed Voice'],
'org_country': ['Grenada',
'Bahamas',
'St Lucia',
'St Lucia',
'Antigua',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Trinidad',
'Trinidad',
'Trinidad',
'Trinidad',
'Curacao',
'Grenada',
'Grenada',
'Grenada',
'St Lucia',
'St Lucia',
'St Vincent',
'St Lucia',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Jamaica',
'Trinidad',
'Trinidad',
'Trinidad',
'Trinidad',
'Curacao',
'Grenada',
'Bahamas']}
CodePudding user response:
Modified answer
After many explanations, it seems that the OP is looking for week start and end as a result of aggregate, not as a grouping.
Therefore, here is the modified answer:
import pandas as pd

weekday = 6  # Sunday
wo = pd.offsets.Week(weekday=weekday, normalize=True)
# Adding the offset rolls each timestamp forward to the next Sunday
# (a date already on a Sunday rolls to the following Sunday).
t = df['ticket_start_time'] + wo
gbcols = ['org_country', 'product_type']
agg_dict = {
'ticket_id': 'count',
'week_startdate': 'first',
'week_enddate': 'last',
}
out = (
df
.assign(week_startdate=t - wo, week_enddate=t)
    .groupby(['ticket_id', 'status'] + gbcols).agg(agg_dict)
.query("status == 'closed'").groupby(gbcols).agg(agg_dict)
.reset_index()
.rename(columns={'ticket_id': 'Value'})
)
From the provided 'input data', we get:
>>> out
org_country product_type Value week_startdate week_enddate
0 Antigua TV 1 2021-10-03 2021-10-10
1 Curacao Broadband 1 2021-10-03 2021-10-10
2 Grenada Broadband 1 2021-09-26 2021-10-03
3 Grenada Fixed Voice 2 2021-09-26 2021-10-10
4 Jamaica Broadband 5 2021-09-26 2021-10-10
5 Jamaica TV 1 2021-09-26 2021-10-03
6 St Lucia Broadband 2 2021-09-26 2021-10-10
7 St Lucia Fixed Voice 1 2021-09-26 2021-10-03
8 St Lucia TV 1 2021-09-26 2021-10-03
9 St Vincent Broadband 1 2021-09-26 2021-10-03
10 Trinidad Broadband 2 2021-10-03 2021-10-10
11 Trinidad TV 1 2021-09-26 2021-10-03
Explanation
First, the trick of adding and then subtracting the Week offset wo gets around the problem of subtracting an offset when the date sits exactly on the offset boundary (a Sunday). For example:
t = pd.to_datetime([
'2021-10-02', '2021-10-03', '2021-10-03 01:00',
'2021-10-04', '2021-10-09 23:59:59', '2021-10-10']
)
>>> pd.DataFrame(dict(t=t, ws=t + wo - wo, we=t + wo))
t ws we
0 2021-10-02 00:00:00 2021-09-26 2021-10-03
1 2021-10-03 00:00:00 2021-10-03 2021-10-10
2 2021-10-03 01:00:00 2021-10-03 2021-10-10
3 2021-10-04 00:00:00 2021-10-03 2021-10-10
4 2021-10-09 23:59:59 2021-10-03 2021-10-10
5 2021-10-10 00:00:00 2021-10-10 2021-10-17
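As an aside, and not the answer's method: weekly Periods anchored on Saturday ('W-SAT') start on Sunday, so `Period.start_time` yields the Sunday week start directly, boundary dates included. A hedged alternative sketch:

```python
import pandas as pd

# Alternative to the offset trick (not the answer's approach): a 'W-SAT'
# weekly period ends on Saturday, hence starts on Sunday, so start_time
# gives the Sunday week start, including for dates already on a Sunday.
t = pd.to_datetime(['2021-10-02', '2021-10-03', '2021-10-04', '2021-10-10'])
week_start = t.to_period('W-SAT').start_time
print(week_start)  # Sundays: 2021-09-26, 2021-10-03, 2021-10-03, 2021-10-10
```

Note that a period's end falls on Saturday, whereas the answer above reports the following Sunday as the week end, so the two are not interchangeable as-is.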
Next, we want the distinct count of ticket_id where the status is 'closed'. This is a bit trickier; we do it with two groupbys. The first one tallies the counts for each (ticket, status) group (together with the other grouping columns). There could be multiple rows for the same ticket being open or closed, but we want to count each ticket only once.
tmp = (
df
.assign(week_startdate=t - wo, week_enddate=t)
    .groupby(['ticket_id', 'status'] + gbcols)
.agg(agg_dict)
)
>>> tmp.head()
ticket_id week_startdate \
ticket_id status org_country product_type
692914 closed Curacao Broadband 1 2021-10-03
693170 open Curacao Broadband 1 2021-10-03
1259365 closed Grenada Fixed Voice 1 2021-09-26
1259466 closed Grenada Broadband 1 2021-09-26
1260057 open Grenada Broadband 1 2021-10-03
week_enddate
ticket_id status org_country product_type
692914 closed Curacao Broadband 2021-10-10
693170 open Curacao Broadband 2021-10-10
1259365 closed Grenada Fixed Voice 2021-10-03
1259466 closed Grenada Broadband 2021-10-03
1260057 open Grenada Broadband 2021-10-10
Finally, we simply aggregate again, grouping only on the gbcols columns and keeping the rows of that tally where status is 'closed', which gives the complete expression above.
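To see the dedupe-then-count idea in isolation, here is a self-contained toy sketch (made-up data; column names mirror the answer's):

```python
import pandas as pd

# Toy data: ticket 1 appears twice with status 'closed' in the same group,
# but should be counted only once.
toy = pd.DataFrame({
    'ticket_id': [1, 1, 2, 3],
    'status': ['closed', 'closed', 'closed', 'open'],
    'org_country': ['Jamaica'] * 4,
    'product_type': ['Broadband'] * 4,
})

# First groupby collapses duplicate (ticket, status) rows to a single row;
# second groupby counts distinct closed tickets per (country, product).
tally = toy.groupby(['ticket_id', 'status', 'org_country', 'product_type']).size()
closed = (
    tally.reset_index()
    .query("status == 'closed'")
    .groupby(['org_country', 'product_type'])['ticket_id']
    .count()
)
print(closed)  # Jamaica / Broadband -> 2 (tickets 1 and 2, each counted once)
```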
Addendum
The OP would like the week_startdate and week_enddate columns to appear as strings with full H:M:S resolution. For this:
for k in ['week_startdate', 'week_enddate']:
    out[k] = out[k].dt.strftime('%F %T')  # '%F %T' == '%Y-%m-%d %H:%M:%S'
And now:
>>> out
org_country product_type Value week_startdate week_enddate
0 Antigua TV 1 2021-10-03 00:00:00 2021-10-10 00:00:00
1 Curacao Broadband 1 2021-10-03 00:00:00 2021-10-10 00:00:00
2 Grenada Broadband 1 2021-09-26 00:00:00 2021-10-03 00:00:00
3 Grenada Fixed Voice 2 2021-09-26 00:00:00 2021-10-10 00:00:00
4 Jamaica Broadband 5 2021-09-26 00:00:00 2021-10-10 00:00:00
5 Jamaica TV 1 2021-09-26 00:00:00 2021-10-03 00:00:00
6 St Lucia Broadband 2 2021-09-26 00:00:00 2021-10-10 00:00:00
7 St Lucia Fixed Voice 1 2021-09-26 00:00:00 2021-10-03 00:00:00
8 St Lucia TV 1 2021-09-26 00:00:00 2021-10-03 00:00:00
9 St Vincent Broadband 1 2021-09-26 00:00:00 2021-10-03 00:00:00
10 Trinidad Broadband 2 2021-10-03 00:00:00 2021-10-10 00:00:00
11 Trinidad TV 1 2021-09-26 00:00:00 2021-10-03 00:00:00
CodePudding user response:
IIUC:
import pandas as pd

def week_start_day(s):
    # Days back to the preceding (or same) Sunday; reduce to one value per group.
    return (s - pd.to_timedelta((s.dt.weekday + 1) % 7, unit='D')).min()

def week_end_day(s):
    # Days forward to the following (or same) Saturday; reduce to one value per group.
    return (s + pd.to_timedelta((5 - s.dt.weekday) % 7, unit='D')).max()

df['ticket_time'] = pd.to_datetime(df['ticket_time'])
df = (
    df[df['status'] == 'closed']
    .groupby(['country', 'product'])
    .agg(week_startdate=pd.NamedAgg(column='ticket_time', aggfunc=week_start_day),
         week_enddate=pd.NamedAgg(column='ticket_time', aggfunc=week_end_day),
         Value=pd.NamedAgg(column='id', aggfunc='size'))
    .reset_index()
)
OUTPUT
country product week_startdate week_enddate Value
0 Antigua TV 2021-10-03 2021-10-09 1
1 St Lucia Broadband 2021-10-10 2021-10-16 1
2 St Lucia Fixed Voice 2021-10-03 2021-10-09 1
Setup
data = {'id': [1260057, 2998178, 3762949, 3766608, 3767125],
'status': ['open', 'open', 'closed', 'closed', 'closed'],
'ticket_time': ['2021-10-03 01:20:00', '2021-10-06 00:00:00', '2021-10-04 00:00:00',
'2021-10-10 00:00:00', '2021-10-03 00:00:00'],
'product': ['Broadband', 'Fixed Voice', 'Fixed Voice', 'Broadband', 'TV'],
'country': ['Grenada', 'Bahamas', 'St Lucia', 'St Lucia', 'Antigua']}
df = pd.DataFrame.from_dict(data)
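As a quick, self-contained sanity check of the Sunday-start / Saturday-end weekday arithmetic (a standalone snippet, not part of the answer above):

```python
import pandas as pd

# Three dates inside the same Sunday-to-Saturday week.
s = pd.Series(pd.to_datetime(['2021-10-03', '2021-10-06', '2021-10-09']))

# Monday is weekday 0 and Sunday is 6, so (weekday + 1) % 7 is the number of
# days back to the preceding (or same) Sunday.
week_start = s - pd.to_timedelta((s.dt.weekday + 1) % 7, unit='D')
# (5 - weekday) % 7 is the number of days forward to the following
# (or same) Saturday.
week_end = s + pd.to_timedelta((5 - s.dt.weekday) % 7, unit='D')

print(week_start.dt.strftime('%Y-%m-%d').tolist())  # all 2021-10-03
print(week_end.dt.strftime('%Y-%m-%d').tolist())    # all 2021-10-09
```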