I am trying to count how often an event occurs before another event occurs per city. To be more specific I am trying to count how often event number 21 occurrs before 2, and the other way around.
The dataset looks like this
import pandas as pd
data = { 'city': ['Amsterdam', 'Vienna', 'Paris', 'Paris', 'Istanbul', 'Istanbul','Delhi', 'London', 'London', 'Barcelona', 'Barcelona'],
'date': [ '2022-09-01T11:34:53', '2022-09-01T13:37:37', '2022-09-01 10:44:22.000', '2022-09-01T10:39:33', '2022-09-01 16:18:24.000', '2022-09-01T16:15:14', '2022-09-01T13:28:33', '2022-09-01 15:50:54.000', '2022-09-01T15:51:07', '2022-09-01 12:24:26.000','2022-09-01T12:24:07' ],
'year': [ '2022', '2022', '2022', '2022', '2022', '2022', '2022', '2022', '2022', '2022','2022' ],
'month': [9,9,9,9,9,9,9,9,9,9,9 ],
'hour': [ 11,13,11,10,17,16,13,16,16,13,12 ],
'eventcode': [ 'J', 'J', 'J', 'J', 'J', 'J', 'J', 'J', 'J', 'J','J' ],
'eventnumber': [ '21', '21', '2', '21', '2', '21', '21', '2', '21', '2','21' ]}
df = pd.DataFrame(data, columns= ['city', 'date', 'year', 'month', 'hour', 'eventcode','eventnumber' ])
print (df)
In this dataset, when looking at the city and date, event 2 occurs 1 time before event 21, and event 21 occurs 3 times before event 2.
The code below works for counting the number of occurences of event 21 happening before 2. But when I switch the 2 and 21 in the code, it results in an infinite loop...
bc=np.array(df['city'])
un_bc,bc_index,bc_count=np.unique(bc,return_counts=True,return_index=True)
new_df=pd.DataFrame()
count=0
for i,j in zip(bc_index,bc_count):
j=j i-1
while i 1 <= j:
if df.iat[i,7]==21 and df.iat[i 1,7]==2:
count =1
new_df=new_df.append(df[i:i 2])
i =1
print(count)
Code above works, code below does not.
for i,j in zip(bc_index,bc_count):
j=j i-1
while i 1 <= j:
if df.iat[i,7]==2 and df.iat[i 1,7]==21:
count =1
new_df=new_df.append(df[i:i 2])
i =1
print(count)
CodePudding user response:
One solution could be as follows:
df['date'] = pd.to_datetime(df['date'])
res = df.sort_values('date').astype({'eventnumber':int})\
.groupby('city')['eventnumber'].diff().value_counts()
res.index = res.index.map({-19:'21_before_2', 19: '2_before_21'})
print(res)
21_before_2 3
2_before_21 1
Name: eventnumber, dtype: int64
Or, if you don't want to turn df['date']
into datetime
, use .sort_values(['year', 'month', 'hour'])
.
Explanation
- First, use
pd.to_datetime
to turn date strings into datetime. - Next, 1) sort the
df
ondate
(usingdf.sort_values
), 2) turneventnumber
strings intoints
(usingdf.astype
), 3) applydf.groupby
on columncity
, 4) retrievediff
, and 5) retrieveSeries.value_counts
. - The result will be a count for
-19
(21 before 2) and19
(2 before 21), with all cities with just one value being dropped. These values will end up as the index, so we can usemap
to assign the appropriate labels.
CodePudding user response:
Looks like what you want is this?
df = df.sort_values(by='date')
for city in df.city.unique():
print(city)
cond_city = df.city == city
cond2_before_21 = (df.eventnumber.shift() == '2') & (df.eventnumber == '21')
cond21_before_2 = (df.eventnumber.shift() == '21') & (df.eventnumber == '2')
print((cond_city & cond21_before_2).sum())
print((cond_city & cond2_before_21).sum())
Output:
Paris
0
1
Barcelona
0
0
London
0
0
Istanbul
0
0
Amsterdam
0
0
Delhi
0
0
Vienna
0
0
CodePudding user response:
It's not a pandas
solution, but one way is to create a default dictionary of the list of events per city, and then count occurrences of one event before the other.
from collections import defaultdict
events_by_city = defaultdict(list)
for city, event_num in zip(data['city'], map(int, data['eventnumber'])):
events_by_city[city].append(event_num)
The default dictionary looks like;
defaultdict(list,
{'Amsterdam': [21],
'Vienna': [21],
'Paris': [2, 21],
'Istanbul': [2, 21],
'Delhi': [21],
'London': [2, 21],
'Barcelona': [2, 21]})
Next you can count for the event that you want;
print(sum(1 for events in events_by_city.values() if events[0] == 21 and 2 in events))
Output
0
In this case you're counting occurrences in the default dictionary of the first item being 21
and also 2
being in the events list.
As you can see from the default dict, this doesn't actually ever occur, 21 never actually precedes 2 in any list, because all cities that have 21 as the first event number don't have an event number 2.
However, reversing events;
print(sum(1 for events in events_by_city.values() if events[0] == 2 and 21 in events))
Output
4
If you want to relax the criterion that both events must occur in then drop the last if
condition, ie and 2 in events
in the first case.
Edit to address the comment of @Lucas M. Uriarte and demonstrate the efficiency of the accepted answer.
from collections import defaultdict
import datetime
events_by_city = defaultdict(list)
for city, event_num, date_time in zip(data['city'], map(int, data['eventnumber']), map(lambda x: datetime.datetime.strptime(x.replace('.000','').replace('T',' '), '%Y-%m-%d %H:%M:%S'), data['date'])):
events_by_city[city].append((event_num, date_time))
print(sum(1 for events in events_by_city.values() if sorted(events, key=lambda x: x[1])[0][0] == 21 and 2 in [event[0] for event in events]))