count if event occurs per city

I am trying to count how often one event occurs before another event, per city. More specifically, I am trying to count how often event number 21 occurs before event number 2, and the other way around.

The dataset looks like this:

import pandas as pd

data = { 'city': ['Amsterdam', 'Vienna', 'Paris', 'Paris',  'Istanbul', 'Istanbul','Delhi', 'London', 'London', 'Barcelona', 'Barcelona'], 
        'date': [ '2022-09-01T11:34:53', '2022-09-01T13:37:37', '2022-09-01 10:44:22.000', '2022-09-01T10:39:33', '2022-09-01 16:18:24.000', '2022-09-01T16:15:14', '2022-09-01T13:28:33', '2022-09-01 15:50:54.000', '2022-09-01T15:51:07', '2022-09-01 12:24:26.000','2022-09-01T12:24:07' ], 
        'year': [ '2022', '2022', '2022', '2022', '2022', '2022', '2022', '2022', '2022', '2022','2022' ],
        'month': [9,9,9,9,9,9,9,9,9,9,9 ],
        'hour': [ 11,13,11,10,17,16,13,16,16,13,12 ],
        'eventcode': [ 'J', 'J', 'J', 'J', 'J', 'J', 'J', 'J', 'J', 'J','J' ],
        'eventnumber': [ '21', '21', '2', '21', '2', '21', '21', '2', '21', '2','21' ]}

df =  pd.DataFrame(data, columns= ['city', 'date', 'year',  'month', 'hour', 'eventcode','eventnumber' ])
print (df)

In this dataset, looking at city and date, event 2 occurs before event 21 once, and event 21 occurs before event 2 three times.

The code below works for counting the number of occurrences of event 21 happening before event 2. But when I switch the 2 and 21 in the code, it results in an infinite loop...

import numpy as np

bc=np.array(df['city'])
un_bc,bc_index,bc_count=np.unique(bc,return_counts=True,return_index=True)
new_df=pd.DataFrame()
count=0

for i,j in zip(bc_index,bc_count):
    j=j+i-1
    while i+1 <= j:
        if df.iat[i,7]==21 and df.iat[i+1,7]==2:
            count +=1
            new_df=new_df.append(df[i:i+2])
        i +=1
        
print(count)

The code above works; the code below does not.

for i,j in zip(bc_index,bc_count):
    j=j+i-1
    while i+1 <= j:
        if df.iat[i,7]==2 and df.iat[i+1,7]==21:
            count +=1
            new_df=new_df.append(df[i:i+2])
        i +=1
        
print(count)

CodePudding user response：

One solution could be as follows:

df['date'] = pd.to_datetime(df['date'])

res = df.sort_values('date').astype({'eventnumber':int})\
    .groupby('city')['eventnumber'].diff().value_counts()
res.index = res.index.map({-19:'21_before_2', 19: '2_before_21'})

print(res)

21_before_2    3
2_before_21    1
Name: eventnumber, dtype: int64

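Or, if you don't want to turn df['date'] into datetime, use .sort_values(['year', 'month', 'hour']) instead. Note that this only orders events to the hour, so two events in the same city and hour are not ordered by time.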

Explanation

First, use pd.to_datetime to turn date strings into datetime.
Next, 1) sort the df on date (using df.sort_values), 2) turn eventnumber strings into ints (using df.astype), 3) apply df.groupby on column city, 4) retrieve diff, and 5) retrieve Series.value_counts.
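The result will be a count for -19 (21 before 2) and 19 (2 before 21). The first row of each city has no predecessor, so its diff is NaN, and value_counts drops NaN by default; cities with just one event therefore contribute nothing. The diff values end up as the index, so we can use map to assign the appropriate labels. Note that this relies on 2 and 21 being the only event numbers in the data.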
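To see which city contributes which pair, the diff can be kept per row instead of being counted straight away. This is only a sketch: the names ordered, pairs, per_city and the diff column are made up for illustration, and the string normalisation of the two timestamp styles is an assumption added so that parsing does not depend on how a given pandas version handles mixed formats. Like the approach above, it assumes 2 and 21 are the only event numbers.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Amsterdam", "Vienna", "Paris", "Paris", "Istanbul", "Istanbul",
             "Delhi", "London", "London", "Barcelona", "Barcelona"],
    "date": ["2022-09-01T11:34:53", "2022-09-01T13:37:37", "2022-09-01 10:44:22.000",
             "2022-09-01T10:39:33", "2022-09-01 16:18:24.000", "2022-09-01T16:15:14",
             "2022-09-01T13:28:33", "2022-09-01 15:50:54.000", "2022-09-01T15:51:07",
             "2022-09-01 12:24:26.000", "2022-09-01T12:24:07"],
    "eventnumber": ["21", "21", "2", "21", "2", "21", "21", "2", "21", "2", "21"],
})

# Assumption: bring both timestamp styles to 'YYYY-MM-DD HH:MM:SS' before parsing.
clean = df["date"].str.replace("T", " ", regex=False).str.slice(0, 19)
df["date"] = pd.to_datetime(clean, format="%Y-%m-%d %H:%M:%S")

ordered = df.sort_values("date").astype({"eventnumber": int})
# Difference to the previous event within the same city (NaN for a city's first row).
ordered["diff"] = ordered.groupby("city")["eventnumber"].diff()

# Keep only rows that have a predecessor, then label the pair per city.
pairs = ordered.dropna(subset=["diff"])
labels = {-19: "21_before_2", 19: "2_before_21"}
per_city = pairs.set_index("city")["diff"].astype(int).map(labels)

print(per_city.sort_index())
```

Calling per_city.value_counts() on this result should give the same totals as the one-liner above.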

CodePudding user response：

Looks like what you want is this?

df = df.sort_values(by='date')

for city in df.city.unique():
    print(city)
    cond_city = df.city == city
    cond2_before_21 = (df.eventnumber.shift() == '2') & (df.eventnumber == '21')
    cond21_before_2 = (df.eventnumber.shift() == '21') & (df.eventnumber == '2')
    print((cond_city & cond21_before_2).sum())
    print((cond_city & cond2_before_21).sum())

Output:

Paris
0
1
Barcelona
0
0
London
0
0
Istanbul
0
0
Amsterdam
0
0
Delhi
0
0
Vienna
0
0
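Note that shift() in the loop above looks at the previous row of the whole frame, not the previous row of the same city, and that the date column is sorted as text, so the two timestamp styles do not interleave chronologically. A possible per-city variant is sketched below. The ts helper column and the two boolean column names are assumptions for illustration only; ts relies on the fact that strings of the form 'YYYY-MM-DD HH:MM:SS' sort chronologically.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["Amsterdam", "Vienna", "Paris", "Paris", "Istanbul", "Istanbul",
             "Delhi", "London", "London", "Barcelona", "Barcelona"],
    "date": ["2022-09-01T11:34:53", "2022-09-01T13:37:37", "2022-09-01 10:44:22.000",
             "2022-09-01T10:39:33", "2022-09-01 16:18:24.000", "2022-09-01T16:15:14",
             "2022-09-01T13:28:33", "2022-09-01 15:50:54.000", "2022-09-01T15:51:07",
             "2022-09-01 12:24:26.000", "2022-09-01T12:24:07"],
    "eventnumber": ["21", "21", "2", "21", "2", "21", "21", "2", "21", "2", "21"],
})

# Hypothetical helper column: one uniform timestamp style, so text order is time order.
df["ts"] = df["date"].str.replace("T", " ", regex=False).str.slice(0, 19)
df = df.sort_values("ts")

# Previous event within the same city (NaN for the first event of each city).
prev = df.groupby("city")["eventnumber"].shift()

df["21_before_2"] = (prev == "21") & (df["eventnumber"] == "2")
df["2_before_21"] = (prev == "2") & (df["eventnumber"] == "21")

# One row per city, one column per ordering.
counts = df.groupby("city")[["21_before_2", "2_before_21"]].sum()
print(counts)
print(counts.sum())
```

Summing the columns of counts gives the overall totals, while the per-city rows show where each ordering occurs.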

CodePudding user response：

It's not a pandas solution, but one way is to create a default dictionary of the list of events per city, and then count occurrences of one event before the other.

from collections import defaultdict
events_by_city = defaultdict(list)
for city, event_num in zip(data['city'], map(int, data['eventnumber'])):
    events_by_city[city].append(event_num)

The default dictionary looks like this:

defaultdict(list,
            {'Amsterdam': [21],
             'Vienna': [21],
             'Paris': [2, 21],
             'Istanbul': [2, 21],
             'Delhi': [21],
             'London': [2, 21],
             'Barcelona': [2, 21]})

Next, you can count the event order that you want:

print(sum(1 for events in events_by_city.values() if events[0] == 21 and 2 in events))

Output

0

In this case you're counting the entries in the default dictionary whose first item is 21 and whose events list also contains 2.

As you can see from the default dict, this never actually occurs: 21 never precedes 2 in any list, because every city that has 21 as its first event number has no event number 2.

However, reversing the events:

print(sum(1 for events in events_by_city.values() if events[0] == 2 and 21 in events))

Output

4

If you want to relax the criterion that both events must occur, drop the last part of the if condition, i.e. "and 2 in events" in the first case.

Edit to address the comment of @Lucas M. Uriarte and demonstrate the efficiency of the accepted answer. The version below also orders each city's events by date.

from collections import defaultdict
import datetime 

events_by_city = defaultdict(list)
def parse_date(text):
    # Normalise both timestamp styles ('T' separator, trailing '.000') before parsing
    return datetime.datetime.strptime(text.replace('.000', '').replace('T', ' '), '%Y-%m-%d %H:%M:%S')

for city, event_num, date_text in zip(data['city'], map(int, data['eventnumber']), data['date']):
    events_by_city[city].append((event_num, parse_date(date_text)))

print(sum(1 for events in events_by_city.values() if sorted(events, key=lambda x: x[1])[0][0] == 21 and 2 in [event[0] for event in events]))
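The line above counts cities whose earliest event is 21 and that also contain a 2. If the goal is instead to count every adjacent "X directly followed by Y" pair per city, as in the question, a small helper could look like the sketch below. The function name count_before and the use of datetime.fromisoformat are assumptions for illustration, not part of the answer above; fromisoformat accepts both timestamp styles in this dataset.

```python
from collections import defaultdict
from datetime import datetime

cities = ["Amsterdam", "Vienna", "Paris", "Paris", "Istanbul", "Istanbul",
          "Delhi", "London", "London", "Barcelona", "Barcelona"]
dates = ["2022-09-01T11:34:53", "2022-09-01T13:37:37", "2022-09-01 10:44:22.000",
         "2022-09-01T10:39:33", "2022-09-01 16:18:24.000", "2022-09-01T16:15:14",
         "2022-09-01T13:28:33", "2022-09-01 15:50:54.000", "2022-09-01T15:51:07",
         "2022-09-01 12:24:26.000", "2022-09-01T12:24:07"]
numbers = ["21", "21", "2", "21", "2", "21", "21", "2", "21", "2", "21"]

# Collect (timestamp, event number) tuples per city.
events_by_city = defaultdict(list)
for city, stamp, number in zip(cities, dates, numbers):
    events_by_city[city].append((datetime.fromisoformat(stamp), int(number)))


def count_before(first, second):
    """Count adjacent pairs (first, second) within each city, ordered by time."""
    total = 0
    for events in events_by_city.values():
        # Tuples sort by timestamp first, so this yields the chronological order.
        ordered = [number for _, number in sorted(events)]
        total += sum(1 for a, b in zip(ordered, ordered[1:])
                     if a == first and b == second)
    return total


print(count_before(21, 2))
print(count_before(2, 21))
```

Because the helper takes both event numbers as arguments, swapping the direction only means swapping the arguments.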