Ok so I have a first dataframe df1:
|timestamp |ip |
|2022-01-06 11:58:53 00:00|1.1.1.5. |
|2022-01-08 03:56:35 00:00|10.10.10.24|
|2022-01-09 22:29:30 00:00|3.3.3.89. |
|2022-03-08 22:37:52 00:00|8.8.8.88. |
And a second dataframe, df2:
|timestamp |other|
|2022-01-07 22:08:59 00:00|other|
|2022-01-07 23:08:59 00:00|other|
|2022-01-09 17:04:09 00:00|other|
|2022-03-05 17:04:09 00:00|other|
And I would like to count how many rows there is in df2 depending on the 2 consecutive timestamps in df1, meaning:
|timestamp |ip |count|
|2022-01-06 11:58:53 00:00|1.1.1.5 |NaN |
|2022-01-08 03:56:35 00:00|10.10.10.24|2 |
|2022-01-09 22:29:30 00:00|3.3.3.89 |1 |
|2022-03-08 22:37:52 00:00|8.8.8.88 |1 |
What I tried is to first create another column in df1 with the previous timestamp with this:
df1 = df1.assign(timestamp_b4=df1.timestamp.shift(1)).fillna({'timestamp_b4': df1.timestamp})
which gives me:
|timestamp |ip |timestamp_b4 |
|2022-01-06 11:58:53 00:00|1.1.1.5 |2022-03-08 22:37:52 00:00|
|2022-01-08 03:56:35 00:00|10.10.10.24|2022-01-06 11:58:53 00:00|
|2022-01-09 22:29:30 00:00|3.3.3.89 |2022-01-08 03:56:35 00:00|
|2022-03-08 22:37:52 00:00|8.8.8.88 |2022-01-09 22:29:30 00:00|
and then do some sort of
s = (df2[df2['timestamp'].between(df1['timestamp'], df1['timestamp_b4'])].size())
But unfortunately it does not work since pandas requires comparing identically-labeled objects.
Is there a good pandas/pythonic way to do that?
Thanks
CodePudding user response:
Here is one approach:
df1.merge(df2, on='timestamp', how='outer').sort_values('timestamp') \
.assign(c1=df1.loc[~df1['ip'].isna()]['ip'], c2=lambda x: x['c1'].bfill() ) \
.assign(count=lambda x: x.groupby('c2').apply('count').reset_index(drop=True)['timestamp']-1) \
.drop(['other','c1','c2'], axis=1).dropna().astype({'count': 'int32'})
timestamp ip count
0 2022-01-06 11:58:53 00:00 1.1.1.5. 0
1 2022-01-08 03:56:35 00:00 10.10.10.24 2
2 2022-01-09 22:29:30 00:00 3.3.3.89. 1
3 2022-03-08 22:37:52 00:00 8.8.8.88. 1
This approach merges then sorts by timestamp then creates another column - c2 - that is used to copy the df1 timestamp and then backfill it against the df2 timestamps. From there the instances are grouped by df1 timestamp (reflected in the c2 column) and counted. In other words, the backfill of the df1 timestamp allows it to be used as a grouping key for counting the preceding df2 timestamps. After that the df is trimmed back down to match the output requirements.
CodePudding user response:
Try this, it's an example of what you can do to find the solution
import pandas as pd
table1 = {
'timestamp':['2022-01-06 11:58:53 00:00','2022-01-08 03:56:35 00:00',
'2022-01-09 22:29:30 00:00','2022-03-08 22:37:52 00:00'],
'other':['other','other','other','other']
}
df1 = pd.DataFrame(table1)
table2 = {
'timestamp':['2022-01-07 23:08:59 00:00','2022-01-07 22:08:59 00:00',
'2022-03-05 17:04:09 00:00','2022-01-09 17:04:09 00:00'],
'ip':['1.1.1.5.','10.10.10.24','3.3.3.89.','8.8.8.88.']
}
df2 = pd.DataFrame(table2)
print(f'\n\n-------------df1-----------\n\n')
print(df2)
print(f'\n\n-------------df2-----------\n\n')
print(df1)
listdf1 = df1['timestamp'].values.tolist()
def func(line):
cont = df1.loc[df1['timestamp'].str.contains(line[0][:7], case = False)]
temp = line.name - 1
if temp == -1:
temp = 0
try :
cont = [cont['timestamp'].iloc[temp],line[0]]
except:
cont = [line[0],line[0]]
cont2 = df2['timestamp'].loc[df2['timestamp'].str.contains(line[0][:7], case = False)]
repetitions = 0
for x in cont2:
if int(x[8:10]) >= int(cont[0][8:10]) and int(x[8:10]) <= int(cont[1][8:10]) and int(x[8:10]) <= int(line[0][8:10]):
repetitions = 1
return repetitions
print(f'\n\n-------------BREAK-----------\n\n')
df1['count'] = df1.apply(func, axis = 1)
print(df1)
CodePudding user response:
def time_compare(df1,df2):
return [np.sum((df1['timestamp'].values[i-1] < df2['timestamp'].values) & (df1['timestamp'].values[i] > df2['timestamp'].values)) for i in range(len(df1.timestamp))]
df2.join(pd.Series(time_compare(df1,df2), name='Count'))
Weird I can't post the dataframe output as usual:
index | timestamp | other | Count |
---|---|---|---|
0 | 2022-01-07 22:08:5900:00 | other | 0 |
1 | 2022-01-07 23:08:5900:00 | other | 2 |
2 | 2022-01-09 17:04:0900:00 | other | 1 |
3 | 2022-03-05 17:04:0900:00 | other | 1 |
Please accept ✅ this answer if it solved your problem, it motivates me :)
Otherwise mention me (using @) in comment while telling me what's wrong ;)