Home > Software engineering >  count number of rows in one df depending on column in another df
count number of rows in one df depending on column in another df

Time:06-02

Ok so I have a first dataframe df1:

|timestamp                |ip         |
|2022-01-06 11:58:53 00:00|1.1.1.5.   |
|2022-01-08 03:56:35 00:00|10.10.10.24|
|2022-01-09 22:29:30 00:00|3.3.3.89.  |
|2022-03-08 22:37:52 00:00|8.8.8.88.  |

And a second dataframe, df2:

|timestamp                |other|
|2022-01-07 22:08:59 00:00|other|
|2022-01-07 23:08:59 00:00|other|
|2022-01-09 17:04:09 00:00|other|
|2022-03-05 17:04:09 00:00|other|

And I would like to count how many rows there is in df2 depending on the 2 consecutive timestamps in df1, meaning:

|timestamp                |ip         |count|
|2022-01-06 11:58:53 00:00|1.1.1.5    |NaN  |
|2022-01-08 03:56:35 00:00|10.10.10.24|2    |
|2022-01-09 22:29:30 00:00|3.3.3.89   |1    |
|2022-03-08 22:37:52 00:00|8.8.8.88   |1    |

What I tried is to first create another column in df1 with the previous timestamp with this:

df1 = df1.assign(timestamp_b4=df1.timestamp.shift(1)).fillna({'timestamp_b4': df1.timestamp})

which gives me:

|timestamp                |ip         |timestamp_b4             |
|2022-01-06 11:58:53 00:00|1.1.1.5    |2022-03-08 22:37:52 00:00|
|2022-01-08 03:56:35 00:00|10.10.10.24|2022-01-06 11:58:53 00:00|
|2022-01-09 22:29:30 00:00|3.3.3.89   |2022-01-08 03:56:35 00:00|
|2022-03-08 22:37:52 00:00|8.8.8.88   |2022-01-09 22:29:30 00:00|

and then do some sort of

s = (df2[df2['timestamp'].between(df1['timestamp'], df1['timestamp_b4'])].size())

But unfortunately it does not work since pandas requires comparing identically-labeled objects.

Is there a good pandas/pythonic way to do that?

Thanks

CodePudding user response:

Here is one approach:

df1.merge(df2, on='timestamp', how='outer').sort_values('timestamp') \
    .assign(c1=df1.loc[~df1['ip'].isna()]['ip'], c2=lambda x: x['c1'].bfill() ) \
    .assign(count=lambda x: x.groupby('c2').apply('count').reset_index(drop=True)['timestamp']-1) \
    .drop(['other','c1','c2'], axis=1).dropna().astype({'count': 'int32'})

                   timestamp           ip  count
0  2022-01-06 11:58:53 00:00  1.1.1.5.         0
1  2022-01-08 03:56:35 00:00  10.10.10.24      2
2  2022-01-09 22:29:30 00:00  3.3.3.89.        1
3  2022-03-08 22:37:52 00:00  8.8.8.88.        1

This approach merges then sorts by timestamp then creates another column - c2 - that is used to copy the df1 timestamp and then backfill it against the df2 timestamps. From there the instances are grouped by df1 timestamp (reflected in the c2 column) and counted. In other words, the backfill of the df1 timestamp allows it to be used as a grouping key for counting the preceding df2 timestamps. After that the df is trimmed back down to match the output requirements.

CodePudding user response:

Try this, it's an example of what you can do to find the solution

import pandas as pd
table1 = {
    'timestamp':['2022-01-06 11:58:53 00:00','2022-01-08 03:56:35 00:00',
                 '2022-01-09 22:29:30 00:00','2022-03-08 22:37:52 00:00'],
    'other':['other','other','other','other']
              }
df1 = pd.DataFrame(table1)

table2 = {
    'timestamp':['2022-01-07 23:08:59 00:00','2022-01-07 22:08:59 00:00',
                 '2022-03-05 17:04:09 00:00','2022-01-09 17:04:09 00:00'],
    'ip':['1.1.1.5.','10.10.10.24','3.3.3.89.','8.8.8.88.']
    
              }

df2 = pd.DataFrame(table2)

print(f'\n\n-------------df1-----------\n\n')
print(df2)
print(f'\n\n-------------df2-----------\n\n')
print(df1)

listdf1 = df1['timestamp'].values.tolist()
def func(line):
    cont = df1.loc[df1['timestamp'].str.contains(line[0][:7], case = False)]
    temp = line.name - 1
    if temp == -1:
        temp = 0

    try :
        cont = [cont['timestamp'].iloc[temp],line[0]]
    except:
        cont = [line[0],line[0]]

    cont2 = df2['timestamp'].loc[df2['timestamp'].str.contains(line[0][:7], case = False)]
    
    repetitions = 0
    for x in cont2:

        if int(x[8:10]) >= int(cont[0][8:10]) and int(x[8:10]) <= int(cont[1][8:10]) and int(x[8:10]) <= int(line[0][8:10]):
            repetitions  = 1
    return repetitions
    

print(f'\n\n-------------BREAK-----------\n\n')

df1['count'] = df1.apply(func, axis = 1)

print(df1)

CodePudding user response:

def time_compare(df1,df2):
  return [np.sum((df1['timestamp'].values[i-1] < df2['timestamp'].values) & (df1['timestamp'].values[i] > df2['timestamp'].values)) for i in range(len(df1.timestamp))]

df2.join(pd.Series(time_compare(df1,df2), name='Count'))

Weird I can't post the dataframe output as usual:

index timestamp other Count
0 2022-01-07 22:08:5900:00 other 0
1 2022-01-07 23:08:5900:00 other 2
2 2022-01-09 17:04:0900:00 other 1
3 2022-03-05 17:04:0900:00 other 1

Please accept ✅ this answer if it solved your problem, it motivates me :)

Otherwise mention me (using @) in comment while telling me what's wrong ;)

  • Related