I have 3 pandas dataframes:

import pandas as pd

df_a = pd.DataFrame(data={
    'id': [1, 5, 3, 2],
    'ts': [3, 5, 11, 14],
    'other_cols': ['...'] * 4
})
df_b = pd.DataFrame(data={
    'id': [2, 1, 3],
    'ts': [7, 8, 15],
    'other_cols': ['...'] * 3
})
df_c = pd.DataFrame(data={
    'id': [154, 237, 726, 814, 528, 237, 248, 514],
    'ts': [1, 2, 4, 6, 9, 10, 12, 13],
    'other_cols': ['...'] * 8
})
Here is the problem I need to solve:

- For every id in df_a, find the corresponding id in df_b and their timestamps. Let's call them ts_a and ts_b.
- Find all the rows in df_c between min(ts_a, ts_b) and max(ts_a, ts_b) and calculate some custom function on these rows. This function can be a pandas function (95% of the time), but it can be any Python function.
Here are the rows of df_c (as [id, ts] pairs) that match each id:

- id 1: [726, 4], [814, 6]
- id 2: [528, 9], [237, 10], [248, 12], [514, 13]
- id 3: [248, 12], [514, 13]
- id 5: can be found only in df_a, but not in df_b, so nothing should be done
The output format does not really matter, so anything that maps an id to f(rows for that id) would do the job.

For example, if I apply a simple len function to the results, I get:
| id | res |
|---|---|
| 1 | 2 |
| 2 | 4 |
| 3 | 2 |
If my function is max(ts) - min(ts), the results are:
| id | res |
|---|---|
| 1 | 2 = 6 - 4 |
| 2 | 4 = 13 - 9 |
| 3 | 1 = 13 - 12 |
Here are the assumptions on the dataframes:

- ids within each table are unique
- each dataframe is sorted by ts
- there might exist an id in df_a which does not exist in df_b, and vice versa
- tables A/B can be on the order of tens of millions of rows; table C is on the order of hundreds of millions
- although theoretically there can be any number of rows between two timestamps, empirically the median count is a two-digit number and the maximum is slightly more than a thousand
My working solutions

Attempt 1

- create a dictionary id -> ts from df_b; linear in the length of df_b
- create a sorted list of (ts, other_cols) from df_c; linear in the length of df_c, as it is already sorted by ts
- iterate over df_a; for each id, look up its ts in the dictionary, then do two binary searches in the sorted list to find the edges of the data that should be analyzed, and apply the function (a sketch is shown below)
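A simplified sketch of this approach (f stands in for the custom function, e.g. len):

import numpy as np

def attempt_1(df_a, df_b, df_c, f):
    # id -> ts lookup built from df_b
    ts_b_by_id = dict(zip(df_b['id'], df_b['ts']))
    # df_c is already sorted by ts, so its ts column is the sorted list
    ts_c = df_c['ts'].to_numpy()
    results = {}
    for id_a, ts_a in zip(df_a['id'], df_a['ts']):
        ts_b = ts_b_by_id.get(id_a)
        if ts_b is None:  # id exists only in df_a: nothing to do
            continue
        lo, hi = min(ts_a, ts_b), max(ts_a, ts_b)
        # two binary searches give the edges of the relevant slice of df_c
        left = np.searchsorted(ts_c, lo, side='left')
        right = np.searchsorted(ts_c, hi, side='right')
        results[id_a] = f(df_c.iloc[left:right])
    return results

# attempt_1(df_a, df_b, df_c, len) -> {1: 2, 2: 4, 3: 2}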
Attempt 2

- combine all the dataframes into one and order by ts: df = pd.concat([df_a, df_b, df_c]).sort_values(by='ts').reset_index(drop=True)
- iterate over this dataframe with a sliding-window approach and maintain a dictionary seen_ids (id -> index) holding the ids from tables A/B. When you see an id that is already in this dictionary, take df.iloc[index_1:index_2], filter it down to the rows that came from C, and apply the function (sketched below)
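A comparable sketch of Attempt 2, with an added src column (purely illustrative) to tell the source tables apart and f again standing in for the custom function:

f = len  # stand-in for the custom function

df = pd.concat([
    df_a.assign(src='a'),
    df_b.assign(src='b'),
    df_c.assign(src='c'),
]).sort_values(by='ts').reset_index(drop=True)

results = {}
seen_ids = {}  # id -> position of its first occurrence from A or B
for i, (row_id, src) in enumerate(zip(df['id'], df['src'])):
    if src == 'c':
        continue
    if row_id in seen_ids:
        # second occurrence of this id: rows strictly between the two timestamps
        window = df.iloc[seen_ids[row_id] + 1:i]
        results[row_id] = f(window[window['src'] == 'c'])
    else:
        seen_ids[row_id] = i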
Both attempts work correctly and run in log-linear time, but on my data they take ~20-30 minutes, which is bearable but not ideal. On top of this, both need additional memory to hold the auxiliary data structures.
My question to pandas gurus
Can this be achieved with pure pandas and be more efficient than my custom implementation?
This question is important to me, so I am more than happy to provide a 500-point bounty for a solution which can beat my current solutions (in terms of speed/memory).
CodePudding user response:
Credit to this answer. I'll post some times in a bit, but I'm confident this will give you the speed you are looking for.
# inner join on id
df_ts = pd.merge(
    left=df_a[["id", "ts"]],
    right=df_b[["id", "ts"]],
    how="inner",
    on="id",
    suffixes=["_a", "_b"]
)
# a = min ts, b = max ts
df_ts["ts_a"], df_ts["ts_b"] = (
    df_ts[["ts_a", "ts_b"]].min(axis=1),
    df_ts[["ts_a", "ts_b"]].max(axis=1),
)
# rename to avoid collisions
df_c.rename(columns={"id": "id_c", "ts": "ts_c"}, inplace=True)
# time for some magic
import numpy as np
ts_c = df_c["ts_c"].to_numpy()
ts_a = df_ts["ts_a"].to_numpy()
ts_b = df_ts["ts_b"].to_numpy()
# for every row of df_c, find which [ts_a, ts_b] intervals it falls into
df_c_idxs, df_ts_idxs = np.where(
    (ts_c[:, None] >= ts_a) & (ts_c[:, None] <= ts_b)
)
# pair each matching df_c row with its (id, ts_a, ts_b) row
df_c_labeled = pd.concat([
    df_ts.loc[df_ts_idxs, :].reset_index(drop=True),
    df_c.loc[df_c_idxs, :].reset_index(drop=True)
], axis=1)
print(df_c_labeled)
id ts_a ts_b id_c ts_c other_cols
0 1 3 8 726 4 ...
1 1 3 8 814 6 ...
2 2 7 14 528 9 ...
3 2 7 14 237 10 ...
4 3 11 15 248 12 ...
5 2 7 14 248 12 ...
6 3 11 15 514 13 ...
7 2 7 14 514 13 ...
Now we can just do some groupby stuff:
id_groupby = df_c_labeled.groupby(by="id")
id_groupby["ts_c"].size()
id
1 2
2 4
3 2
Name: ts_c, dtype: int64
id_groupby["ts_c"].max() - id_groupby["ts_c"].min()
id
1 2
2 4
3 1
Name: ts_c, dtype: int64
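If the custom function is an arbitrary Python function rather than a built-in aggregation, the same grouped frame can be passed through apply — a small sketch, with my_func as a hypothetical stand-in for that function:

def my_func(group):
    # receives the df_c rows matched to one id
    return group["ts_c"].max() - group["ts_c"].min()

id_groupby.apply(my_func)  # same result as the max() - min() example above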
CodePudding user response:
I agree with @QuangHong, it may not be efficient for data this large. However, I tried your sample input using pandas.

Merge df_a and df_b on the id column. I did an inner join, as we only need the rows that are present in both:
df_merge_a_b = df_a.merge(df_b, on=['id'], how='inner')
Find the minimum and maximum of the two timestamps for each merged row:
df_merge_a_b["min_ab"] = df_merge_a_b[["ts_x", "ts_y"]].min(axis=1)
df_merge_a_b["max_ab"] = df_merge_a_b[["ts_x", "ts_y"]].max(axis=1)
With the min and max in place, for each row of the merged dataframe, find the rows of df_c whose ts lies between them:
def get_matching_rows(row):
    min_ab = row["min_ab"]
    max_ab = row["max_ab"]
    result = df_c[df_c["ts"].between(min_ab, max_ab)]
    print(result)
    ## apply custom function on result and return

df_merge_a_b.apply(lambda x: get_matching_rows(x), axis=1)
Sample output
id ts other_cols
2 726 4 ...
3 814 6 ...
id ts other_cols
6 248 12 ...
7 514 13 ...
id ts other_cols
4 528 9 ...
5 237 10 ...
6 248 12 ...
7 514 13 ...
Apply the custom function on each of these results and concat all the outputs together.
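For example, returning a value instead of printing lets you collect everything in one pass — a sketch, with len standing in for the custom function:

def get_result(row):
    rows_c = df_c[df_c["ts"].between(row["min_ab"], row["max_ab"])]
    return len(rows_c)  # replace len with the custom function

df_merge_a_b["res"] = df_merge_a_b.apply(get_result, axis=1)
df_merge_a_b[["id", "res"]]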
It may not be super efficient, but I wanted to try the sample in pandas.