I have 3 pandas dataframes:
import pandas as pd
df_a = pd.DataFrame(data={
    'id': [1, 5, 3, 2],
    'ts': [3, 5, 11, 14],
    'other_cols': ['...'] * 4
})
df_b = pd.DataFrame(data={
    'id': [2, 1, 3],
    'ts': [7, 8, 15],
    'other_cols': ['...'] * 3
})
df_c = pd.DataFrame(data={
    'id': [154, 237, 726, 814, 528, 237, 248, 514],
    'ts': [1, 2, 4, 6, 9, 10, 12, 13],
    'other_cols': ['...'] * 8
})
Here is the problem I need to solve.
- for every `id` in `df_a` find the corresponding `id` in `df_b` and their timestamps. Let's call them `ts_a` and `ts_b`.
- find all the rows in `df_c` between `min(ts_a, ts_b)` and `max(ts_a, ts_b)` and calculate some custom function on these rows. This function can be a pandas function (95% of the time) but it can be any Python function.
Here are the matching rows of `df_c` for each id, written as [id, ts]:
- id 1: [726, 4], [814, 6]
- id 2: [528, 9], [237, 10], [248, 12], [514, 13]
- id 3: [248, 12], [514, 13]
- id 5: can be found only in A, but not in B, so nothing should be done
The output format does not really matter, so anything that can map `id` to `f(rows for that id)` would do the job.
For example, let's assume that I need to apply a simple `len` function to the results. I will get the following:
id | res |
---|---|
1 | 2 |
2 | 4 |
3 | 2 |
If my function is `max(ts) - min(ts)`, the results are:
id | res |
---|---|
1 | 2 = 6 - 4 |
2 | 4 = 13 - 9 |
3 | 1 = 13 - 12 |
Here are the assumptions on the dataframes:
- `ids` in each corresponding table are unique
- each dataframe is sorted by `ts`
- there might exist an `id` in `df_a` which does not exist in `df_b` and vice versa (but the percentage of missing ids is less than 1%)
- tables A/B can be on the order of tens of millions of rows, table C on the order of hundreds of millions
- although theoretically there can be any number of rows between the timestamps, empirically the median is a two-digit number and the maximum is slightly more than a thousand
My working solutions
Attempt 1
- create a dictionary `id -> ts` from `df_b`. Linear in terms of the length of `df_b`
- create a sorted list of `ts, other_cols` from `df_c`. Linear in terms of `df_c` as it is already sorted by `ts`
- iterate over `df_a`, then for each id find the ts in the dictionary. Then do a binary search twice in the sorted list to find the edges of the data which should be analyzed. Then apply the function
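The steps of Attempt 1 can be sketched roughly as below. This is only an illustration under assumptions, not the asker's actual code: the function name `attempt_1` is made up, `numpy.searchsorted` stands in for the hand-written binary search, and both bounds are treated as inclusive.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({"id": [1, 5, 3, 2], "ts": [3, 5, 11, 14]})
df_b = pd.DataFrame({"id": [2, 1, 3], "ts": [7, 8, 15]})
df_c = pd.DataFrame({
    "id": [154, 237, 726, 814, 528, 237, 248, 514],
    "ts": [1, 2, 4, 6, 9, 10, 12, 13],
})

def attempt_1(df_a, df_b, df_c, func):
    """Hypothetical sketch: dict lookup for ts_b, binary search into df_c."""
    ts_by_id = dict(zip(df_b["id"], df_b["ts"]))  # id -> ts_b
    ts_c = df_c["ts"].to_numpy()                  # assumed sorted by ts
    results = {}
    for id_, ts_a in zip(df_a["id"], df_a["ts"]):
        ts_b = ts_by_id.get(id_)
        if ts_b is None:                          # id missing from df_b: skip
            continue
        lo, hi = min(ts_a, ts_b), max(ts_a, ts_b)
        start = np.searchsorted(ts_c, lo, side="left")   # first row with ts >= lo
        stop = np.searchsorted(ts_c, hi, side="right")   # one past last row with ts <= hi
        results[id_] = func(df_c.iloc[start:stop])
    return results

counts = attempt_1(df_a, df_b, df_c, len)
spans = attempt_1(df_a, df_b, df_c, lambda rows: rows["ts"].max() - rows["ts"].min())
```

On the sample data, `counts` and `spans` should match the two result tables above.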
Attempt 2
- combine all the dataframes into one and order by `ts`
df = pd.concat([df_a, df_b, df_c]).sort_values(by='ts').reset_index(drop=True)
- iterate over this dataframe with a sliding-window approach and maintain a dictionary `seen_ids` (`id -> index`) where you put the ids from tables A/B. If you see an id that is already in this dictionary, take `df.iloc[index_1:index_2]`, filter it to only the rows from C and apply the function
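A rough sketch of the single-pass idea in Attempt 2 follows. It rests on several assumptions that are not in the question: the function name `attempt_2` and the helper column `src` (used to tell C rows apart from A/B rows) are invented here, and rows of C whose `ts` ties with a boundary may or may not fall inside the window depending on sort order. The sample data has no such ties.

```python
import pandas as pd

df_a = pd.DataFrame({"id": [1, 5, 3, 2], "ts": [3, 5, 11, 14]})
df_b = pd.DataFrame({"id": [2, 1, 3], "ts": [7, 8, 15]})
df_c = pd.DataFrame({
    "id": [154, 237, 726, 814, 528, 237, 248, 514],
    "ts": [1, 2, 4, 6, 9, 10, 12, 13],
})

def attempt_2(df_a, df_b, df_c, func):
    """Hypothetical sketch: one ts-ordered pass with a dict of open ids."""
    df = (
        pd.concat([df_a.assign(src="a"), df_b.assign(src="b"), df_c.assign(src="c")])
        .sort_values(by="ts", kind="stable")
        .reset_index(drop=True)
    )
    is_c = (df["src"] == "c").to_numpy()
    seen_ids = {}  # id from A/B -> position of its first occurrence
    results = {}
    for pos, (id_, src) in enumerate(zip(df["id"], df["src"])):
        if src == "c":
            continue                      # C rows never open or close a window
        if id_ in seen_ids:
            start = seen_ids.pop(id_)
            window = df.iloc[start:pos + 1]
            # keep only the rows that came from C (they still carry the src column)
            results[id_] = func(window[is_c[start:pos + 1]])
        else:
            seen_ids[id_] = pos
    return results

counts = attempt_2(df_a, df_b, df_c, len)
```

Ids that appear in only one of A/B simply stay in `seen_ids` and never produce a result.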
Both attempts work correctly and run in loglinear time but for my data it takes ~20-30 mins to run, which is bearable but not ideal. On top of this there is an issue with additional memory requirement to store additional data.
My question to pandas gurus
Can this be achieved with pure pandas and be more efficient than my custom implementation?
This question is important to me, so I am more than happy to provide a 500 bounty for a solution which can beat my current solutions (in terms of speed/memory).
CodePudding user response:
Here is my latest attempt. I think it is pretty fast but of course the speed depends entirely on the contents of the tables you try it on. Let me know how it works for you.
import bisect
import functools
import numpy as np
import pandas as pd
def cartesian_product(*arrays):
    """https://stackoverflow.com/a/11146645/7059681"""
    la = len(arrays)
    dtype = np.result_type(*arrays)
    # one axis per input array, plus a trailing axis of length la
    arr = np.empty([len(a) for a in arrays] + [la], dtype=dtype)
    for i, a in enumerate(np.ix_(*arrays)):
        arr[..., i] = a
    return arr.reshape(-1, la).T
# inner join on id
df_ts_a = pd.merge(
    left=df_a[["id", "ts"]],
    right=df_b[["id", "ts"]],
    how="inner",
    on="id",
    suffixes=["_a", "_b"]
)
# a = min ts, b = max ts
df_ts_a["ts_a"], df_ts_a["ts_b"] = (
    df_ts_a[["ts_a", "ts_b"]].min(axis=1),
    df_ts_a[["ts_a", "ts_b"]].max(axis=1),
)
# ranges sorted by both start and end times
df_ts_a.sort_values(by=["ts_a"], inplace=True)
df_ts_b = df_ts_a.sort_values(by=["ts_b"])
# rename to avoid collisions
df_c.rename(columns={"id": "id_c", "ts": "ts_c"}, inplace=True)
ts_a = df_ts_a["ts_a"].to_numpy()
ts_b = df_ts_b["ts_b"].to_numpy()
ts_c = df_c["ts_c"].to_numpy()
# an array used to check for intersections
intersection_arr = np.full_like(a=ts_a, fill_value=False, dtype=bool)
df_c_idxs_list, df_ts_idxs_list = [], []
# the first item in ts_c that is at least equal to ts_a[0]
c_lo = c_idx = bisect.bisect_left(a=ts_c, x=ts_a[0])
c_hi = len(ts_c)
while c_lo < c_hi and ts_c[c_lo] <= ts_b[-1]:
    # the idx before which all intervals start at or before ts_c[c_lo]
    a_idx = bisect.bisect_right(a=ts_a, x=ts_c[c_lo])
    # the idx from which all intervals end at or after ts_c[c_lo]
    b_idx = bisect.bisect_left(a=ts_b, x=ts_c[c_lo])
    # the index of the next greatest ts in ts_c
    c_idx = bisect.bisect_right(a=ts_c, x=ts_c[c_lo])
    # indices of all intervals in ts_a starting at or before ts_c[c_lo]
    unique_a_idxs = df_ts_a.iloc[:a_idx].index
    # indices of all intervals in ts_b ending at or after ts_c[c_lo]
    unique_b_idxs = df_ts_b.iloc[b_idx:].index
    # find the intersection of these intervals
    # method4: https://stackoverflow.com/q/42989384/7059681
    intersection_arr[unique_b_idxs] = True
    unique_ts_idxs = unique_a_idxs[intersection_arr[unique_a_idxs]]
    intersection_arr[unique_b_idxs] = False
    # all the indices of rows whose ts equals ts_c[c_lo]
    unique_c_idxs = df_c.iloc[c_lo: c_idx].index
    # all the pairs of these indices
    c_idxs, ts_idxs = cartesian_product(unique_c_idxs, unique_ts_idxs)
    df_c_idxs_list.append(c_idxs)
    df_ts_idxs_list.append(ts_idxs)
    c_lo = c_idx
df_c_idxs = np.concatenate(df_c_idxs_list)
df_ts_idxs = np.concatenate(df_ts_idxs_list)
df_c_labeled = pd.concat(
    [
        df_ts_a.loc[df_ts_idxs, :].reset_index(drop=True),
        df_c.loc[df_c_idxs, :].reset_index(drop=True)
    ],
    axis=1
)
print(df_c_labeled)
id ts_a ts_b id_c ts_c other_cols
0 1 3 8 726 4 ...
1 1 3 8 814 6 ...
2 2 7 14 528 9 ...
3 2 7 14 237 10 ...
4 3 11 15 248 12 ...
5 2 7 14 248 12 ...
6 3 11 15 514 13 ...
7 2 7 14 514 13 ...
Now we can just do some `groupby` stuff:
id_groupby = df_c_labeled.groupby(by="id")
id_groupby["ts_c"].size()
id
1 2
2 4
3 2
Name: ts_c, dtype: int64
id_groupby["ts_c"].max() - id_groupby["ts_c"].min()
id
1 2
2 4
3 1
Name: ts_c, dtype: int64
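Since the question allows for arbitrary Python functions, here is a small sketch of how one could be passed to the same `groupby` through `agg`. The frame below is reconstructed by hand from the printed `df_c_labeled` (with `other_cols` left out), and `ts_span` is a made-up example function.

```python
import pandas as pd

# Reconstructed from the printed df_c_labeled; other_cols omitted.
df_c_labeled = pd.DataFrame({
    "id":   [1, 1, 2, 2, 3, 2, 3, 2],
    "id_c": [726, 814, 528, 237, 248, 248, 514, 514],
    "ts_c": [4, 6, 9, 10, 12, 12, 13, 13],
})

def ts_span(ts):
    """Hypothetical custom function: timestamp range within one group."""
    return ts.max() - ts.min()

# agg calls ts_span once per id with that id's ts_c values as a Series
res = df_c_labeled.groupby(by="id")["ts_c"].agg(ts_span)
```

A plain Python callable like this is generally slower than the built-in reductions, so it is probably best kept for the few cases that pandas does not cover.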
CodePudding user response:
I agree with @QuangHong: this may not be efficient for data this large.
However, I tried your sample input using pandas.
Merge `df_a` and `df_b` on the `id` column. I used an `inner` join, as we need the rows which are present in both:
df_merge_a_b = df_a.merge(df_b, on=['id'], how='inner')
Find the minimum and maximum of the corresponding rows
df_merge_a_b["min_ab"] = df_merge_a_b[["ts_x", "ts_y"]].min(axis=1)
df_merge_a_b["max_ab"] = df_merge_a_b[["ts_x", "ts_y"]].max(axis=1)
With the min and max in place, for each row in the merged dataframe, find the rows of `df_c` whose `ts` lies between min and max
def get_matching_rows(row):
    min_ab = row["min_ab"]
    max_ab = row["max_ab"]
    result = df_c[df_c["ts"].between(min_ab, max_ab)]
    print(result)
    ## apply custom function on result and return
df_merge_a_b.apply(lambda x: get_matching_rows(x), axis=1)
Sample output
id ts other_cols
2 726 4 ...
3 814 6 ...
id ts other_cols
6 248 12 ...
7 514 13 ...
id ts other_cols
4 528 9 ...
5 237 10 ...
6 248 12 ...
7 514 13 ...
Then apply the custom function and concatenate all the outputs.
This may not be very efficient, but I wanted to try the sample in pandas.
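As a sketch of that last step, the row function can return the value of the custom function instead of printing, so that `apply` yields one result per id. The name `count_matching_rows` is made up here and `len` is only a stand-in for the real function.

```python
import pandas as pd

df_a = pd.DataFrame({"id": [1, 5, 3, 2], "ts": [3, 5, 11, 14]})
df_b = pd.DataFrame({"id": [2, 1, 3], "ts": [7, 8, 15]})
df_c = pd.DataFrame({
    "id": [154, 237, 726, 814, 528, 237, 248, 514],
    "ts": [1, 2, 4, 6, 9, 10, 12, 13],
})

df_merge_a_b = df_a.merge(df_b, on=["id"], how="inner")
df_merge_a_b["min_ab"] = df_merge_a_b[["ts_x", "ts_y"]].min(axis=1)
df_merge_a_b["max_ab"] = df_merge_a_b[["ts_x", "ts_y"]].max(axis=1)

def count_matching_rows(row):
    """Hypothetical variant of get_matching_rows that returns a value."""
    result = df_c[df_c["ts"].between(row["min_ab"], row["max_ab"])]
    return len(result)  # stand-in for the custom function

df_merge_a_b["res"] = df_merge_a_b.apply(count_matching_rows, axis=1)
res = df_merge_a_b.set_index("id")["res"]
```

Each call scans all of `df_c`, so this stays quadratic and is only reasonable for small inputs.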
CodePudding user response:
# Set some indices, note how df_c is different.
df_a = df_a.set_index('id')
df_b = df_b.set_index('id')
# Looks like maybe your `ts` is already sorted? If so, `sort_index()` isn't necessary~
df_c = df_c.set_index('ts').sort_index()
# concat them together, then get the min and max from each ts.
df = pd.concat([df_a, df_b])
# Producing the min/max this way should be fast.
# sort=False is optional for performance and means your output will be jumbled like shown below~
df = df.groupby(level=-1, sort=False)['ts'].agg(['min', 'max'])
# Making this work with `raw=True` should improve performance.
# Being able to use `loc` should help.
out = df.apply(lambda x: df_c.loc[x[0]:x[1], 'id'].to_dict(), axis=1, raw=True)
print(out)
Output:
id
1 {4: 726, 6: 814}
5 {}
3 {12: 248, 13: 514}
2 {9: 528, 10: 237, 12: 248, 13: 514}
dtype: object
I don't have a ton of faith in this method, but I'd love to know how it turns out~
After setting and sorting (where necessary) indices, the one-liner would be:
# Only concating `ts` will be faster, no need to drag everything along.
out = (
    pd.concat([df_a[['ts']], df_b[['ts']]])
    .groupby(level=-1, sort=False)['ts']
    .agg(['min', 'max'])
    .apply(lambda x: df_c.loc[x[0]:x[1], 'id'].to_dict(), axis=1, raw=True)
    # See this alternative if only ts are needed:
    #.apply(lambda x: set(df_c.loc[x[0]:x[1], 'id'].index), axis=1, raw=True)
)
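Note that in the output above id 5, which exists only in `df_a`, still gets an entry (empty here only because no row of `df_c` has `ts == 5`). A possible variant, sketched under the assumption that only row counts are needed, filters to ids present in both tables and replaces the per-row `apply` with two vectorized `numpy.searchsorted` calls. The variable names are illustrative.

```python
import numpy as np
import pandas as pd

df_a = pd.DataFrame({"id": [1, 5, 3, 2], "ts": [3, 5, 11, 14]})
df_b = pd.DataFrame({"id": [2, 1, 3], "ts": [7, 8, 15]})
df_c = pd.DataFrame({
    "id": [154, 237, 726, 814, 528, 237, 248, 514],
    "ts": [1, 2, 4, 6, 9, 10, 12, 13],
})

# min/max ts per id, plus how many tables the id appeared in
bounds = (
    pd.concat([df_a.set_index("id")["ts"], df_b.set_index("id")["ts"]])
    .groupby(level=0)
    .agg(["min", "max", "count"])
)
# ids are unique per table, so count == 2 means "present in both A and B"
bounds = bounds[bounds["count"] == 2]

ts_c = df_c["ts"].to_numpy()  # assumed sorted by ts
start = np.searchsorted(ts_c, bounds["min"].to_numpy(), side="left")
stop = np.searchsorted(ts_c, bounds["max"].to_numpy(), side="right")

# number of df_c rows with min <= ts <= max, per id
counts = pd.Series(stop - start, index=bounds.index)
```

The `start`/`stop` positions could also be used to slice `df_c.iloc[start[i]:stop[i]]` for functions other than a count.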
CodePudding user response:
To add one possible optimisation to the existing answers: if there are duplicate `(min, max)` combinations, then you could perform the lookup/calculation in `df_c` only for the unique `(min, max)` values (or alternatively implement caching).
This could be a substantial reduction in computation if the timestamps are at a fairly low resolution (e.g. days), but it is probably not of much use if the timestamps are at high resolution (e.g. picoseconds). Of course, if you want fast approximate answers, you could always round the timestamps to a tolerable margin of error.
In practice, this would look as follows:
from pandas import DataFrame, merge
df_a = DataFrame(
    data={"id": [1, 5, 3, 2], "ts": [3, 5, 11, 14], "other_cols": ["..."] * 4}
)
df_b = DataFrame(data={"id": [2, 1, 3], "ts": [7, 8, 15], "other_cols": ["..."] * 3})
df_c = DataFrame(
    data={
        "id": [154, 237, 726, 814, 528, 237, 248, 514],
        "ts": [1, 2, 4, 6, 9, 10, 12, 13],
        "other_cols": ["..."] * 8,
    }
)
# indexing and min/max are adapted from the answers by @srinath, @ringo and @BeRT2me
df_a = df_a.set_index("id")["ts"] # keep only info of interest
df_b = df_b.set_index("id")["ts"] # keep only info of interest
df = merge(df_a, df_b, how="inner", left_index=True, right_index=True)
df["min"] = df[["ts_x", "ts_y"]].min(axis=1)
df["max"] = df[["ts_x", "ts_y"]].max(axis=1)
df = df[["min", "max"]]
# find unique min-max combinations (drop index to avoid confusion)
unique = df.drop_duplicates().reset_index(drop=True)
# proceed to actual calculations (below is just an example)
# make sure df_c is indexed by ts so we can lookup
df_c = df_c.set_index("ts").sort_index()
# if computation is costly this can be done in parallel, but
# AFAIK this would require using another library, e.g. dask
for tmin, tmax in unique.values:
    sub = df_c.loc[tmin:tmax]
    print(tmin, tmax, len(sub))
# 3 8 2
# 11 15 2
# 7 14 4
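To get back from the unique `(min, max)` pairs to one result per id, the per-pair results can be merged onto the id table. The sketch below is one way this might look. The names `pairs`, `unique` and `res` are illustrative, and `len` again stands in for the real function.

```python
import pandas as pd

df_a = pd.DataFrame({"id": [1, 5, 3, 2], "ts": [3, 5, 11, 14]})
df_b = pd.DataFrame({"id": [2, 1, 3], "ts": [7, 8, 15]})
df_c = pd.DataFrame({
    "id": [154, 237, 726, 814, 528, 237, 248, 514],
    "ts": [1, 2, 4, 6, 9, 10, 12, 13],
})

# one row per id present in both tables, with its (min, max) window
pairs = df_a[["id", "ts"]].merge(df_b[["id", "ts"]], on="id", suffixes=("_a", "_b"))
pairs["min"] = pairs[["ts_a", "ts_b"]].min(axis=1)
pairs["max"] = pairs[["ts_a", "ts_b"]].max(axis=1)

# evaluate the function once per distinct window
unique = pairs[["min", "max"]].drop_duplicates().reset_index(drop=True)
c = df_c.set_index("ts").sort_index()
unique["res"] = [len(c.loc[lo:hi]) for lo, hi in zip(unique["min"], unique["max"])]

# broadcast the per-window result back to every id sharing that window
res = pairs.merge(unique, on=["min", "max"], how="left").set_index("id")["res"]
```

With the sample data every window is distinct, so nothing is saved here. The benefit only shows up when many ids share the same `(min, max)` pair.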