I would like to create a DataFrame with an integer "index" from a number of (sparse) Series, where the index (or primary key) is NOT necessarily a run of consecutive integers. Each Series is like a vector of (index, value) tuples, or an {index: value} mapping.
(1) A small example
In Pandas this is very easy, as we can create the whole DataFrame in one go:
>>> pd.DataFrame({
...     "A": {0: 'a', 20: 'b', 40: 'c'},
...     "B": {10: 'd', 20: 'e', 30: 'f'},
...     "C": {20: 'g', 30: 'h'},
... }).sort_index()
A B C
0 a NaN NaN
10 NaN d NaN
20 b e g
30 NaN f h
40 c NaN NaN
but I can't find an easy way to achieve a similar result with Polars. As described in Coming from Pandas, Polars does not use an index, unlike Pandas; each row is indexed by its integer position in the table. So I might need to represent an "indexed" Series as a 2-column DataFrame:
A = pl.DataFrame({ "index": [0, 20, 40], "A": ['a', 'b', 'c'] })
B = pl.DataFrame({ "index": [10, 20, 30], "B": ['d', 'e', 'f'] })
C = pl.DataFrame({ "index": [20, 30], "C": ['g', 'h'] })
I tried to combine these multiple DataFrames by joining on the index column:
>>> A.join(B, on='index', how='outer').join(C, on='index', how='outer').sort(by='index')
shape: (5, 4)
┌───────┬──────┬──────┬──────┐
│ index ┆ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str │
╞═══════╪══════╪══════╪══════╡
│ 0 ┆ a ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10 ┆ null ┆ d ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 20 ┆ b ┆ e ┆ g │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 30 ┆ null ┆ f ┆ h │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 40 ┆ c ┆ null ┆ null │
└───────┴──────┴──────┴──────┘
This gives the result I want, but I wonder:
- (i) whether there is a more concise way to do this over many columns, and
- (ii) how to make this operation as efficient as possible.
Alternatives?
I used outer joins above because they are one way to combine DataFrames with different numbers of columns and rows. Another alternative I tried is diagonal concatenation, but this does not deduplicate or join on index:
>>> pl.concat([A, B, C], how='diagonal')
index A B C
0 0 a None None
1 20 b None None
2 40 c None None
3 10 None d None
4 20 None e None
5 30 None f None
6 20 None None g
7 30 None None h
(2) Efficiently Building a Large Table
The approach above gives the result I want, but I feel there must be a better way in terms of performance. Consider a case with many larger tables, say 300,000 rows and 20 columns:
import functools

import numpy as np
import pandas as pd
import polars as pl

N, C = 300000, 20
pls = []
pds = []
for i in range(C):
    A = pl.DataFrame({
        "index": np.linspace(i, N*3-i, num=N, dtype=np.int32),
        f"A{i}": np.arange(N, dtype=np.float32),
    })
    pls.append(A)
    B = A.to_pandas().set_index("index")
    pds.append(B)
The approach of joining two tables at a time is slower than I expected:
%%time
F = functools.reduce(lambda a, b: a.join(b, on='index', how='outer'), pls)
F.sort(by='index')
CPU times: user 1.56 s, sys: 157 ms, total: 1.72 s
Wall time: 741 ms
or than one-pass creation with pd.DataFrame:
%%time
pd.DataFrame({
    f"A{i}": pds[i][f'A{i}'] for i in range(C)
}).sort_index()
CPU times: user 279 ms, sys: 88.8 ms, total: 368 ms
Wall time: 369 ms
CodePudding user response:
Following your example, but additionally informing Polars that the "index" column is sorted (Polars will use fast paths if the data is sorted), you can use align_frames together with functools.reduce to get what you want.
This is your data creation snippet:
import functools

import numpy as np
import polars as pl

N, C = 300000, 20
pls = []
pds = []
for i in range(C):
    A = pl.DataFrame({
        "index": np.linspace(i, N*3-i, num=N, dtype=np.int32),
        f"A{i}": np.arange(N, dtype=np.float32),
    }).with_column(pl.col("index").set_sorted())
    pls.append(A)
    B = A.to_pandas().set_index("index")
    pds.append(B)
Creating the frame aligned by index: we need functools.reduce because align_frames returns a list of new DataFrame objects that are aligned by "index".
frames = pl.align_frames(*pls, on="index")
functools.reduce(lambda a, b: a.with_columns(b.get_columns()), frames)
Performance
The performance is better than the pandas sort_index method.
Pandas
>>> %%timeit
>>> pd.DataFrame({
...     f"A{i}": pds[i][f'A{i}'] for i in range(C)
... }).sort_index()
389 ms ± 8.96 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Polars
>>> %%timeit
>>> frames = pl.align_frames(*pls, on="index")
>>> functools.reduce(lambda a, b: a.with_columns(b.get_columns()), frames)
348 ms ± 11.9 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
CodePudding user response:
Quick solution
I've tried a couple of approaches, but I think pl.from_pandas is faster than any native Polars solution I could find.
dt = pl.from_pandas(
    pd.DataFrame({
        "A": {0: 'a', 20: 'b', 40: 'c'},
        "B": {10: 'd', 20: 'e', 30: 'f'},
        "C": {20: 'g', 30: 'h'},
    }).sort_index().reset_index()
)
shape: (5, 4)
┌───────┬──────┬──────┬──────┐
│ index ┆ A ┆ B ┆ C │
│ --- ┆ --- ┆ --- ┆ --- │
│ i64 ┆ str ┆ str ┆ str │
╞═══════╪══════╪══════╪══════╡
│ 0 ┆ a ┆ null ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 10 ┆ null ┆ d ┆ null │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 20 ┆ b ┆ e ┆ g │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 30 ┆ null ┆ f ┆ h │
├╌╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┼╌╌╌╌╌╌┤
│ 40 ┆ c ┆ null ┆ null │
└───────┴──────┴──────┴──────┘
UPDATE
Finally found a faster solution using LazyFrame joins. You can shave off more time if you happen to know all the values of the index ahead of time, instead of constructing the index with a unique() over all frames.
# Convert tables to lazy tables
pls = [dt.lazy() for dt in pls]
%%time
out = (
    pl.concat([dt.melt('index') for dt in pls])
    .select(pl.col('index').unique())
)
for i in range(20):
    out = out.join(pls[i], on='index', how='left')
out.collect()
CPU times: total: 766 ms
Wall time: 277 ms