I have some text data:
data1
id | comment | title |
---|---|---|
user_A | good | a file name |
user_B | a better way is… | is there some good sugg? |
user_C | a another way is… | is there some good sugg? |
user_C | I have been using Pandas for a long time, so I… | a book |
You can use pd.read_clipboard() to replicate it.
data2
userid | title |
---|---|
user_X | is there some good sugg? |
user_Y | a great idea… |
user_Z | a file name |
user_W | a book |
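
For a reproducible setup, the two input frames can also be built directly (a minimal sketch matching the tables above):

import pandas as pd

# sample data1: comments with a post title
data1 = pd.DataFrame({
    'id': ['user_A', 'user_B', 'user_C', 'user_C'],
    'comment': ['good', 'a better way is…', 'a another way is…',
                'I have been using Pandas for a long time, so I…'],
    'title': ['a file name', 'is there some good sugg?',
              'is there some good sugg?', 'a book'],
})

# sample data2: post authors keyed by the same title
data2 = pd.DataFrame({
    'userid': ['user_X', 'user_Y', 'user_Z', 'user_W'],
    'title': ['is there some good sugg?', 'a great idea…',
              'a file name', 'a book'],
})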
desired output
uid | comment | title | uid |
---|---|---|---|
user_A | good | a file name | user_Z |
user_B | a better way is… | is there some good sugg? | user_X |
user_C | a another way is… | is there some good sugg? | user_X |
user_C | I have been using Pandas for a long time, so I… | a book | user_W |
An easy way is to merge on title. In pandas:
dataall = pd.merge(
    data1, data2,
    on='title',
    how='left'
)
But it's memory-expensive.
The shape of data1 is (2942087, 7) (and at times the row count can be more than 3 times that), and the shape of data2 is (47516640, 4).
My machine has 32 GB of RAM, but it's not enough.
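
As a rough check of why the merge blows up, the expected number of output rows of the left join can be estimated before merging (a sketch, assuming the join key is the title column as above):

# how many rows of data2 share each title
counts2 = data2['title'].value_counts()

# each data1 row produces one output row per matching data2 row, or 1 row if there is no match
estimated_rows = data1['title'].map(counts2).fillna(1).sum()
print(estimated_rows)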
I also tried polars. In polars:
dataall = data1.join(
    data2,
    on='title',
    how='left'
)
An error occurs:
Canceled future for execute_request message before replies were done
I have also tried the is_in function in polars and encoding the text to numbers; both are fast, but I don't know how to turn them into a full solution.
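
(For context, the encoding idea could look roughly like the following sketch; it assumes both frames have a title column, and it only cheapens the key comparison, it does not shrink the output.)

import pandas as pd

# factorize the titles over the union of both frames so the codes are consistent
all_titles = pd.concat([data1['title'], data2['title']], ignore_index=True)
codes, uniques = pd.factorize(all_titles)

data1 = data1.assign(title_code=codes[:len(data1)])
data2 = data2.assign(title_code=codes[len(data1):])

# merge on the integer codes instead of the long strings
dataall = pd.merge(
    data1,
    data2.drop(columns='title'),
    on='title_code',
    how='left'
)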
Is there an efficient and feasible way to do this with pandas/polars/numpy?
-----edit 2022-5-24 16:00:10-----
After the suggestion by @ritchie46:
import polars as pl

pl.Config.set_global_string_cache()

data1 = pl.read_parquet('data1.parquet.gzip').lazy()
data2 = pl.read_parquet('data2.parquet.gzip').lazy()

data1 = data1.with_column(pl.col('source_post_title').cast(pl.Categorical))
data2 = data2.with_column(pl.col('source_post_title').cast(pl.Categorical))

dataall = data1.join(
    data2,
    on='source_post_title',
    how='left'
).collect()
It seems that the code works for a while and then:
Canceled future for execute_request message before replies were done
The Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click here for more info. View Jupyter log for further details.
Is this because my processor itself is too weak?
My CPU is an i7-10850H.
CodePudding user response:
If you have a lot of duplicates in your join keys, the output table can be much larger than either of the tables you are joining.
Things that might help in polars are:

- Use the Categorical datatype, so that duplicate strings are cached.
- Deduplicate your join keys, so that the output table does not explode (if that is allowed for correctness); see the sketch after this list.
- Use the polars lazy API directly from the scan level. That way intermediate results are cleared and won't stay in RAM. Besides that, polars might apply other optimizations that reduce memory pressure.
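
For the deduplication point, a minimal sketch (assuming the join key is title, as in the question, and that keeping a single row per title in data2 is acceptable):

import polars as pl

lf_b = pl.scan_parquet("data2")

# keep one row per join key so the left join cannot multiply rows
lf_b = lf_b.unique(subset=["title"], keep="first")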
If you don't need all of the output data, but, say, only the first x million rows of the join result, you can use polars lazy.
import polars as pl

lf_a = pl.scan_parquet("data1")
# ... some more work on lf_a

lf_b = pl.scan_parquet("data2")
# ... some more work on lf_b

# take only the first million rows
N = int(1e6)

# because of the head operation the join will not materialize a full output table
result = lf_a.join(lf_b, on="title").head(N).collect()