How to match text efficiently between two DataFrames


I have some text data: data1

id      comment                                            title
user_A  good                                               a file name
user_B  a better way is…                                   is there some good sugg?
user_C  a another way is…                                  is there some good sugg?
user_C  I have been using Pandas for a long time, so I…    a book

You can use

pd.read_clipboard()

to replicate it.

data2

userid  title
user_X  is there some good sugg?
user_Y  a great idea…
user_Z  a file name
user_W  a book

desired output

id      comment                                            title                     userid
user_A  good                                               a file name               user_Z
user_B  a better way is…                                   is there some good sugg?  user_X
user_C  a another way is…                                  is there some good sugg?  user_X
user_C  I have been using Pandas for a long time, so I…    a book                    user_W

An easy way is to merge on title. In pandas:

dataall = pd.merge(
    data1, data2,
    on='title',
    how='left'
)

But this is memory-expensive. The shape of data1 is (2942087, 7) (and sometimes it can have more than 3 times as many rows) and the shape of data2 is (47516640, 4). My machine has 32 GB of memory, but that is not enough. I also tried polars. In polars:

dataall = data1.join(
    data2,
    on='title',
    how='left'
)

An error occurs:


Canceled future for execute_request message before replies were done

I have tried the is_in function in polars, and also encoding the text as numbers; both are fast, but I don't know how to turn them into a complete solution.
Is there an efficient and feasible way to do this with pandas/polars/numpy?
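For example, the is_in idea might look roughly like this (a sketch only, using the column name 'title' from the small example above; the real data uses 'source_post_title'), filtering data2 down to the titles that actually occur in data1 before joining:

import polars as pl

# sketch: shrink data2 to only the titles present in data1, then join;
# the column name 'title' is taken from the small example above
data1 = pl.read_parquet('data1.parquet.gzip')
data2 = pl.read_parquet('data2.parquet.gzip')

needed_titles = data1.get_column('title').unique()
data2_small = data2.filter(pl.col('title').is_in(needed_titles))

dataall = data1.join(data2_small, on='title', how='left')

But I am not sure whether this actually solves the memory problem.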

-----edit 2022-5-24 16:00:10, after the suggestion by @ritchie46:

import polars as pl
pl.Config.set_global_string_cache()  # so Categoricals from both frames share one encoding

data1 = pl.read_parquet('data1.parquet.gzip').lazy()
data2 = pl.read_parquet('data2.parquet.gzip').lazy()

# cast the join key to Categorical so duplicate strings are stored only once
data1 = data1.with_column(pl.col('source_post_title').cast(pl.Categorical))
data2 = data2.with_column(pl.col('source_post_title').cast(pl.Categorical))

dataall = data1.join(
    data2,
    on='source_post_title',
    how='left'
).collect()

The code seems to run for a while and then:

Canceled future for execute_request message before replies were done
The Kernel crashed while executing code in the the current cell or a previous cell. Please review the code in the cell(s) to identify a possible cause of the failure. Click here for more info. View Jupyter log for further details.

Is this because my processor itself is too weak? My CPU is an i7-10850H.

CodePudding user response:

If you have a lot of duplicates in your join keys, the output table can be much larger than either of the tables you are joining.
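As a quick sanity check, you can estimate the output row count from the per-key counts before running the join (a small pandas sketch, using the column name 'title' from the example):

import pandas as pd

# estimated rows of a left join on 'title': for every key in data1,
# (count in data1) * (count in data2, or 1 if the key has no match)
n1 = data1['title'].value_counts()
n2 = data2['title'].value_counts()
estimated_rows = (n1 * n2.reindex(n1.index).fillna(1)).sum()
print(estimated_rows)

If that number is far larger than len(data1), duplicate keys are why the join blows up.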

Things that might help in polars are:

  • Using the Categorical datatype, so that duplicate strings are cached.
  • Deduplicating your join keys, so that the output table does not explode (if that is allowed for correctness).
  • Using the polars lazy API directly from the scan level. That way intermediate results are cleared and are not kept in RAM. Besides that, polars may apply other optimizations that reduce memory pressure. A sketch combining these three ideas follows below.
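Roughly, combined (a sketch only, written against the same polars API as the code in your edit, with the file and column names taken from there; dropping duplicate keys assumes one userid per title is acceptable):

import polars as pl

# share one string encoding so the Categorical join works across frames
pl.Config.set_global_string_cache()

lf_a = pl.scan_parquet('data1.parquet.gzip').with_column(
    pl.col('source_post_title').cast(pl.Categorical)
)

lf_b = (
    pl.scan_parquet('data2.parquet.gzip')
    .with_column(pl.col('source_post_title').cast(pl.Categorical))
    .unique(subset=['source_post_title'])  # deduplicate the join keys
)

dataall = lf_a.join(lf_b, on='source_post_title', how='left').collect()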

If you don't need all of the output data, but, say, only the first x million rows of the join result, you can use polars lazy:

import polars as pl

lf_a = pl.scan_parquet("data1")
# ... some more work on lf_a

lf_b = pl.scan_parquet("data2")
# ... some more work on lf_b

# take only the first million rows
N = int(1e6)

# because of the head operation the join will not materialize a full output table
lf_a.join(lf_b, on="title", how="left").head(N).collect()