I have two datasets that look like this:
`df1`:
Date | City | State | Quantity |
---|---|---|---|
2019-01 | Chicago | IL | 35 |
2019-01 | Orlando | FL | 322 |
... | .... | ... | ... |
2021-07 | Chicago | IL | 334 |
2021-07 | Orlando | FL | 4332 |
`df2`:
Date | City | State | Sales |
---|---|---|---|
2019-01 | Chicago | IL | 30 |
2019-01 | Orlando | FL | 319 |
... | ... | ... | ... |
2021-07 | Chicago | IL | 331 |
2021-07 | Orlando | FL | 4000 |
They are EXTREMELY large datasets, to the point where `pd.merge()` and `dd.merge()` do not work and my kernel gives me memory errors. However, I found that concatenating the two does not give me the memory error. My desired dataset, `out2`, looks like this:
Date | City | State | Quantity | Sales |
---|---|---|---|---|
2019-01 | Chicago | IL | 35 | 30 |
2019-01 | Orlando | FL | 322 | 319 |
... | ... | ... | ... | ... |
2021-07 | Chicago | IL | 334 | 331 |
2021-07 | Orlando | FL | 4332 | 4000 |
I used the following code:

```python
out2 = dd.concat([df1, df2], join='outer')
```
but my new dataset looks like this:
Date | City | State | Quantity | Sales |
---|---|---|---|---|
2019-01 | Chicago | IL | 35 | NaN |
2019-01 | Orlando | FL | 322 | NaN |
2019-01 | Chicago | IL | NaN | 30 |
2019-01 | Orlando | FL | NaN | 319 |
... | ... | ... | ... | ... |
2021-07 | Chicago | IL | 334 | NaN |
2021-07 | Orlando | FL | 4332 | NaN |
2021-07 | Chicago | IL | NaN | 331 |
2021-07 | Orlando | FL | NaN | 4000 |
How can I get my desired dataset without running into memory errors, and without using the `pd.merge` function?
CodePudding user response:
If performance is not critical, you could build a `defaultdict` of dicts, using the first three fields of each row as the key and adding the Quantity and Sales values to the value dict. This lets you process the files line by line, so neither full DataFrame ever has to be loaded into memory.
```python
from collections import defaultdict
from pathlib import Path

import pandas as pd

paths = [(Path.home() / 'file1.csv', 'Quantity'), (Path.home() / 'file2.csv', 'Sales')]

results = defaultdict(dict)
for path, value_column in paths:
    with path.open('r') as f:
        next(f)  # skip the header row
        for line in f:
            parts = [s.strip() for s in line.split(',')]
            key = tuple(parts[:-1])           # (Date, City, State)
            results[key][value_column] = parts[-1]

# Rebuild a single DataFrame: the keys become the Date/City/State columns,
# the value dicts become the Quantity and Sales columns.
combined = pd.concat([pd.DataFrame(data=list(results.keys()), columns=['Date', 'City', 'State']),
                      pd.DataFrame(list(results.values()))], axis=1)
```
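A quick way to sanity-check the approach is to run the same streaming logic over two tiny in-memory files. This is a minimal sketch: the `io.StringIO` stand-ins and their sample values are illustrative, not the asker's real data, and the values are cast to `int` here for readability.

```python
import io
from collections import defaultdict

import pandas as pd

# Tiny in-memory stand-ins for the two (much larger) CSV files.
file1 = io.StringIO("Date,City,State,Quantity\n2019-01,Chicago,IL,35\n2019-01,Orlando,FL,322\n")
file2 = io.StringIO("Date,City,State,Sales\n2019-01,Chicago,IL,30\n2019-01,Orlando,FL,319\n")

results = defaultdict(dict)
for f, value_column in [(file1, 'Quantity'), (file2, 'Sales')]:
    next(f)  # skip the header row
    for line in f:
        parts = [s.strip() for s in line.split(',')]
        results[tuple(parts[:-1])][value_column] = int(parts[-1])

combined = pd.concat([pd.DataFrame(list(results.keys()), columns=['Date', 'City', 'State']),
                      pd.DataFrame(list(results.values()))], axis=1)
# combined now has one row per (Date, City, State) with both value columns aligned.
```

Because the value dicts are keyed by the full (Date, City, State) tuple, rows from the two files land in the same record, which is exactly the outer-merge behavior the question asks for.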