How to join two very large dataframes together with same columns?


I have two datasets that look like this:

df1:

Date City State Quantity
2019-01 Chicago IL 35
2019-01 Orlando FL 322
... .... ... ...
2021-07 Chicago IL 334
2021-07 Orlando FL 4332

df2:

Date City State Sales
2019-01 Chicago IL 30
2019-01 Orlando FL 319
... ... ... ...
2021-07 Chicago IL 331
2021-07 Orlando FL 4000

These are extremely large datasets, to the point where pd.merge() and dd.merge() fail and my kernel raises memory errors. However, I found that concatenating the two does not trigger the memory error. My desired dataset, out2, looks like this:

Date City State Quantity Sales
2019-01 Chicago IL 35 30
2019-01 Orlando FL 322 319
... ... ... ... ...
2021-07 Chicago IL 334 331
2021-07 Orlando FL 4332 4000

I used the following code:

out2 = dd.concat([df1, df2], join='outer')

but my new dataset looks like this:

Date City State Quantity Sales
2019-01 Chicago IL 35 NaN
2019-01 Orlando FL 322 NaN
2019-01 Chicago IL NaN 30
2019-01 Orlando FL NaN 319
... ... ... ... ...
2021-07 Chicago IL 334 NaN
2021-07 Orlando FL 4332 NaN
2021-07 Chicago IL NaN 331
2021-07 Orlando FL NaN 4000

How can I get my desired dataset without running into memory errors, and without using pd.merge?

CodePudding user response:

If performance is not critical, you could build a defaultdict of dicts, use the first three columns (Date, City, State) as the key, and add Quantity and Sales to the value dict. This lets you process both files line by line instead of loading each one into a DataFrame first.

from collections import defaultdict
from pathlib import Path

import pandas as pd

# (path, name of the value column held in that file)
paths = [(Path.home() / 'file1.csv', 'Quantity'), (Path.home() / 'file2.csv', 'Sales')]
results = defaultdict(dict)

for path, value_column in paths:
    with path.open('r') as f:
        next(f)  # skip the header row
        for line in f:
            parts = [s.strip() for s in line.split(',')]
            key = tuple(parts[:-1])                  # (Date, City, State)
            results[key][value_column] = parts[-1]   # Quantity or Sales value

# Rebuild a single DataFrame: the keys become the Date/City/State columns,
# the value dicts become the Quantity/Sales columns.
combined = pd.concat([pd.DataFrame(data=list(results.keys()), columns=['Date', 'City', 'State']),
                      pd.DataFrame(list(results.values()))], axis=1)
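
As a follow-up sketch, not part of the original answer: if the dd.concat call from the question already runs without memory errors, the stacked result can be collapsed back into the desired shape by grouping on the key columns. This assumes each (Date, City, State) combination appears at most once in each input frame, and note that the groupby shuffle can itself be memory-hungry when there are very many groups.

import dask.dataframe as dd

# Stack the two frames as in the question, then collapse the duplicated keys:
# within each (Date, City, State) group, max() skips NaN and keeps the single
# real Quantity and Sales value.
stacked = dd.concat([df1, df2], join='outer')
out2 = (
    stacked
    .groupby(['Date', 'City', 'State'])[['Quantity', 'Sales']]
    .max()
    .reset_index()
)

out2.to_csv('combined-*.csv', index=False)  # hypothetical output path; dask writes one file per partition

Writing straight to CSV (or Parquet) avoids materializing the full result in memory at once, which is usually the point of staying in dask for data this size.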