I have two datasets that look like this:
`df1`:
Date | City | State | Quantity |
---|---|---|---|
2019-01 | Chicago | IL | 35 |
2019-01 | Orlando | FL | 322 |
... | .... | ... | ... |
2021-07 | Chicago | IL | 334 |
2021-07 | Orlando | FL | 4332 |
`df2`:
Date | City | State | Sales |
---|---|---|---|
2019-01 | Chicago | IL | 30 |
2019-01 | Orlando | FL | 319 |
... | ... | ... | ... |
2021-07 | Chicago | IL | 331 |
2021-07 | Orlando | FL | 4000 |
They are EXTREMELY large datasets, to the point where `pd.merge()` and `dd.merge()` do not work and my kernel gives me memory errors. However, I found that concatenating the two does not give me the memory error. My desired dataset, `out2`, looks like this:
Date | City | State | Quantity | Sales |
---|---|---|---|---|
2019-01 | Chicago | IL | 35 | 30 |
2019-01 | Orlando | FL | 322 | 319 |
... | ... | ... | ... | ... |
2021-07 | Chicago | IL | 334 | 331 |
2021-07 | Orlando | FL | 4332 | 4000 |
I used the following code:

```python
out2 = dd.concat([df1, df2], join='outer')
```
but my new dataset looks like this:
Date | City | State | Quantity | Sales |
---|---|---|---|---|
2019-01 | Chicago | IL | 35 | NaN |
2019-01 | Orlando | FL | 322 | NaN |
2019-01 | Chicago | IL | NaN | 30 |
2019-01 | Orlando | FL | NaN | 319 |
... | ... | ... | ... | ... |
2021-07 | Chicago | IL | 334 | NaN |
2021-07 | Orlando | FL | 4332 | NaN |
2021-07 | Chicago | IL | NaN | 331 |
2021-07 | Orlando | FL | NaN | 4000 |
How can I get my desired dataset without running into memory errors, and without using the `pd.merge` function?
CodePudding user response:
If performance is not critical, you could build a `defaultdict` of dicts, using the first three fields of each row as the key and adding the Quantity and Sales values to the value dict. This lets you process the files line by line, so neither full DataFrame ever has to be loaded into memory.
```python
from collections import defaultdict
from pathlib import Path

import pandas as pd

paths = [(Path.home() / 'file1.csv', 'Quantity'), (Path.home() / 'file2.csv', 'Sales')]

results = defaultdict(dict)
for path, value_column in paths:
    with path.open('r') as f:
        next(f)  # skip the header row
        for line in f:
            parts = [s.strip() for s in line.split(',')]
            key = tuple(parts[:-1])           # (Date, City, State)
            results[key][value_column] = parts[-1]

# Rebuild a single DataFrame: the keys become the Date/City/State columns,
# the value dicts become the Quantity and Sales columns.
combined = pd.concat([pd.DataFrame(data=list(results.keys()), columns=['Date', 'City', 'State']),
                      pd.DataFrame(list(results.values()))], axis=1)
```
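A quick way to sanity-check the approach is to run the same streaming logic over two tiny in-memory files. This is a minimal sketch: the `io.StringIO` stand-ins and their sample values are illustrative, not the asker's real data, and the values are cast to `int` here for readability.

```python
import io
from collections import defaultdict

import pandas as pd

# Tiny in-memory stand-ins for the two (much larger) CSV files.
file1 = io.StringIO("Date,City,State,Quantity\n2019-01,Chicago,IL,35\n2019-01,Orlando,FL,322\n")
file2 = io.StringIO("Date,City,State,Sales\n2019-01,Chicago,IL,30\n2019-01,Orlando,FL,319\n")

results = defaultdict(dict)
for f, value_column in [(file1, 'Quantity'), (file2, 'Sales')]:
    next(f)  # skip the header row
    for line in f:
        parts = [s.strip() for s in line.split(',')]
        results[tuple(parts[:-1])][value_column] = int(parts[-1])

combined = pd.concat([pd.DataFrame(list(results.keys()), columns=['Date', 'City', 'State']),
                      pd.DataFrame(list(results.values()))], axis=1)
# combined now has one row per (Date, City, State) with both value columns aligned.
```

Because the value dicts are keyed by the full (Date, City, State) tuple, rows from the two files land in the same record, which is exactly the outer-merge behavior the question asks for.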