I am trying to read a 34 GB Stata (.dta) file, but I keep getting a "MemoryError"; it's obvious that my 16 GB of RAM is not enough.
I tried to test an 11 MB Stata file with:
import pandas as pd

dtafile = 'E:/test file.dta'
df = pd.read_stata(dtafile)
a = df.head()
print(a)
I got the correct output:
app_id inventor_id ... lagged_generality_FYnormalized _merge
0 101985 ... 1.038381 matched (3)
1 102019 SCHOTTEK 2827 ... 0.830110 matched (3)
2 102019 KUELLMER 2827 ... 0.830110 matched (3)
3 102019 DICKNER 2827 ... 0.830110 matched (3)
4 102562 VINEGAR 986 ... 0.825088 matched (3)
[5 rows x 1448 columns]
Process finished with exit code 0
But when I tried the same with the 34 GB file, I got a "MemoryError". The full traceback is:
Traceback (most recent call last):
File "C:\Users\Gaju\PycharmProjects\first project\work.py", line 8, in <module>
df = pd.read_stata(dtafile)
File "C:\Users\Gaju\PycharmProjects\first project\venv\lib\site-packages\pandas\util\_decorators.py", line 317, in wrapper
return func(*args, **kwargs)
File "C:\Users\Gaju\PycharmProjects\first project\venv\lib\site-packages\pandas\io\stata.py", line 2021, in read_stata
reader = StataReader(
File "C:\Users\Gaju\PycharmProjects\first project\venv\lib\site-packages\pandas\io\stata.py", line 1172, in __init__
self.path_or_buf = BytesIO(handles.handle.read())
MemoryError
Process finished with exit code 1
CodePudding user response:
By the looks of it, pandas's Stata parser presently always reads the entire file into memory and wraps it in an in-memory stream (the BytesIO call visible in your traceback).
This is apparently a bit of a regression – if I'm reading this diff correctly, the parser was previously able to just use a file stream from disk.
EDIT: Coincidentally, someone has raised a bug report about this recently: https://github.com/pandas-dev/pandas/issues/48700
EDIT 2: I figured I could just as well try to fix this: https://github.com/pandas-dev/pandas/pull/48922
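In the meantime, a common workaround for very large .dta files is to avoid materializing the whole DataFrame at once: read_stata accepts a chunksize argument that yields the data in batches, and a columns argument that restricts reading to the variables you actually need. A minimal sketch follows (the column names are just examples taken from your head() output; note that on pandas versions affected by the regression above, the raw file is still buffered into memory first, so chunking only fully helps on versions without that regression):

import pandas as pd

dtafile = 'E:/test file.dta'

# Read only two of the 1448 columns, 100,000 rows at a time.
with pd.read_stata(dtafile, columns=['app_id', 'inventor_id'],
                   chunksize=100_000) as reader:
    for chunk in reader:
        # Process each chunk here instead of holding everything in RAM.
        print(chunk.shape)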
CodePudding user response:
There are a couple of libraries that are made to work seamlessly with pandas (i.e. they let you use the traditional pandas API) but are optimized for big files.
Here's a link
Additionally, if you don't want to read the article, just try:
# pip install "modin[dask]"
import modin.pandas as pd
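With that import in place, the original snippet should run unchanged, since Modin mirrors the pandas API. A sketch (with one caveat that is my assumption rather than something from the article: for readers Modin doesn't implement natively, possibly including read_stata, it falls back to plain pandas, in which case the memory behaviour is the same):

# pip install "modin[dask]"
import modin.pandas as pd  # drop-in replacement for the pandas namespace

dtafile = 'E:/test file.dta'
df = pd.read_stata(dtafile)  # same call as in the question
print(df.head())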