How to read a 34 GB Stata (.dta) file in Python


I am trying to read a 34 GB Stata (.dta) file, but I keep getting a "MemoryError" message. It is obvious that my 16 GB of RAM is not enough.

As a test, I tried an 11 MB Stata file with:

import pandas as pd

dtafile = 'E:/test file.dta'
df = pd.read_stata(dtafile)  # load the whole file into a DataFrame
a = df.head()                # first five rows
print(a)

I got the correct output as:

   app_id    inventor_id  ...  lagged_generality_FYnormalized       _merge
0  101985                 ...                        1.038381  matched (3)
1  102019  SCHOTTEK 2827  ...                        0.830110  matched (3)
2  102019  KUELLMER 2827  ...                        0.830110  matched (3)
3  102019   DICKNER 2827  ...                        0.830110  matched (3)
4  102562    VINEGAR 986  ...                        0.825088  matched (3)

[5 rows x 1448 columns]

Process finished with exit code 0

But when I tried the same with the 34 GB file, I got a "MemoryError". The full error message is:

Traceback (most recent call last):
  File "C:\Users\Gaju\PycharmProjects\first project\work.py", line 8, in <module>
    df = pd.read_stata(dtafile)
  File "C:\Users\Gaju\PycharmProjects\first project\venv\lib\site-packages\pandas\util\_decorators.py", line 317, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\Gaju\PycharmProjects\first project\venv\lib\site-packages\pandas\io\stata.py", line 2021, in read_stata
    reader = StataReader(
  File "C:\Users\Gaju\PycharmProjects\first project\venv\lib\site-packages\pandas\io\stata.py", line 1172, in __init__
    self.path_or_buf = BytesIO(handles.handle.read())
MemoryError

Process finished with exit code 1

CodePudding user response:

By the looks of it, pandas's Stata parser currently always reads the entire file into memory and wraps it in an in-memory stream; that is exactly the BytesIO(handles.handle.read()) call shown in the traceback.

This is apparently a bit of a regression – if I'm reading this diff correctly, the parser was previously able to use a file stream from disk directly.

EDIT: Coincidentally, someone has raised a bug report about this recently: https://github.com/pandas-dev/pandas/issues/48700

EDIT 2: I figured I could just as well try to fix this: https://github.com/pandas-dev/pandas/pull/48922
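
In the meantime, chunked reading may help, assuming you are on a pandas version where the reader still streams from disk (i.e. before the regression, or once the fix above is in). Here is a minimal sketch using read_stata's chunksize parameter; the column selection is only an illustration, reusing names from the output above:

import pandas as pd

dtafile = 'E:/test file.dta'

# Process the file 100,000 rows at a time instead of loading all 34 GB.
# Each chunk is an ordinary DataFrame, so you can filter or aggregate it
# and keep only what you actually need in memory.
results = []
with pd.read_stata(dtafile, chunksize=100_000) as reader:
    for chunk in reader:
        results.append(chunk[['app_id', 'inventor_id']])  # illustrative subset

df = pd.concat(results, ignore_index=True)
print(df.head())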

CodePudding user response:

There are a couple of libraries that work seamlessly with pandas (i.e. they let you keep the traditional pandas API) but are optimized for big files.

Here's a link

Additionally, if you don't want to read the article, just try:

# pip install "modin[dask]"
import modin.pandas as pd  # drop-in replacement for the pandas API
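
Then the original script should work unchanged. A quick sketch (one caveat: modin falls back to plain pandas for operations it has not distributed yet, so it is worth verifying that read_stata is actually partitioned before relying on this for a 34 GB file):

import modin.pandas as pd  # instead of: import pandas as pd

dtafile = 'E:/test file.dta'
df = pd.read_stata(dtafile)  # same pandas API, backed by Dask where supported
print(df.head())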