select row in heavy csv-CodePudding

i search how can i select some row with word in line so i use this script

import pandas
import datetime


df = pandas.read_csv(
    r"C:StockEtablissement_utf8(1)\StockEtablissement_utf8.csv",
    sep=",",
)


communes = ["PERPIGNAN"]


print()

df = df[~df["libelleCommuneEtablissement"].isin(communes)]

print()

so my script work well with a normal csv

but with a heavy Csv (4Go) the scipt say :

Traceback (most recent call last):
  File "C:lafinessedufiness.py", line 5, in <module>
    df = pandas.read_csv(r'C:StockEtablissement_utf8(1)\StockEtablissement_utf8.csv',
  File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\util\_decorators.py", line 311, in wrapper
    return func(*args, **kwargs)
  File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers\readers.py", line 680, in read_csv
    return _read(filepath_or_buffer, kwds)
  File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers\readers.py", line 581, in _read
    return parser.read(nrows)
  File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers\readers.py", line 1250, in read
    index, columns, col_dict = self._engine.read(nrows)
  File "C:\Users\\AppData\Local\Programs\Python\Python38-32\lib\site-packages\pandas\io\parsers\c_parser_wrapper.py", line 225, in read
    chunks = self._reader.read_low_memory(nrows)
  File "pandas\_libs\parsers.pyx", line 805, in pandas._libs.parsers.TextReader.read_low_memory
  File "pandas\_libs\parsers.pyx", line 883, in pandas._libs.parsers.TextReader._read_rows
  File "pandas\_libs\parsers.pyx", line 1026, in pandas._libs.parsers.TextReader._convert_column_data
  File "pandas\_libs\parsers.pyx", line 1072, in pandas._libs.parsers.TextReader._convert_tokens
  File "pandas\_libs\parsers.pyx", line 1172, in pandas._libs.parsers.TextReader._convert_with_dtype
  File "pandas\_libs\parsers.pyx", line 1731, in pandas._libs.parsers._try_int64
MemoryError: Unable to allocate 128. KiB for an array with shape (16384,) and data type int64

do you know how can i fix this error please?

CodePudding user response：

The pd.read_csv() function has an option to read the file in chunks, rather than loading it all at once. Use iterator=True and specify a reasonable chunk size (rows per chunk).

import pandas as pd

path = r'C:StockEtablissement_utf8(1)\StockEtablissement_utf8.csv'
it = pd.read_csv(path, sep=',', iterator=True, chunksize=10_000)

communes = ['PERPIGNAN']

filtered_chunks = []
for chunk_df in it:
    chunk_df = chunk_df.query('"libelleCommuneEtablissement" not in @communes')
    filtered_chunks.append(chunk_df)

df = pd.concat(filtered_chunks)

CodePudding user response：

As you can see, you don't have enough memory available for Pandas to load that file entirely into memory.

One reason is that based on Python38-32 in the traceback, you're running a 32-bit version of Python, where 4 gigabytes (or is it 3 gigabytes?) is the limit for memory allocations anyway. If your system is 64-bit, you should switch to the 64-bit version of Python, so that's one obstacle less.

If that doesn't help, you'll just also need more memory. You could configure Windows's virtual memory, or buy more actual memory and install it in your system.

If those don't help, then you'll have to come up with a better approach than to load the big CSV entirely into memory.

For one, if you really only care about rows with the string PERPIGNAN (no matter the column; you can really filter it again in your code), you could do grep PERPIGNAN data.csv > data_perpignan.csv and work with that (assuming you have grep; you can do the same filtering with a short Python script).

Since read_csv() accepts any iterable of lines, you can also just do something like

def lines_from_file_including_strings(file, strings):
    for i, line in enumerate(file):
        if i == 0 or any(string in line for string in strings):
            yield line


communes = ["PERPIGNAN", "PARIS"]

with open("StockEtablissement_utf8.csv") as f:
  df = pd.read_csv(lines_from_file_including_strings(f, communes), sep=",")

for an initial filter.