Pandas read_csv fails silently

I am trying to load a flat file into a Python pandas DataFrame, using Python 3.8.3 and pandas 1.0.5.

The read_csv code is like this:

import pandas as pd
df = pd.read_csv(myfile, sep='|', usecols=[0], names=["ID"],
                 dtype=str,
                 encoding='UTF-8',
                 memory_map=True,
                 low_memory=True, engine='c')
print('nb entries:', df["ID"].size)

This gives me a number of entries. However, it does not match the number of lines I get with the following code:

num_lines = sum(1 for line in open(myfile, encoding='UTF-8'))
print('nb lines:', num_lines)

I don't get an error message.

I tried several combinations of options (with and without encoding, low_memory, memory_map, and warn_bad_lines, and with the c engine as well as the default one), but I always got the same erroneous count.
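For example, one of the variants looked roughly like this (a sketch; in pandas 1.0.5 the bad-line switches are error_bad_lines and warn_bad_lines):

import pandas as pd

# Surface malformed lines as warnings instead of letting them
# disappear silently (pandas 1.0.x options).
df = pd.read_csv(myfile, sep='|', usecols=[0], names=["ID"],
                 dtype=str,
                 error_bad_lines=False,  # skip lines with too many fields...
                 warn_bad_lines=True)    # ...but report each one on stderr
print('nb entries:', df["ID"].size)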

By changing the nrows parameter I identified where in the file the problem seems to be. I copied the lines of interest into a test file and re-ran the code on it. This time I got the correct result.
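A related check (a sketch using only the standard csv module): count logical records with a quote-aware parser and compare that to the raw line count. If the two disagree, stray quote characters in the data are making the parser fold several physical lines into one record, which would explain the mismatch without any error message:

import csv

# Records as a CSV parser sees them (quote-aware, same '|' delimiter)
with open(myfile, encoding='UTF-8', newline='') as f:
    n_records = sum(1 for _ in csv.reader(f, delimiter='|'))

# Raw physical lines
with open(myfile, encoding='UTF-8') as f:
    n_lines = sum(1 for _ in f)

print('records:', n_records, 'physical lines:', n_lines)

If that turns out to be the cause, passing quoting=csv.QUOTE_NONE to read_csv disables quote handling and should make the two counts agree.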

Now I realize that my machine is a little short on memory, so maybe some allocation is failing silently. Would there be a way to test for that? I tried running the script without any other applications open, but I got the same erroneous results.
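One rough way to test the memory theory (a sketch for Unix-like systems; ru_maxrss is reported in kilobytes on Linux and in bytes on macOS): record the process's peak resident set size around the read_csv call and see how close it gets to the machine's RAM. Note that a genuinely failed allocation would normally raise a MemoryError rather than drop rows silently:

import resource

import pandas as pd

before = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
df = pd.read_csv(myfile, sep='|', usecols=[0], names=["ID"], dtype=str)
after = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss

# ru_maxrss = peak resident set size (kB on Linux, bytes on macOS)
print('peak RSS before:', before, 'after:', after)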

How should I troubleshoot this type of problem?

CodePudding user response:

Something like this could be used to read the file in chunks:

import numpy as np
import pandas as pd

# Number of data rows: total lines minus the header line.
n_rows = sum(1 for _ in open("./test.csv", encoding='UTF-8')) - 1
chunk_size = 300
n_chunks = int(np.ceil(n_rows / chunk_size))

read_lines = 0
for chunk_idx in range(n_chunks):
    # skiprows jumps past everything already read; header=0 then takes
    # the next line as this chunk's header row.
    df = pd.read_csv("./test.csv", header=0,
                     skiprows=chunk_idx * chunk_size, nrows=chunk_size)
    read_lines += len(df)

print(read_lines)
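As an aside (not part of the original answer), pandas can also do the chunking itself via the chunksize parameter, which returns an iterator of DataFrames and avoids re-opening the file for each chunk:

import pandas as pd

# read_csv with chunksize yields successive DataFrames of up to 300 rows
read_lines = 0
for chunk in pd.read_csv("./test.csv", header=0, chunksize=300):
    read_lines += len(chunk)

print(read_lines)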