I am trying to load a flat file into a pandas DataFrame. I am using Python 3.8.3 and pandas 1.0.5.
The read_csv code looks like this:
import pandas as pd

df = pd.read_csv(myfile, sep='|', usecols=[0], names=["ID"],
                 dtype=str,
                 encoding='UTF-8',
                 memory_map=True,
                 low_memory=True, engine='c')
print('nb entries:', df["ID"].size)
This gives me a count of entries. However, it does not match the line count I get from the following code:
num_lines = sum(1 for line in open(myfile, encoding='UTF-8'))
print('nb lines:', num_lines)
I don't get an error message.
I tried several option combinations (with/without encoding, with/without low_memory, with/without memory_map, with/without warn_bad_lines, with the C engine specified explicitly or left as the default), but I always got the same erroneous count.
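For concreteness, a sketch of one such variant (the exact flag combinations varied as listed above; error_bad_lines/warn_bad_lines are the pandas 1.0.x parameter names):

import pandas as pd

# One variant among those tried (sketch): python engine instead of the
# default C engine, with bad-line reporting enabled.
df = pd.read_csv(myfile, sep='|', usecols=[0], names=["ID"], dtype=str,
                 encoding='UTF-8', engine='python',
                 error_bad_lines=False, warn_bad_lines=True)
print('nb entries:', df["ID"].size)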
By changing the nrows parameter, I identified where in the file the problem seems to be. I copied the lines of interest into a test file and re-ran the code on it; this time I got the correct result.
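A hedged illustration of that narrowing-down (the row counts and the id_on_line helper are hypothetical, not taken from the real file):

import itertools
import pandas as pd

def id_on_line(path, lineno):
    # Return the first |-separated field on the given physical line.
    with open(path, encoding='UTF-8') as f:
        return next(itertools.islice(f, lineno, None)).split('|')[0]

# If no lines have been merged before row n, the last ID pandas returns
# must equal the ID on the corresponding physical line.
for n in (10_000, 20_000, 40_000, 80_000):
    part = pd.read_csv(myfile, sep='|', usecols=[0], names=["ID"],
                       dtype=str, encoding='UTF-8', nrows=n)
    if part["ID"].iloc[-1] != id_on_line(myfile, n - 1):
        print('divergence at or before row', n)
        break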
Now I realize that my machine is a little short on memory, so maybe some allocation is failing silently. Would there be a way to test for that? I tried running the script without any other applications open, but I got the same erroneous results.
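(Though as far as I know, a failed allocation in CPython raises a MemoryError rather than failing silently.) One way I can think of to measure peak usage is the standard-library resource module; a minimal sketch, Unix-only, where ru_maxrss is in kilobytes on Linux and bytes on macOS:

import resource
import pandas as pd

df = pd.read_csv(myfile, sep='|', usecols=[0], names=["ID"],
                 dtype=str, encoding='UTF-8')
# Peak resident set size of this process so far.
peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
print('peak RSS:', peak)
print('nb entries:', df["ID"].size)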
How should I troubleshoot this type of problem?
CodePudding user response:
Something like this could be used to read the file in chunks:
import pandas as pd
import numpy as np

# Total data rows, excluding the header line.
n_rows = sum(1 for _ in open("./test.csv", encoding='UTF-8')) - 1
chunk_size = 300
n_chunks = int(np.ceil(n_rows / chunk_size))

read_lines = 0
for chunk_idx in range(n_chunks):
    # Skip only the data rows already read, keeping line 0 as the header
    # for every chunk; an integer skiprows would swallow a data row as the
    # header on every chunk after the first.
    df = pd.read_csv("./test.csv", header=0,
                     skiprows=range(1, 1 + chunk_idx * chunk_size),
                     nrows=chunk_size)
    read_lines += len(df)  # accumulate instead of overwriting
print(read_lines)
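Alternatively, a sketch using pandas' built-in chunksize parameter, which returns an iterator of DataFrames and avoids the skiprows bookkeeping entirely (same hypothetical ./test.csv):

import pandas as pd

read_lines = 0
# read_csv with chunksize yields DataFrames of at most 300 rows each;
# the header is consumed once, so no manual skiprows arithmetic is needed.
for chunk in pd.read_csv("./test.csv", header=0, chunksize=300):
    read_lines += len(chunk)
print(read_lines)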