I am working on analyzing some very large files (~200 million rows):

import pandas as pd

csv_filename = pd.read_csv('filename.txt', sep="\t", error_bad_lines=False)

The program runs for about half an hour before I get this error message:
MemoryError: Unable to allocate 3.25 GiB for an array with shape (7, 62388743) and data type object
I'm wondering if there is a way to bypass this memory error, or if there is a different function I can use that won't require as much memory. I have tried splitting the file into pieces, but the issue with that is that I need all of the data in one dataframe so that I can analyze it as a whole.
CodePudding user response:
You can get a computer with more memory, simplify/aggregate the data, or try a different tool.
If I were doing this, working with a file that large, I would switch over to Spark and use the PySpark API.
PySpark gives you pandas-like DataFrame operations while letting the data exceed the memory available on a single machine; it is built for exactly this kind of big-data workload.
http://spark.apache.org/docs/latest/api/python/
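For example, a minimal PySpark sketch, assuming Spark is installed and the file is tab-separated with a header row (field1 below is just a placeholder column name):

from pyspark.sql import SparkSession, functions as F

spark = SparkSession.builder.appName("large-tsv").getOrCreate()

# Spark reads the file lazily and in partitions, so all 200 million rows
# never have to sit in memory at once.
df = spark.read.csv("filename.txt", sep="\t", header=True, inferSchema=True)

# example: a whole-file aggregation, computed partition by partition
# ("field1" is a placeholder column name)
df.groupBy("field1").agg(F.count("*").alias("rows")).show()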
CodePudding user response:
In [49]: 7*62388743*8/1e9
Out[49]: 3.493769608
There's your 3.25 GiB: 3.49 GB is 3.25 GiB, and that's just the array of 8-byte pointers to the objects, not the objects themselves.
('field1',  dtype('int64'))
('field2',  dtype('int64'))
('field3',  dtype('O'))
('field4',  dtype('O'))
('field5',  dtype('O'))
('field6',  dtype('O'))
('field7',  dtype('O'))
('field8',  dtype('O'))
('field9',  dtype('O'))
('field10', dtype('int64'))
('field11', dtype('float64'))
('field12', dtype('int64'))
I count 7 dtype('O') fields; that could account for the 7 in the shape, and the other value is presumably the row count. The int64 and float64 fields will be put in their own arrays.
In pandas, columns containing strings have object dtype; that is, the strings are stored as regular Python string objects, hence the array of pointers. By that field list, fields 1, 10, 11 and 12 are all numeric, but pandas may still load them as object if the contents aren't clean, e.g. if they contain stray strings, None, NA, etc.
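For a quick sanity check of that object-column overhead, you can compare what pandas reports for an object column vs. an int64 one. A small illustrative sketch (toy data, not the asker's file):

import numpy as np
import pandas as pd

n = 1_000_000
df = pd.DataFrame({
    "as_str": np.arange(n).astype(str),   # stored as object: one Python str per cell
    "as_int": np.arange(n),               # int64: a flat 8-bytes-per-cell array
})

# deep=True counts the Python string objects, not just the pointer array
print(df.dtypes)
print(df.memory_usage(deep=True))

# if a "numeric" column came in as object, converting it back saves memory;
# errors="coerce" turns anything non-numeric into NaN so bad cells can be found
df["as_str"] = pd.to_numeric(df["as_str"], errors="coerce")
print(df.memory_usage(deep=True))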
CodePudding user response:
You can limit the number of columns with usecols; this will reduce the memory footprint. You also seem to have some bad data in the CSV file that makes columns you expect to be int64 come out as object. These could be empty cells or any non-digit values. Here is an example that reads the csv and then scans for bad data. It uses commas rather than tabs because that's a bit easier to demonstrate.
import pandas as pd
import numpy as np
import io
import re

test_csv = io.StringIO("""field1,field2,field3,other
1,2,3,this
4,what?,6,is
7,,9,extra""")

_numbers_re = re.compile(r"\d+$")

df = pd.read_csv(test_csv, sep=",", error_bad_lines=False,
                 usecols=['field1', 'field2', 'field3'])
print(df)

# columns that aren't int64
bad_cols = list(df.dtypes[df.dtypes != np.dtype('int64')].index)
if bad_cols:
    print("bad cols", bad_cols)
    for bad_col in bad_cols:
        col = df[bad_col]
        # rows whose value is not purely digits (empty cells / NaN also fail this test)
        bad = col[col.str.match(_numbers_re) != True]
        print(bad)
    exit(1)
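If even usecols isn't enough for the real file, the same dtype scan can be done chunk by chunk so only a slice of the file is in memory at a time. A sketch, assuming the tab-separated file from the question has a header row and that these column names are placeholders:

import numpy as np
import pandas as pd

# placeholder column names -- replace with the columns you actually need
wanted = ['field1', 'field2', 'field3']

# chunksize keeps roughly 1M rows in memory at a time
for chunk in pd.read_csv('filename.txt', sep="\t", usecols=wanted,
                         chunksize=1_000_000):
    bad_cols = list(chunk.dtypes[chunk.dtypes != np.dtype('int64')].index)
    if bad_cols:
        print("bad columns in this chunk:", bad_cols)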
CodePudding user response:
You could read your csv file line by line and calculate metrics as you consume the iterator, without growing memory usage; a bare-bones sketch of that idea is below. But you don't have to do it from scratch.
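A minimal from-scratch sketch of that streaming approach, assuming a tab-separated file and a numeric column named field2 (the file name and column name are placeholders):

import csv

total = 0.0
count = 0
max_value = float("-inf")

with open("filename.txt", newline="") as f:
    reader = csv.reader(f, delimiter="\t")
    header = next(reader)
    field_2_index = header.index("field2")  # placeholder column name
    for row in reader:
        value = float(row[field_2_index])
        total += value
        count += 1
        if value > max_value:
            max_value = value

print({"count": count, "avg": total / count, "max": max_value})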
Have a look at convtools: it is a lightweight Python library which allows you to define conversions, and once you are done it writes and compiles ad-hoc Python code under the hood, so you get a function that does exactly what you asked for.
I generated a test csv file (81 MB; 2M rows; 3 columns) using the code below:
import csv
from tqdm import tqdm
from random import random, choice
field_1_values = [
    "abcde1234567890",
    "cde1234567890",
    "def1234567890",
    "fgh1234567890",
]

with open("input_data-small.csv", "w") as f:
    writer = csv.writer(f)
    writer.writerow(["field_1", "field_2", "field_3"])
    for index in tqdm(range(2000000)):
        writer.writerow([choice(field_1_values), random(), index])
Let's process it using convtools to find, e.g.:
- unique values of field_1
- first value of field_2
- max value of field_2
- avg value of field_2
- row holding min value of field_2
import csv
# pip install convtools
from convtools import conversion as c
# pip install tqdm (for a nice progress bar)
from tqdm import tqdm
# define optional conversions to prepare data
column_to_prepare_conversion = {
    "field_2": c.this().as_type(float),
    "field_3": c.call_func(int, c.this()),
}

with open("input_data-small.csv", "r") as f:
    # csv.DictReader is slow, so let's work with raw rows
    reader = csv.reader(f)
    # fetch column names
    column_names = next(reader)

    # adding some free syntactic sugar (no overhead in compiled code):
    # allows referencing columns by names, while indexes are used under the hood
    def column_(column_name):
        return c.item(column_names.index(column_name))

    if column_to_prepare_conversion:
        prepare_input = c.iter(
            # creates an iterable of prepared tuples
            tuple(
                column_(column_name).pipe(
                    column_to_prepare_conversion[column_name]
                )
                if column_name in column_to_prepare_conversion
                else column_(column_name)
                for column_name in column_names
            )
        )
    else:
        prepare_input = c.this()

    # let's aggregate something
    converter = prepare_input.pipe(
        # of course there is group_by too: https://convtools.readthedocs.io/en/latest/cheatsheet.html#group-by-simple
        c.aggregate(
            {
                "unique_field_1": c.ReduceFuncs.ArrayDistinct(
                    column_("field_1")
                ),
                "first_field_2": c.ReduceFuncs.First(column_("field_2")),
                "max_field_2": c.ReduceFuncs.Max(column_("field_2")),
                "avg_field_2": c.ReduceFuncs.Average(column_("field_2")),
                "row_with_min_field_2": c.ReduceFuncs.MinRow(
                    column_("field_2")
                ),
            }
        )
    ).gen_converter(
        # if black is installed, it will print black-formatted code
        debug=True
    )

    results = converter(tqdm(reader))

print(results)
RESULTS:
# 2000000it [00:04, 462371.13it/s]
{
    "unique_field_1": [
        "cde1234567890",
        "fgh1234567890",
        "abcde1234567890",
        "def1234567890",
    ],
    "first_field_2": 4.149475385772927e-05,
    "max_field_2": 0.9999996797416377,
    "avg_field_2": 0.49995239963138766,
    "row_with_min_field_2": ("fgh1234567890", 3.6033313821626223e-07, 425211),
}
So, if my test file had 200 million rows, then at ~462K rows/s the processing would take ~433 seconds (7 min 13 s).
JFYI: simply reading the file, without any data preparation or calculations, takes ~203 seconds (3 min 23 s) at that scale:
with open("input_data-small.csv", "r") as f:
reader = csv.reader(f)
for row in tqdm(reader):
pass
# 2000001it [00:02, 983404.81it/s]