What could be wrong with a Pandas DataFrame?


I couldn't make head or tail of this: I have a function that reads a bunch of CSV files from an S3 bucket, concatenates them and returns the resulting DataFrame:

import pandas as pd

def create_df():
  df1 = pd.read_csv(s3_path + 'file_1.csv')
  df2 = pd.read_csv(s3_path + 'file_2.csv')

  return pd.concat([df1, df2], ignore_index=True)

My second function performs aggregation:

def aggregate_values(df):
  columns = ['col_c', 'col_d']
  new_df = df.groupby(columns, as_index=False) \
    .agg({'col_a': 'sum', 'col_b': 'mean'})

  return new_df

The function aggregate_values fails with an out-of-memory (OOM) error.

df = create_df()

# OOM error !!!
new_df = aggregate_values(df)

The curious thing is that if I write the DataFrame out to my local file system and then read it back in, the aggregation works without a glitch on the new DataFrame.

df = create_df()
df.to_csv('path_to_store/f.csv', index=False)

df2 = pd.read_csv('path_to_store/f.csv')

# works fine!!!
new_df = aggregate_values(df2)

My guess is that there is something wrong with the DataFrame returned by create_df(). By writing it out and reading it back in, Pandas somehow corrects the problem. But I want to find out exactly what is wrong with the DataFrame.

How do I go about debugging this problem?

Edited

I have 32 GB of RAM on the machine running the code. The DataFrame has about 2 million records and takes about 0.5 GB both on disk and in memory.
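
In case it is useful, here is roughly how the in-memory footprint can be checked (a minimal sketch; df is the frame returned by create_df()):

# Total in-memory size of the concatenated DataFrame, in bytes.
mem_bytes = df.memory_usage(deep=True).sum()
print(f'{mem_bytes / 1e9:.2f} GB in memory, {len(df)} rows')

# Per-column dtypes and sizes, handy when hunting for surprises.
print(df.dtypes)
print(df.memory_usage(deep=True))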

CodePudding user response:

You didn't say how much memory you have, or how big the CSVs and DataFrames are, but I can hazard some guesses. A few observations:

  1. After concatenating, you have an opportunity to nuke df1 and df2. Simply assign None to them to reclaim some RAM (see the sketch after this list).
  2. The .groupby() possibly accepted dense NumPy arrays and produced higher-overhead sparse arrays. You might want to look into that.
  3. Computing the mean might have turned e.g. int8 into float64, which clearly consumes more space.
  4. Here's the biggest item: .groupby() likely returned a (mutable) view rather than a brand-new NumPy array. The round trip to the filesystem fixes that, but a simple .copy() would have the same effect, so try that.
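
Here is a rough sketch of observations 1 and 4 applied to your two functions (s3_path is a placeholder taken from the question, and whether either change actually helps depends on your data):

import pandas as pd

def create_df():
  df1 = pd.read_csv(s3_path + 'file_1.csv')
  df2 = pd.read_csv(s3_path + 'file_2.csv')

  combined = pd.concat([df1, df2], ignore_index=True)

  # Observation 1: drop the intermediate frames so their memory
  # can be reclaimed before returning.
  df1 = None
  df2 = None

  return combined

def aggregate_values(df):
  columns = ['col_c', 'col_d']
  new_df = df.groupby(columns, as_index=False) \
    .agg({'col_a': 'sum', 'col_b': 'mean'})

  # Observation 4: return an explicit copy so the result is a
  # brand-new DataFrame, not anything tied to the original's internals.
  return new_df.copy()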

Please post an answer here, letting us know how it went.

CodePudding user response:

I found the problem and it turned out to be a good one.

One of the columns in the DataFrame returned by create_df() has dtype category. I'm not sure how this came about, as it is meant to be a string column. This column is used in the grouper for the aggregation. By writing the DataFrame to the local file system and then reading it back in, the column was re-interpreted by Pandas as string.
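
The dtype can also be checked and fixed in place instead of round-tripping through CSV; a minimal sketch (assuming the offending column is col_c, which I haven't named above):

# Inspect the dtypes of the grouping columns; one of them shows up
# as 'category' instead of the expected string/object dtype.
print(df.dtypes[['col_c', 'col_d']])

# Cast the offending column back to string explicitly.
df['col_c'] = df['col_c'].astype(str)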

With the correct data type, the aggregation runs and returns quickly, and there is no memory issue whatsoever. I do not understand why grouping on a categorical column would cause an OOM error; that is a topic for another day.
