How to make a sparse pandas DataFrame from a CSV file


I have a rather large (1.3 GB unzipped) CSV file with 2 dense columns, about 1,400 sparse columns, and roughly 1 million rows.

I need to make a pandas.DataFrame from it.

For small files I can simply do:

df = pd.read_csv('file.csv')

For the large file I have now, I get a memory error, clearly due to the DataFrame's size (checked with sys.getsizeof(df)).

Based on this document:

https://pandas.pydata.org/pandas-docs/stable/user_guide/sparse.html#migrating

it looks like I can make a DataFrame with mixed dense and sparse columns.

However, I can only find instructions for converting individual columns to sparse, not for reading a whole block of them straight from the CSV file.
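For reference, the pattern described there boils down to converting a frame that is already dense in memory, something like:

import numpy as np
import pandas as pd

df = pd.read_csv('file.csv')  # dense load -- the very step that fails for the big file

# convert the (numeric) columns to the sparse dtype after the fact
df = df.astype(pd.SparseDtype("float", np.nan))

i.e. the conversion happens only after the dense frame exists, which doesn't help here.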

Reading the csv sparse columns one by one and adding them to df using:

for colname_i in names_of_sparse_columns:
    data = pd.read_csv('file.csv', usecols = [colname_i])
    df[colname_i] = pd.arrays.SparseArray(data.values.transpose()[0])

works, and df stays very small, as desired, but the execution time is absurdly long: each iteration re-parses the entire 1.3 GB file just to extract one column, so the file is parsed about 1,400 times in total.

I tried of course:

pd.read_csv(path_to_input_csv, usecols = names_of_sparse_columns, dtype = "Sparse[float]")

but that generates this error:

NotImplementedError: Extension Array: <class 'pandas.core.arrays.sparse.array.SparseArray'> must implement _from_sequence_of_strings in order to be used in parser methods

Any idea how I can do this more efficiently? (The error seems to say that the CSV parser simply cannot build sparse columns directly from strings, so any conversion has to happen after parsing.)

I checked several posts, but they all seem to be after something slightly different from this.


EDIT: adding a small example to clarify

import numpy as np
import pandas as pd
import sys

# Create an unpivoted sparse dataset
lengths = list(np.random.randint(low = 1, high = 5, size = 10000))
cols = []
for l in lengths:
    cols.extend(list(np.random.choice(100, size = l, replace = False)))
rows = np.repeat(np.arange(10000), lengths)
vals = np.repeat(1, sum(lengths))
df_unpivoted = pd.DataFrame({"row" : rows, "col" : cols, "val" : vals})

# Pivot and save to a csv file
df = df_unpivoted.pivot(index = "row", columns = "col", values = "val")
df.to_csv("sparse.csv", index = False)

This file occupies about 1 MB on my PC. By contrast, in memory:

sys.getsizeof(df)
# 8080016

That's about 8 MB. So there is clearly a large increase in size when a dense pd.DataFrame is built from a sparse CSV file (here I created the file from the data frame, but the result is the same as reading the file back with pd.read_csv()).
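(As an aside, sys.getsizeof is only a rough gauge for a DataFrame; pandas' own accounting gives a per-column breakdown:)

df.memory_usage(deep = True)        # bytes per column
df.memory_usage(deep = True).sum()  # total, close to the sys.getsizeof figure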

And this is my point: I cannot use pd.read_csv() to load the whole csv file into memory.
Here it's only 8 MB, which is no problem at all; with the actual 1.3 GB CSV I referred to, the DataFrame grows so large that it exhausts our machine's memory.
I guess it's easy to try that, by replacing 10000 with 1000000 and 100 with 1500 in the above simulation.

If I do instead:

names_of_sparse_columns = df.columns.values

df_sparse = pd.DataFrame()

for colname_i in names_of_sparse_columns:
    data = pd.read_csv('sparse.csv', usecols = [colname_i])
    df_sparse[colname_i] = pd.arrays.SparseArray(data.values.transpose()[0])

The resulting object is much smaller:

sys.getsizeof(df_sparse)
# 416700

In fact, it's even smaller than the file on disk.

And this is my second point: doing this column-by-column addition of sparse columns is very slow.

I was looking for advice on how to make df_sparse from a file like "sparse.csv" faster / more efficiently.
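A natural middle ground (just a sketch; the batch size of 100 is an arbitrary assumption to be tuned to available memory) would be to read the sparse columns in batches, so the file gets parsed once per batch instead of once per column:

# read just the header line to get the column names as they appear in the file
header_cols = pd.read_csv('sparse.csv', nrows = 0).columns

batch_size = 100  # arbitrary; tune to available memory
df_sparse = pd.DataFrame()
for i in range(0, len(header_cols), batch_size):
    cols = list(header_cols[i:i + batch_size])
    # one parse of the file per batch instead of one per column
    data = pd.read_csv('sparse.csv', usecols = cols)
    for c in cols:
        df_sparse[c] = pd.arrays.SparseArray(data[c])

Each batch is dense only transiently, so peak memory stays around one batch's worth of dense data.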

In fact, while I was writing this example, I noticed that:

sys.getsizeof(df_unpivoted)
# 399504

So maybe the solution could be to read the csv file line by line and unpivot it. However, the rest of the processing I need to do would still require writing out a pivoted csv, so I'd be back to square one.


EDIT 2: more information

I should also describe the rest of the processing I need to do.

The real file also contains an ID column; in the simulation above it can be added like this:

df["ID"] = list(np.random.choice(20, df.shape[0]))

I need to summarize how many values exist per ID, per data column:

df.groupby("ID").count()

The unfortunate bit is that the sparse data frame does not support this.
I found a workaround, but it's very inefficient and slow.

If anyone can advise on that aspect, too, it would be useful.

I would have guessed there would be a way to load the sparse part of the csv into some form of sparse array, and make a summary by ID.
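For instance, with the unpivoted representation from the example above, the summary can be computed directly; a sketch (the ids array is a hypothetical stand-in for the real ID column):

import numpy as np
import pandas as pd

# hypothetical stand-in for the real ID column: one ID per row
ids = np.random.choice(20, 10000)

# each row of df_unpivoted is exactly one non-missing cell, so counting
# rows per (ID, col) reproduces df.groupby("ID").count() on the dense frame
long_df = df_unpivoted.assign(ID = ids[df_unpivoted["row"].to_numpy()])
counts = long_df.groupby(["ID", "col"]).size().unstack(fill_value = 0)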

Maybe I'm approaching this completely the wrong way, and that's why I am asking this large competent audience for advice.

CodePudding user response:

The same file.csv should not be read on every iteration; this line of code:

data = pd.read_csv('file.csv', ...)

should be moved ahead of the for-loop.

To iterate through names_of_sparse_columns:

# Read the file once, outside the loop, instead of re-parsing it per column
data = pd.read_csv('file.csv', header = 0)

df = pd.DataFrame()
for colname_i in names_of_sparse_columns:
    dataFromThisSparseColumn = data[colname_i]
    # convert each column to sparse without touching the file again
    df[colname_i] = pd.arrays.SparseArray(dataFromThisSparseColumn)

CodePudding user response:

I don't have the faintest idea why someone would have made a CSV in that format. I would just read it in as chunks and fix the chunks.

# Read in chunks, melt each chunk into a long (unpivoted) dataframe
# that makes sense, and drop the missing cells as we go.
# dense_columns = list of the dense (identifier) column names
data = [c.melt(id_vars = dense_columns, var_name = "Column_label", value_name = "Thing").dropna()
        for c in pd.read_csv('file.csv', iterator = True, chunksize = 100000)]

# Concat the chunks together
data = pd.concat(data, axis = 0)

Change the chunksize and the name of the value column as needed. You could also read in chunks and turn the chunks into a sparse dataframe if needed, but it seems that you'd be better off with a melted dataframe for what you want to do, IMO.
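For completeness, the chunked-to-sparse variant would look something like this (just a sketch; it assumes names_of_sparse_columns holds the sparse columns' names as they appear in the file's header, and that those columns are numeric):

import numpy as np
import pandas as pd

chunks = []
for chunk in pd.read_csv('file.csv', chunksize = 100000):
    # convert just the sparse columns of this chunk; dense columns stay as-is
    chunk[names_of_sparse_columns] = chunk[names_of_sparse_columns].astype(
        pd.SparseDtype("float", np.nan))
    chunks.append(chunk)

# sparse columns stay sparse through the concat, so peak memory is bounded
# by one dense chunk rather than by the whole file
df = pd.concat(chunks, ignore_index = True)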
