Home > OS >  Is there a way to improve the explode function of pandas frame?
Is there a way to improve the explode function of pandas frame?

Time:10-03

I have a csv file with the '|' as the delimiter of some column values. I used the following function to reconstruct the data frame, but it eats a lot of memory and can't continue after a while on my Mac 16G computer.

df = df.assign(tag=df.tag.str.split('|')).explode(tag).drop_duplicates()

The size of my file is 1.1G and around 1.6 million rows. The performance monitor shows that the Python process is consuming over 30G memory and can be hardly continued so I have to kill it. There is a lot of memory swap.

One example of such rows is that:

id  tag         uid         date
a|b|c   label       1003252452  20210929

And I want to reconstruct it as follows:

id  tag         uid         date
a   label       1003252452  20210929
b   label       1003252452  20210929
c   label       1003252452  20210929

There may be multiple columns in the same row that need to be split as above. Is there a more memory efficient way to do this reconstruction of dataframe?

CodePudding user response:

Can you try splitting just the ids column, then using numpy to reconstruct the exploded DataFrame?

# `pop` modifies the DataFrame inplace without making a copy. 
# Make sure you run this line just ONCE.
ids = df.pop('id')

pd.DataFrame(
    np.column_stack([
        ids.str.split('|').explode(), 
        df.values.repeat(ids.str.count(r'\|') 1, axis=0)]), 
    columns=[ids.name, *df])

  id    tag         uid      date
0  a  label  1003252452  20210929
1  b  label  1003252452  20210929
2  c  label  1003252452  20210929

If numpy isn't good enough to get the job done, consider distributing the operation using dask or similar

  • Related