So far, whenever I have a dataframe with a column of lists, such as the following:
'category_id'
[030000, 010403, 010402, 030604, 234440]
[030000, 010405, 010402, 030604, 033450]
[030000, 010403, 010407, 030604, 030600]
[030000, 010403, 010402, 030609, 032600]
and I want to turn that category_id column into one value per row, like this:
'category_id'
030000
010403
010402
030604
234440
030000
010405
010402
030604
033450
I would usually use the following code:
df2 = df.explode('category_id')
But when my data gets really big, like sales data covering an entire month, .explode() becomes extremely slow and I am always worried about running into OOM issues.
Are there any alternative solutions to .explode() that would perform better? I tried to use flatMap(), but I'm stuck on how exactly to turn a dataframe into an RDD and then back into a dataframe that I can use.
Any info would be appreciated.
CodePudding user response:
It might not be the fastest method, but you can simply explode each row of the pandas frame, and combine:
import pandas
df = pandas.DataFrame({"col1":[[12,34,12,34,45,56], [12,14,154,6]], "col2":['a','b']})
# col1 col2
#0 [12, 34, 12, 34, 45, 56] a
#1 [12, 14, 154, 6] b
# df.explode('col1')
# col1 col2
#0 12 a
#0 34 a
#0 12 a
#0 34 a
#0 45 a
#0 56 a
#1 12 b
#1 14 b
#1 154 b
#1 6 b
new_df = pandas.DataFrame()
for i in range(len(df)):
    # explode a single-row slice, then append it to the running result
    df_i = df.iloc[i:i+1].explode('col1')
    new_df = pandas.concat((new_df, df_i))
# new_df
# col1 col2
#0 12 a
#0 34 a
#0 12 a
#0 34 a
#0 45 a
#0 56 a
#1 12 b
#1 14 b
#1 154 b
#1 6 b
To optimize performance, you can step through the dataframe in larger chunks rather than single rows (e.g. df_chunk = df.iloc[start:start + chunk_size]; start += chunk_size; and so on), as sketched below.
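A minimal sketch of that chunked loop (chunk_size is an arbitrary choice here; df and pandas are from the snippet above):
chunk_size = 10_000  # arbitrary; pick a size that fits comfortably in memory
chunks = []
start = 0
while start < len(df):
    # explode one chunk at a time instead of one row at a time
    df_chunk = df.iloc[start:start + chunk_size].explode('col1')
    chunks.append(df_chunk)
    start += chunk_size
new_df = pandas.concat(chunks)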
An alternative approach (probably similar in performance) that avoids explode altogether:
from itertools import product
new_df = pandas.DataFrame()
for i, row in df.iterrows():
    # cartesian product of the row's values; with one list column and scalar
    # columns, this pairs each list element with the scalar values
    p = product(*row.to_list())
    sub_df = pandas.DataFrame.from_records(p, columns=list(df))
    sub_df.index = pandas.Index([i] * len(sub_df))
    new_df = pandas.concat((new_df, sub_df))
I "think" this should generalize but I did not test it.
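For intuition, here is what product yields for the first row of the example frame above: each element of the list column gets paired with the scalar column's value, which is exactly what explode produces. (Note that product iterates strings character by character, which is fine here because col2 holds single characters.)
# using the df and the product import from the snippet above
row = df.iloc[0]                    # col1 = [12, 34, 12, 34, 45, 56], col2 = 'a'
list(product(*row.to_list()))
# [(12, 'a'), (34, 'a'), (12, 'a'), (34, 'a'), (45, 'a'), (56, 'a')]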
CodePudding user response:
A fast and memory-efficient solution is to lean on numpy's .flatten() method. This one line will give you your new dataframe:
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'category_id': np.array([np.array(row) for row in df['category_id']]).flatten()})
To break that down a bit, .flatten() creates a flat array out of nested arrays. But first, you have to convert from list to array, which is where the list comprehension comes in handy. Here's a more self-explanatory version of the same code:
list_of_arrays = [np.array(row) for row in df['category_id']]
array_of_arrays = np.array(list_of_arrays)
flat_array = array_of_arrays.flatten()
df2 = pd.DataFrame({'category_id': flat_array})
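Applied to the category_id column from the question, it would look something like this (assuming the IDs are stored as strings so their leading zeros survive). Note that .flatten() only fully flattens when every inner list has the same length, which is the case in the question's sample data:
df = pd.DataFrame({'category_id': [
    ['030000', '010403', '010402', '030604', '234440'],
    ['030000', '010405', '010402', '030604', '033450'],
]})
# every list has the same length, so np.array builds a 2D array and flatten works
flat = np.array([np.array(row) for row in df['category_id']]).flatten()
df2 = pd.DataFrame({'category_id': flat})
# df2['category_id'] -> 030000, 010403, 010402, 030604, 234440, 030000, 010405, ...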