So far, whenever I have a dataframe with a column of lists, such as the following:
'category_id'
[030000, 010403, 010402, 030604, 234440]
[030000, 010405, 010402, 030604, 033450]
[030000, 010403, 010407, 030604, 030600]
[030000, 010403, 010402, 030609, 032600]
and I want to turn that category_id column into one value per row, like this:
'category_id'
030000
010403
010402
030604
234440
030000
010405
010402
030604
033450
I would usually use the following code:
df2 = df.explode('category_id')
But when my data gets really big, like sales data covering an entire month, .explode() becomes extremely slow and I am always worried about running into OOM issues.
Are there any alternative solutions to .explode() that would perform better? I tried to use flatMap(), but I'm stuck on how exactly to turn a dataframe into an RDD and then back into a dataframe that I can use.
Any info would be appreciated.
CodePudding user response:
It might not be the fastest method, but you can simply explode each row of the pandas frame, and combine:
import pandas
df = pandas.DataFrame({"col1":[[12,34,12,34,45,56], [12,14,154,6]], "col2":['a','b']})
# col1 col2
#0 [12, 34, 12, 34, 45, 56] a
#1 [12, 14, 154, 6] b
# df.explode('col1')
# col1 col2
#0 12 a
#0 34 a
#0 12 a
#0 34 a
#0 45 a
#0 56 a
#1 12 b
#1 14 b
#1 154 b
#1 6 b
new_df = pandas.DataFrame()
for i in range(len(df)):
    # explode a single-row slice, then append it to the running result
    df_i = df.iloc[i:i+1].explode('col1')
    new_df = pandas.concat((new_df, df_i))
# new_df
# col1 col2
#0 12 a
#0 34 a
#0 12 a
#0 34 a
#0 45 a
#0 56 a
#1 12 b
#1 14 b
#1 154 b
#1 6 b
To optimize performance, you can step through the dataframe in larger chunks rather than single rows (e.g. df_chunk = df.iloc[start:start + chunk_size]; start += chunk_size; and so on), as sketched below.
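A minimal sketch of that chunked loop (chunk_size is an arbitrary choice here; df and pandas are from the snippet above):
chunk_size = 10_000  # arbitrary; pick a size that fits comfortably in memory
chunks = []
start = 0
while start < len(df):
    # explode one chunk at a time instead of one row at a time
    df_chunk = df.iloc[start:start + chunk_size].explode('col1')
    chunks.append(df_chunk)
    start += chunk_size
new_df = pandas.concat(chunks)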
An alternative approach (probably similar in performance) that avoids explode altogether:
from itertools import product
new_df = pandas.DataFrame()
for i, row in df.iterrows():
    # cartesian product of the row's values; with one list column and scalar
    # columns, this pairs each list element with the scalar values
    p = product(*row.to_list())
    sub_df = pandas.DataFrame.from_records(p, columns=list(df))
    sub_df.index = pandas.Index([i] * len(sub_df))
    new_df = pandas.concat((new_df, sub_df))
I "think" this should generalize but I did not test it.
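For intuition, here is what product yields for the first row of the example frame above: each element of the list column gets paired with the scalar column's value, which is exactly what explode produces. (Note that product iterates strings character by character, which is fine here because col2 holds single characters.)
# using the df and the product import from the snippet above
row = df.iloc[0]                    # col1 = [12, 34, 12, 34, 45, 56], col2 = 'a'
list(product(*row.to_list()))
# [(12, 'a'), (34, 'a'), (12, 'a'), (34, 'a'), (45, 'a'), (56, 'a')]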
CodePudding user response:
A fast and memory-efficient solution is to lean on numpy's .flatten() method. This one line will give you your new dataframe:
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'category_id': np.array([np.array(row) for row in df['category_id']]).flatten()})
To break that down a bit, .flatten() creates a flat array out of nested arrays. But first, you have to convert from list to array, which is where the list comprehension comes in handy. Here's a more self-explanatory version of the same code:
list_of_arrays = [np.array(row) for row in df['category_id']]
array_of_arrays = np.array(list_of_arrays)
flat_array = array_of_arrays.flatten()
df2 = pd.DataFrame({'category_id': flat_array})
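Applied to the category_id column from the question, it would look something like this (assuming the IDs are stored as strings so their leading zeros survive). Note that .flatten() only fully flattens when every inner list has the same length, which is the case in the question's sample data:
df = pd.DataFrame({'category_id': [
    ['030000', '010403', '010402', '030604', '234440'],
    ['030000', '010405', '010402', '030604', '033450'],
]})
# every list has the same length, so np.array builds a 2D array and flatten works
flat = np.array([np.array(row) for row in df['category_id']]).flatten()
df2 = pd.DataFrame({'category_id': flat})
# df2['category_id'] -> 030000, 010403, 010402, 030604, 234440, 030000, 010405, ...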