I'm having trouble turning a column of lists of lists into separate columns. I have a bad solution that works by working on each row independently and then appending them to each other, but this takes far too long for ~500k rows. Wondering if someone has a better solution.
Here is the input:
>>> import pandas as pd
>>> import numpy as np
>>> pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
| | feat |
|---|---|
| 0 | [[str1, , 3], [str3, , 5], [str4, , 3]] |
| 1 | [[str1, , 4], [str2, , 5]] |
Desired output:
>>> pd.DataFrame({'str1': [3, 4], 'str2': [np.nan,5] , 'str3': [5,np.nan], 'str4': [3,np.nan]})
| | str1 | str2 | str3 | str4 |
|---|---|---|---|---|
| 0 | 3 | NaN | 5 | 3 |
| 1 | 4 | 5 | NaN | NaN |
Update: Solved by @ifly6! Fastest solution by far. For 100k rows and 80 total variables, the total time taken was 8.9 seconds on my machine.
CodePudding user response:
After loading your df, create df1 as follows:
df1 = pd.DataFrame.from_records(df.explode('feat').values.flatten()).replace('', np.nan)
Set the index on df1 from the original data to preserve row markers (passing index=df.explode('feat').index does not work).
df1.index = df.explode('feat').index
(Alternatively, to get to the point where you have separated the lists into columns, you could use df.explode('feat')['feat'].apply(pd.Series). I prefer, however, to avoid apply, so I use the DataFrame constructor instead.)
Reset the index on df1, then set a multi-index (you cannot set the column 0 index directly because it would overwrite the original index):
df1.reset_index().set_index(['index', 0])
Then unstack. You can drop columns that are all NaN by appending .dropna(how='all', axis=1), yielding:
>>> df1.reset_index().set_index(['index', 0]).unstack().dropna(how='all', axis=1)
2
0 str1 str2 str3 str4
index
0 3.0 NaN 5.0 3.0
1 4.0 5.0 NaN NaN
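Putting the steps above together, a minimal end-to-end version of this approach (using the sample frame from the question) might look like:

```python
import pandas as pd
import numpy as np

# sample frame from the question
df = pd.DataFrame({'feat': [[["str1", "", 3], ["str3", "", 5], ["str4", "", 3]],
                            [["str1", "", 4], ["str2", "", 5]]]})

# one row per inner list, rebuilt as columns 0/1/2; empty strings become NaN
exploded = df.explode('feat')
df1 = pd.DataFrame.from_records(exploded.values.flatten()).replace('', np.nan)
df1.index = exploded.index  # preserve the original row markers

# pivot to wide form and drop the all-NaN middle column
out = df1.reset_index().set_index(['index', 0]).unstack().dropna(how='all', axis=1)
print(out)
```

Note that the result carries a two-level column index (the surviving record position, 2, over the string labels); droplevel(0, axis=1) would flatten it if needed.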
CodePudding user response:
Here is one way to do it:
# explode the list to rows
df = df.explode('feat')
# stringify each list, remove the [ ] brackets, and split on ","
df[['col1','col3','col2']] = df['feat'].astype('str').replace(r'[\[\]]', '', regex=True).str.split(',', expand=True)
# use pivot after resetting the index
df = df.reset_index()
df = df.pivot(index='index', columns='col1', values='col2')
df
col1 'str1' 'str2' 'str3' 'str4'
index
0 3 NaN 5 3
1 4 5 NaN NaN
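For reference, a self-contained sketch of this answer (with a raw string for the regex, to avoid invalid-escape warnings on newer Python) applied to the question's sample frame:

```python
import pandas as pd

df = pd.DataFrame({'feat': [[["str1", "", 3], ["str3", "", 5], ["str4", "", 3]],
                            [["str1", "", 4], ["str2", "", 5]]]})

# explode the list to rows
df = df.explode('feat')
# stringify each inner list, strip the brackets, and split on ","
df[['col1', 'col3', 'col2']] = (df['feat'].astype('str')
                                .replace(r'[\[\]]', '', regex=True)
                                .str.split(',', expand=True))
# use pivot after resetting the index
df = df.reset_index()
out = df.pivot(index='index', columns='col1', values='col2')
print(out)
```

Because this goes through string conversion, the column labels keep their quotes (e.g. "'str1'") and the values are strings with leading spaces, as the printed output above shows.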
CodePudding user response:
Convert your nested lists to dictionaries that pd.Series can interpret:
df = df.feat.apply(lambda val: pd.Series({y[0]:y[2] for y in val}))
df = df[df.columns.sort_values()]
print(df)
Output:
str1 str2 str3 str4
0 3.0 NaN 5.0 3.0
1 4.0 5.0 NaN NaN
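Applied to the sample frame from the question, the whole approach fits in a few lines:

```python
import pandas as pd

df = pd.DataFrame({'feat': [[["str1", "", 3], ["str3", "", 5], ["str4", "", 3]],
                            [["str1", "", 4], ["str2", "", 5]]]})

# each row becomes a {name: value} dict, which pd.Series turns into columns;
# pandas aligns the differing keys across rows and fills gaps with NaN
out = df.feat.apply(lambda val: pd.Series({y[0]: y[2] for y in val}))
out = out[out.columns.sort_values()]
print(out)
```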
CodePudding user response:
My solution is a brute-force approach, building the new df1 cell by cell with df1.loc[i, col_name].
import pandas as pd
df= pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
df1 = pd.DataFrame()
for i in range(df.shape[0]):
    for e in df.loc[i, 'feat']:
        df1.loc[i, e[0]] = e[2]
print(df1)
Output (not in column order):
str1 str3 str4 str2
0 3.0 5.0 3.0 NaN
1 4.0 NaN NaN 5.0
And the time taken is:
import timeit
timeit.timeit('''
import pandas as pd
df= pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
df1 = pd.DataFrame()
for i in range(df.shape[0]):
    for e in df.loc[i, 'feat']:
        df1.loc[i, e[0]] = e[2]
''', number=10000)
19.209370899999996
So it took about 20 seconds for 10K runs. I'm curious how the other algorithms perform. Please also run them yourself, since timings vary across machines and datasets. Here they are:
#Answer from @ifly6
import timeit
timeit.timeit('''
import pandas as pd
import numpy as np
df= pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
df1 = pd.DataFrame.from_records(df.explode('feat').values.flatten()).replace('', np.nan)
df1.index = df.explode('feat').index
df1 = df1.reset_index().set_index(['index', 0]).unstack().dropna(how='all', axis=1)
''', number=10000)
48.217678400000295
#Answer from @Naveed
import timeit
timeit.timeit('''
import pandas as pd
df= pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
df = df.explode('feat')
df[['col1','col3','col2']] = df['feat'].astype('str').replace(r'[\[\]]', '', regex=True).str.split(',', expand=True)
df = df.reset_index()
df = df.pivot(index='index', columns='col1', values='col2')
''', number=10000)
34.94540550000056
#Answer from @BeRT2me (it's even faster without rearranging the columns with df = df[df.columns.sort_values()])
import timeit
timeit.timeit('''
import pandas as pd
df= pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
df = df.feat.apply(lambda val: pd.Series({y[0]:y[2] for y in val}))
df = df[df.columns.sort_values()]
''', number=10000)
12.745890199999849