I'm having trouble turning a column of lists of lists into separate columns. I have a bad solution that works by working on each row independently and then appending them to each other, but this takes far too long for ~500k rows. Wondering if someone has a better solution.
Here is the input:
>>> import pandas as pd
>>> import numpy as np
>>> pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
| | feat |
|---|---|
| 0 | [[str1, , 3], [str3, , 5], [str4, , 3]] |
| 1 | [[str1, , 4], [str2, , 5]] |
Desired output:
>>> pd.DataFrame({'str1': [3, 4], 'str2': [np.nan,5] , 'str3': [5,np.nan], 'str4': [3,np.nan]})
| | str1 | str2 | str3 | str4 |
|---|---|---|---|---|
| 0 | 3 | NaN | 5 | 3 |
| 1 | 4 | 5 | NaN | NaN |
Update: Solved by @ifly6! Fastest solution by far. For 100k rows and 80 total variables, the total time taken was 8.9 seconds on my machine.
CodePudding user response:
After loading your df, create df1 as follows:
df1 = pd.DataFrame.from_records(df.explode('feat').values.flatten()).replace('', np.nan)
Set the index on df1 from the original data to preserve row markers (passing index=df.explode('feat').index does not work).
df1.index = df.explode('feat').index
(Alternatively, to get to the point where you have separated the lists into columns, you could use df.explode('feat')['feat'].apply(pd.Series). I prefer, however, to avoid apply, so I use the DataFrame constructor instead.)
Reset the index on df1, then set a multi-index (you cannot set the column 0 index directly because it would overwrite the original index):
df1.reset_index().set_index(['index', 0])
Then unstack. You can drop columns that are all NaN by appending .dropna(how='all', axis=1), yielding:
>>> df1.reset_index().set_index(['index', 0]).unstack().dropna(how='all', axis=1)
2
0 str1 str2 str3 str4
index
0 3.0 NaN 5.0 3.0
1 4.0 5.0 NaN NaN
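Putting the steps above together, a minimal end-to-end version of this approach (using the sample frame from the question) might look like:

```python
import pandas as pd
import numpy as np

# sample frame from the question
df = pd.DataFrame({'feat': [[["str1", "", 3], ["str3", "", 5], ["str4", "", 3]],
                            [["str1", "", 4], ["str2", "", 5]]]})

# one row per inner list, rebuilt as columns 0/1/2; empty strings become NaN
exploded = df.explode('feat')
df1 = pd.DataFrame.from_records(exploded.values.flatten()).replace('', np.nan)
df1.index = exploded.index  # preserve the original row markers

# pivot to wide form and drop the all-NaN middle column
out = df1.reset_index().set_index(['index', 0]).unstack().dropna(how='all', axis=1)
print(out)
```

Note that the result carries a two-level column index (the surviving record position, 2, over the string labels); droplevel(0, axis=1) would flatten it if needed.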
CodePudding user response:
Here is one way to do it:
# explode the list to rows
df = df.explode('feat')
# stringify each list, remove the [ ] brackets, and split on ","
df[['col1','col3','col2']] = df['feat'].astype('str').replace(r'[\[\]]', '', regex=True).str.split(',', expand=True)
# use pivot after resetting the index
df = df.reset_index()
df = df.pivot(index='index', columns='col1', values='col2')
df
col1 'str1' 'str2' 'str3' 'str4'
index
0 3 NaN 5 3
1 4 5 NaN NaN
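For reference, a self-contained sketch of this answer (with a raw string for the regex, to avoid invalid-escape warnings on newer Python) applied to the question's sample frame:

```python
import pandas as pd

df = pd.DataFrame({'feat': [[["str1", "", 3], ["str3", "", 5], ["str4", "", 3]],
                            [["str1", "", 4], ["str2", "", 5]]]})

# explode the list to rows
df = df.explode('feat')
# stringify each inner list, strip the brackets, and split on ","
df[['col1', 'col3', 'col2']] = (df['feat'].astype('str')
                                .replace(r'[\[\]]', '', regex=True)
                                .str.split(',', expand=True))
# use pivot after resetting the index
df = df.reset_index()
out = df.pivot(index='index', columns='col1', values='col2')
print(out)
```

Because this goes through string conversion, the column labels keep their quotes (e.g. "'str1'") and the values are strings with leading spaces, as the printed output above shows.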
CodePudding user response:
Convert your nested lists to dictionaries that pd.Series can interpret:
df = df.feat.apply(lambda val: pd.Series({y[0]:y[2] for y in val}))
df = df[df.columns.sort_values()]
print(df)
Output:
str1 str2 str3 str4
0 3.0 NaN 5.0 3.0
1 4.0 5.0 NaN NaN
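Applied to the sample frame from the question, the whole approach fits in a few lines:

```python
import pandas as pd

df = pd.DataFrame({'feat': [[["str1", "", 3], ["str3", "", 5], ["str4", "", 3]],
                            [["str1", "", 4], ["str2", "", 5]]]})

# each row becomes a {name: value} dict, which pd.Series turns into columns;
# pandas aligns the differing keys across rows and fills gaps with NaN
out = df.feat.apply(lambda val: pd.Series({y[0]: y[2] for y in val}))
out = out[out.columns.sort_values()]
print(out)
```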
CodePudding user response:
My solution is a brute-force approach, building the new df1 cell by cell with df1.loc[i, col_name].
import pandas as pd
df= pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
df1 = pd.DataFrame()
for i in range(df.shape[0]):
    for e in df.loc[i, 'feat']:
        df1.loc[i, e[0]] = e[2]
print(df1)
Output (not in column order):
str1 str3 str4 str2
0 3.0 5.0 3.0 NaN
1 4.0 NaN NaN 5.0
And the time taken is:
import timeit
timeit.timeit('''
import pandas as pd
df= pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
df1 = pd.DataFrame()
for i in range(df.shape[0]):
    for e in df.loc[i, 'feat']:
        df1.loc[i, e[0]] = e[2]
''', number=10000)
19.209370899999996
So it took about 20 seconds for 10K runs. I'm curious how the other algorithms perform. Please also run them yourself, since timings vary across machines and datasets. Here they are:
#Answer from @ifly6
import timeit
timeit.timeit('''
import pandas as pd
import numpy as np
df= pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
df1 = pd.DataFrame.from_records(df.explode('feat').values.flatten()).replace('', np.nan)
df1.index = df.explode('feat').index
df1 = df1.reset_index().set_index(['index', 0]).unstack().dropna(how='all', axis=1)
''', number=10000)
48.217678400000295
#Answer from @Naveed
import timeit
timeit.timeit('''
import pandas as pd
df= pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
df = df.explode('feat')
df[['col1','col3','col2']] = df['feat'].astype('str').replace(r'[\[\]]', '', regex=True).str.split(',', expand=True)
df = df.reset_index()
df = df.pivot(index='index', columns='col1', values='col2')
''', number=10000)
34.94540550000056
#Answer from @BeRT2me (it's even faster without rearranging the columns with df = df[df.columns.sort_values()])
import timeit
timeit.timeit('''
import pandas as pd
df= pd.DataFrame({'feat': [[["str1","", 3], ["str3","", 5], ["str4","", 3]],[["str1","", 4], ["str2","", 5]] ]})
df = df.feat.apply(lambda val: pd.Series({y[0]:y[2] for y in val}))
df = df[df.columns.sort_values()]
''', number=10000)
12.745890199999849