I have the following dataframe, which contains a column of nested tuples:
index nested_tuples
1 (('a',(1,0)),('b',(2,0)),('c',(3,0)))
2 (('a',(5,0)),('d',(6,0)),('e',(7,0)),('f',(8,0)))
3 (('c',(4,0)),('d',(5,0)),('g',(6,0)),('h',(7,0)))
I am trying to unpack the tuples to obtain the following dataframe:
index  a  b  c  d  e  f  g  h
1      1  2  3
2      5        6  7  8
3            4  5        6  7
I.e. for each tuple (char, (num1, num2)), I would like char to be a column and num1 to be the entry. I initially tried all sorts of methods with to_list(), but because the number of mini-tuples and the chars in them differ from row to row, I couldn't use that without losing information. Eventually the only solution I could think of was:
for index, row in df.iterrows():
    tuples = row['nested_tuples']
    if not tuples:
        continue
    for mini_tuple in tuples:
        df.loc[index, mini_tuple[0]] = mini_tuple[1][0]
However, in the actual dataframe the nested tuples are long and the df is large, so iterrows is incredibly slow. Is there a better, vectorised way to do this?
CodePudding user response:
It's probably more efficient to clean the data in vanilla Python before building the DataFrame:
out = pd.DataFrame([{k:v[0] for k,v in tpl} for tpl in df['nested_tuples'].tolist()])
A bit more concisely:
out = pd.DataFrame(map(dict, df['nested_tuples'])).stack().str[0].unstack()
Yet another option, using apply:
out = pd.DataFrame(df['nested_tuples'].apply(lambda x: {k:v[0] for k,v in x}).tolist())
Output:
     a    b    c    d    e    f    g    h
0  1.0  2.0  3.0  NaN  NaN  NaN  NaN  NaN
1  5.0  NaN  NaN  6.0  7.0  8.0  NaN  NaN
2  NaN  NaN  4.0  5.0  NaN  NaN  6.0  7.0
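Note that the output above has a fresh 0-based index and float columns, whereas the question asks for the original index (1 to 3). A minimal self-contained sketch of the first approach that keeps the original index is below. The sample frame is rebuilt from the question, the name records is only illustrative, and the conversion to the nullable Int64 dtype is an optional assumption about what you want (it shows missing cells as <NA> instead of forcing floats):

```python
import pandas as pd

# Sample data mirroring the question (index 1..3).
df = pd.DataFrame(
    {
        "nested_tuples": [
            (("a", (1, 0)), ("b", (2, 0)), ("c", (3, 0))),
            (("a", (5, 0)), ("d", (6, 0)), ("e", (7, 0)), ("f", (8, 0))),
            (("c", (4, 0)), ("d", (5, 0)), ("g", (6, 0)), ("h", (7, 0))),
        ]
    },
    index=[1, 2, 3],
)

# One dict per row, keeping only num1 from each (char, (num1, num2)) pair.
# An empty tuple or None becomes an empty dict, i.e. an all-missing row.
records = [
    {key: nums[0] for key, nums in tpl} if tpl else {}
    for tpl in df["nested_tuples"]
]

# Reuse the original index; Int64 is pandas' nullable integer dtype.
out = pd.DataFrame(records, index=df.index).astype("Int64")
print(out)
```

If you need the new columns next to the existing ones, df.join(out) should line them up on the shared index.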