Unpacking column of nested tuples of different lengths into multiple columns in pandas dataframe


I have the following dataframe, which contains a column of nested tuples:

index  nested_tuples
1      (('a',(1,0)),('b',(2,0)),('c',(3,0)))
2      (('a',(5,0)),('d',(6,0)),('e',(7,0)),('f',(8,0)))
3      (('c',(4,0)),('d',(5,0)),('g',(6,0)),('h',(7,0)))
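(For reproducibility, here is one way to build this example frame; the column name nested_tuples and the index values are taken from the display above.)

import pandas as pd

df = pd.DataFrame(
    {
        "nested_tuples": [
            (("a", (1, 0)), ("b", (2, 0)), ("c", (3, 0))),
            (("a", (5, 0)), ("d", (6, 0)), ("e", (7, 0)), ("f", (8, 0))),
            (("c", (4, 0)), ("d", (5, 0)), ("g", (6, 0)), ("h", (7, 0))),
        ]
    },
    index=pd.Index([1, 2, 3], name="index"),
)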

I am trying to unpack the tuples to obtain the following dataframe:

index  a   b   c   d   e   f   g   h
1      1   2   3
2      5           6   7   8
3              4   5           6   7

I.e. for each tuple (char, (num1, num2)), I would like char to become a column and num1 to be the entry. I initially tried all sorts of approaches with to_list(), but because the number of mini-tuples and the chars in them differ between rows, I couldn't use that without losing information. Eventually the only solution I could come up with was:

for index, row in df.iterrows():
    tuples = row['nested_tuples']
    if not tuples:  # skip rows with no tuples at all
        continue
    for mini_tuple in tuples:
        # mini_tuple is (char, (num1, num2)): write num1 into column char
        df.loc[index, mini_tuple[0]] = mini_tuple[1][0]

However, with my actual dataframe, where the nested tuples are long and the frame itself is large, iterrows is incredibly slow. Is there a better, vectorised way to do this?

CodePudding user response:

It's probably more efficient to clean the data in vanilla Python before building the DataFrame:

out = pd.DataFrame([{k:v[0] for k,v in tpl} for tpl in df['nested_tuples'].tolist()])
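Each row's tuple of pairs is turned into a plain dict mapping char to num1, and the DataFrame constructor aligns the dict keys into columns, leaving NaN wherever a row has no entry for a given letter.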

A bit more concisely:

out = pd.DataFrame(map(dict, df['nested_tuples'])).stack().str[0].unstack()
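Here map(dict, ...) builds one dict of char -> (num1, num2) per row, stack() drops the NaN cells into a long Series of tuples, .str[0] pulls out num1, and unstack() pivots back to one column per letter.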

Yet another option using apply:

out = pd.DataFrame(df['nested_tuples'].apply(lambda x: {k:v[0] for k,v in x}).tolist())
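This is essentially the first option with the per-row dict comprehension pushed into Series.apply; the intermediate result is still a list of dicts handed to the DataFrame constructor.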

Output:

     a    b    c    d    e    f    g    h
0  1.0  2.0  3.0  NaN  NaN  NaN  NaN  NaN
1  5.0  NaN  NaN  6.0  7.0  8.0  NaN  NaN
2  NaN  NaN  4.0  5.0  NaN  NaN  6.0  7.0
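Note that all three options produce a fresh RangeIndex (0, 1, 2). If the original index (1, 2, 3) should be preserved, assigning out.index = df.index afterwards (or passing index=df.index to the constructor) restores it.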