Transform a Dictionary of Tuples in a List into a Pandas Dataframe

Time:09-21

I have a dictionary whose values are lists of tuples, and I would like to convert it to a pandas DataFrame, but I am having a hard time with it.

My data is as below:

{0: [('A1', 0.0037505763997138838),
  ('A2', 0.0036963076240675245),
  ('A3', 0.0035451257931104485),
  ('A4', 0.003501467316849233),
  ('A5', 0.00343229837150675),
  ('A6', 0.0033731723637910062),
  ('A7', 0.0033713118048861465),
  ('A8', 0.003325231288305062),
  ('A9', 0.002885164987475754),
  ('A10', 0.0028834984584371797)],
 1: [('B1', 0.011094831353420088),
  ('B2', 0.009526049091086916),
  ('B3', 0.007002935827927014),
  ('B4', 0.00511673700015512),
  ('B5', 0.004870300921667765),
  ('B6', 0.004496108376557714),
  ('B7', 0.004230892962061271),
  ('B8', 0.004137434850455194),
  ('B9', 0.003958335393193675),
  ('B10', 0.0038285145788315993)]}

and I want to transform it into the following pandas DataFrame:

num   label   probs
0    A1    0.0037505763997138838
0    A2    0.0036963076240675245
0    A3    0.0035451257931104485
0    A4    0.003501467316849233
0    A5    0.00343229837150675
0    A6    0.0033731723637910062
0    A7    0.0033713118048861465
0    A8    0.003325231288305062
0    A9    0.002885164987475754
0    A10   0.0028834984584371797
1    B1    0.011094831353420088
1    B2    0.009526049091086916
1    B3    0.007002935827927014
1    B4    0.00511673700015512
1    B5    0.004870300921667765
1    B6    0.004496108376557714
1    B7    0.004230892962061271
1    B8    0.004137434850455194
1    B9    0.003958335393193675
1    B10   0.0038285145788315993

CodePudding user response:

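You can try the following (assuming data is the name of the dict):

import pandas as pd

df = (pd.Series(data)        # one row per dict key, each holding a list of tuples
        .explode()           # one row per tuple; the key is repeated in the index
        .apply(pd.Series)    # split each (label, prob) tuple into two columns
        .reset_index()       # move the key out of the index into a column
     )

df.columns = ['num', 'label', 'probs']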

Result:

print(df)

    num label     probs
0     0    A1  0.003751
1     0    A2  0.003696
2     0    A3  0.003545
3     0    A4  0.003501
4     0    A5  0.003432
5     0    A6  0.003373
6     0    A7  0.003371
7     0    A8  0.003325
8     0    A9  0.002885
9     0   A10  0.002883
10    1    B1  0.011095
11    1    B2  0.009526
12    1    B3  0.007003
13    1    B4  0.005117
14    1    B5  0.004870
15    1    B6  0.004496
16    1    B7  0.004231
17    1    B8  0.004137
18    1    B9  0.003958
19    1   B10  0.003829
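
To see what the first two steps of the chain do before the tuples are split, here is a minimal sketch on a small made-up sample (the values below are illustrative, not the data from the question):

```python
import pandas as pd

# Small made-up sample with the same shape as the question's dict.
data = {0: [("A1", 0.1), ("A2", 0.2)], 1: [("B1", 0.3), ("B2", 0.4)]}

s = pd.Series(data)       # index: the dict keys; each value is a list of tuples
exploded = s.explode()    # one row per tuple; the key is repeated in the index

print(exploded.index.tolist())
print(exploded.tolist())
```

After explode(), every element is a single (label, prob) tuple, which is why the next step only has to split each tuple into two columns.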

Alternatively, you can pass the exploded tuples to pd.DataFrame() instead of calling .apply(pd.Series), which performs better (thanks to @anky for the suggestion):

s = pd.Series(data).explode()

df = (pd.DataFrame(s.tolist(),columns=['label', 'probs'], index=s.index)
        .rename_axis(index='num')
        .reset_index()
     )

Result:

print(df)

    num label     probs
0     0    A1  0.003751
1     0    A2  0.003696
2     0    A3  0.003545
3     0    A4  0.003501
4     0    A5  0.003432
5     0    A6  0.003373
6     0    A7  0.003371
7     0    A8  0.003325
8     0    A9  0.002885
9     0   A10  0.002883
10    1    B1  0.011095
11    1    B2  0.009526
12    1    B3  0.007003
13    1    B4  0.005117
14    1    B5  0.004870
15    1    B6  0.004496
16    1    B7  0.004231
17    1    B8  0.004137
18    1    B9  0.003958
19    1   B10  0.003829
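
If you want to check the performance difference on your own machine, a rough timing sketch could look like this. The synthetic data and the function names via_apply and via_constructor are made up for illustration, and the timings will vary with the pandas version and data size:

```python
import timeit

import pandas as pd

# Synthetic data (an assumption for this sketch): 20 keys, 50 tuples each.
data = {k: [(f"L{k}_{i}", i / 1000) for i in range(50)] for k in range(20)}


def via_apply(data):
    # First approach: build one Series per tuple with .apply(pd.Series).
    df = pd.Series(data).explode().apply(pd.Series).reset_index()
    df.columns = ["num", "label", "probs"]
    return df


def via_constructor(data):
    # Second approach: hand all tuples to the DataFrame constructor at once.
    s = pd.Series(data).explode()
    return (pd.DataFrame(s.tolist(), columns=["label", "probs"], index=s.index)
              .rename_axis(index="num")
              .reset_index())


t_apply = timeit.timeit(lambda: via_apply(data), number=3)
t_ctor = timeit.timeit(lambda: via_constructor(data), number=3)
print(f"apply(pd.Series): {t_apply:.4f}s  pd.DataFrame(tolist): {t_ctor:.4f}s")
```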

CodePudding user response:

We can use a list comprehension to build a list of triplets (name, label and probs), then create the dataframe directly from that list. Here d is the dict, and the name column corresponds to num in the question:

c = ['name', 'label', 'probs']
pd.DataFrame([(k, *t) for k, v in d.items() for t in v], columns=c)

    name label     probs
0      0    A1  0.003751
1      0    A2  0.003696
2      0    A3  0.003545
3      0    A4  0.003501
4      0    A5  0.003432
5      0    A6  0.003373
6      0    A7  0.003371
7      0    A8  0.003325
8      0    A9  0.002885
9      0   A10  0.002883
10     1    B1  0.011095
11     1    B2  0.009526
12     1    B3  0.007003
13     1    B4  0.005117
14     1    B5  0.004870
15     1    B6  0.004496
16     1    B7  0.004231
17     1    B8  0.004137
18     1    B9  0.003958
19     1   B10  0.003829
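
For a self-contained version of the same idea, here is a small sketch that wraps the comprehension in a helper. The function name dict_to_frame and the sample values are made up for illustration, and the first column is called num to match the question:

```python
import pandas as pd


def dict_to_frame(d):
    # Hypothetical helper: flatten {key: [(label, prob), ...]} into rows
    # of (key, label, prob) and build the DataFrame in one call.
    rows = [(key, *pair) for key, pairs in d.items() for pair in pairs]
    return pd.DataFrame(rows, columns=["num", "label", "probs"])


# Small made-up sample with the same shape as the question's dict.
sample = {0: [("A1", 0.1), ("A2", 0.2)], 1: [("B1", 0.3)]}
df = dict_to_frame(sample)
print(df)
```

Because every row is built in plain Python before pandas is involved, the lists do not need to have the same length.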

CodePudding user response:

You need to restructure your dictionary a little. Here I used itertools.chain to flatten the values into one list of tuples and np.repeat to repeat each key once per tuple (d is the dict):

from itertools import chain
import pandas as pd
import numpy as np
df = (pd.DataFrame(list(chain(*d.values())),
                   columns=['label', 'probs'],
                   index=np.repeat(list(d), list(map(len, d.values()))))
        .rename_axis('num')
        .reset_index()
     )

output:

    num label     probs
0     0    A1  0.003751
1     0    A2  0.003696
2     0    A3  0.003545
3     0    A4  0.003501
4     0    A5  0.003432
...
17    1    B8  0.004137
18    1    B9  0.003958
19    1   B10  0.003829
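
To make the role of np.repeat clearer, here is a minimal sketch on a made-up dict whose lists have different lengths (the sample values are illustrative only):

```python
from itertools import chain

import numpy as np
import pandas as pd

# Made-up sample: key 0 has three tuples, key 1 has one.
d = {0: [("A1", 0.1), ("A2", 0.2), ("A3", 0.3)], 1: [("B1", 0.4)]}

lengths = [len(v) for v in d.values()]    # number of tuples per key
keys = np.repeat(list(d), lengths)        # each key repeated once per tuple

df = (pd.DataFrame(list(chain.from_iterable(d.values())),
                   columns=["label", "probs"],
                   index=keys)
        .rename_axis("num")
        .reset_index()
     )
print(df)
```

chain.from_iterable(d.values()) is equivalent to chain(*d.values()) here; it just avoids unpacking all the lists as arguments.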