I have a dictionary that looks like this:
dict = {'A': {'A': 0, 'B': 1, 'C': 1, 'D': 2, 'E': 2, 'F': 2, 'G': 2, 'H': 2, 'I': 3},
        'B': {'B': 0, 'A': 1, 'K': 1, 'O': 1, 'M': 1, 'Q': 1, 'L': 1, 'Z': 2, 'T': 2},
        'C': {'C': 0, 'R': 1, 'A': 1, 'D': 2, 'F': 2, 'J': 2, 'E': 2, 'Y': 2, 'B': 2},
        'D': {'D': 0, 'F': 1, 'H': 1, 'I': 1, 'E': 1, 'A': 2, 'C': 2, 'S': 2, 'U': 3}}
But in fact it is way bigger (up to 60K keys) and I need a very fast and efficient way to turn this dictionary into a dataframe that looks like this:
person_1 person_2 degree
A A 0
A B 1
A C 1
A D 2
A E 2
A F 2
A G 2
A H 2
A I 3
B B 0
B A 1
B K 1
B O 1
B M 1
B Q 1
B L 1
B Z 2
B T 2
C C 0
C R 1
C A 1
C D 2
C F 2
C J 2
C E 2
C Y 2
C B 2
D D 0
D F 1
D H 1
D I 1
D E 1
D A 2
D C 2
D S 2
D U 3
So basically I need a dataframe where the first two columns come from the outer and inner dictionary keys, and the third column is the value stored under each inner key. What I'm doing right now is converting the dictionary to a df using
df = pd.DataFrame(dict).reset_index()
and then
df = pd.melt(df, 'index').rename(columns={'index': 'hcp_npi',
                                          'variable': 'connected_hcp_npi',
                                          'value': 'degree_of_separation'}).dropna()
And I get the result I need. But the problem with this approach is that once the dictionary exceeds 20K keys, the melt step just takes forever to run. So I'm looking for a faster or more efficient way to go from the initial dictionary to the final dataframe.
Thanks!
CodePudding user response:
It looks like it's faster to pre-process the dictionary into the column values:
from collections import defaultdict

d2 = defaultdict(list)
for k, v in d.items():
    d2['person_1'] += [k] * len(v)
    d2['person_2'] += list(v.keys())
    d2['degree'] += list(v.values())
df = pd.DataFrame(d2)
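If the repeated list concatenation ever shows up in profiling, the same three columns can also be built with itertools.chain, which flattens everything in a single pass. A sketch on a small stand-in dictionary (not benchmarked here):

```python
from itertools import chain, repeat

import pandas as pd

# small stand-in for the question's dictionary
d = {'A': {'A': 0, 'B': 1, 'C': 1},
     'B': {'B': 0, 'A': 1, 'K': 1}}

df = pd.DataFrame({
    # repeat each outer key once per inner key, then flatten
    'person_1': list(chain.from_iterable(repeat(k, len(v)) for k, v in d.items())),
    'person_2': list(chain.from_iterable(d.values())),  # iterating a dict yields its keys
    'degree': list(chain.from_iterable(v.values() for v in d.values())),
})
```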
I tested your method, @jezrael's, @BENY's (now deleted), and mine using timeit, with code like this (replacing the stmt as appropriate):
timeit.timeit(setup='''
import pandas as pd
d = {'A': {'A': 0, 'B': 1, 'C': 1, 'D': 2, 'E': 2, 'F': 2, 'G': 2, 'H': 2, 'I': 3},
     'B': {'B': 0, 'A': 1, 'K': 1, 'O': 1, 'M': 1, 'Q': 1, 'L': 1, 'Z': 2, 'T': 2},
     'C': {'C': 0, 'R': 1, 'A': 1, 'D': 2, 'F': 2, 'J': 2, 'E': 2, 'Y': 2, 'B': 2},
     'D': {'D': 0, 'F': 1, 'H': 1, 'I': 1, 'E': 1, 'A': 2, 'C': 2, 'S': 2, 'U': 3}}
''',
stmt='''
df = pd.DataFrame(d)
df = pd.melt(df).rename(columns={'index': 'hcp_npi',
                                 'variable': 'connected_hcp_npi',
                                 'value': 'degree_of_separation'}).dropna()
''',
number=1000)
For 1000 iterations, the results were:
Nick 0.2878
jezrael 0.3178
BENY 2.2822
TomasCB 2.2774
For reference, I include @BENY's answer here:
pd.concat({x : pd.Series(y) for x , y in d.items()}).reset_index()
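Note that reset_index on that concat result produces generic level_0 / level_1 column names (and 0 for the value column), so matching the target schema needs an extra rename. A minimal sketch on a cut-down dictionary:

```python
import pandas as pd

# cut-down stand-in for the question's dictionary
d = {'A': {'A': 0, 'B': 1},
     'B': {'B': 0, 'A': 1}}

df = (pd.concat({x: pd.Series(y) for x, y in d.items()})  # MultiIndex: (outer, inner)
        .reset_index()
        .rename(columns={'level_0': 'person_1', 'level_1': 'person_2', 0: 'degree'}))
```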
CodePudding user response:
Use a list comprehension to flatten the values into a list of tuples, then pass it to the DataFrame constructor:
df = pd.DataFrame([(k, k1, v1) for k, v in d.items() for k1, v1 in v.items()],
columns=['person_1','person_2','degree'])
print (df)
person_1 person_2 degree
0 A A 0
1 A B 1
2 A C 1
3 A D 2
4 A E 2
5 A F 2
6 A G 2
7 A H 2
8 A I 3
9 B B 0
10 B A 1
11 B K 1
12 B O 1
13 B M 1
14 B Q 1
15 B L 1
16 B Z 2
17 B T 2
18 C C 0
19 C R 1
20 C A 1
21 C D 2
22 C F 2
23 C J 2
24 C E 2
25 C Y 2
26 C B 2
27 D D 0
28 D F 1
29 D H 1
30 D I 1
31 D E 1
32 D A 2
33 D C 2
34 D S 2
35 D U 3
Performance for 50k unique combinations of outer and inner keys:
np.random.seed(20)
from collections import defaultdict

df = pd.DataFrame({'a': np.random.randint(100, size=50000),
                   'b': np.random.randint(500000000, size=50000),
                   'c': np.random.randint(20, size=50000)}).sort_values(['a','b']).drop_duplicates(['a','b'])
df[['a','b']] = 'a' + df[['a','b']].astype(str)
s = df.set_index(['a','b'])['c']
print (s)
a b
a0 a87201 6
a1392645 12
a1428929 9
a2453757 8
a2467544 14
..
a99 a496111083 15
a496689150 6
a497489331 3
a499160652 15
a499359212 0
Name: c, Length: 50000, dtype: int32
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
In [210]: %%timeit
     ...: d2 = defaultdict(list)
     ...: for k, v in d.items():
     ...:     d2['person_1'] += [k] * len(v)
     ...:     d2['person_2'] += list(v.keys())
     ...:     d2['degree'] += list(v.values())
     ...:
     ...: pd.DataFrame(d2)
     ...:
     ...:
21.2 ms ± 478 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [211]: %%timeit
...: pd.DataFrame([(k, k1, v1) for k, v in d.items() for k1, v1 in v.items()], columns=['person_1','person_2','degree'])
...:
...:
25.8 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [212]: %%timeit
...: pd.concat({x : pd.Series(y) for x , y in d.items()}).reset_index()
...:
110 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [213]: %%timeit
...: df = pd.DataFrame(d)
...: pd.melt(df).rename(columns = {'index': 'hcp_npi', 'variable':'connected_hcp_npi', 'value': 'degree_of_separation'}).dropna()
...:
1.17 s ± 7.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)