I have a dictionary that looks like this:
dict = {'A': {'A': 0, 'B': 1, 'C': 1, 'D': 2, 'E': 2, 'F': 2, 'G': 2, 'H': 2, 'I': 3},
        'B': {'B': 0, 'A': 1, 'K': 1, 'O': 1, 'M': 1, 'Q': 1, 'L': 1, 'Z': 2, 'T': 2},
        'C': {'C': 0, 'R': 1, 'A': 1, 'D': 2, 'F': 2, 'J': 2, 'E': 2, 'Y': 2, 'B': 2},
        'D': {'D': 0, 'F': 1, 'H': 1, 'I': 1, 'E': 1, 'A': 2, 'C': 2, 'S': 2, 'U': 3}}
But in fact it is way bigger (up to 60K keys) and I need a very fast and efficient way to turn this dictionary into a dataframe that looks like this:
person_1 person_2 degree
A A 0
A B 1
A C 1
A D 2
A E 2
A F 2
A G 2
A H 2
A I 3
B B 0
B A 1
B K 1
B O 1
B M 1
B Q 1
B L 1
B Z 2
B T 2
C C 0
C R 1
C A 1
C D 2
C F 2
C J 2
C E 2
C Y 2
C B 2
D D 0
D F 1
D H 1
D I 1
D E 1
D A 2
D C 2
D S 2
D U 3
So basically I need a dataframe where the first two columns come from the outer and inner dictionary keys, and the third column is the value stored under each inner key. What I'm doing right now is converting the dictionary to a df using
df = pd.DataFrame(dict).reset_index()
and then
df = pd.melt(df, 'index').rename(columns={'index': 'hcp_npi',
                                          'variable': 'connected_hcp_npi',
                                          'value': 'degree_of_separation'}).dropna()
And I get the result I need. But the problem with this approach is that once the dictionary exceeds 20K keys, the melt step just takes forever to run. So I'm looking for a faster or more efficient way to go from the initial dictionary to the final dataframe.
Thanks!
CodePudding user response:
It looks like it's faster to pre-process the dictionary into the column values:
from collections import defaultdict

d2 = defaultdict(list)
for k, v in d.items():
    d2['person_1'] += [k] * len(v)
    d2['person_2'] += list(v.keys())
    d2['degree'] += list(v.values())
df = pd.DataFrame(d2)
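If the repeated list concatenation ever shows up in profiling, the same three columns can also be built with itertools.chain, which flattens everything in a single pass. A sketch on a small stand-in dictionary (not benchmarked here):

```python
from itertools import chain, repeat

import pandas as pd

# small stand-in for the question's dictionary
d = {'A': {'A': 0, 'B': 1, 'C': 1},
     'B': {'B': 0, 'A': 1, 'K': 1}}

df = pd.DataFrame({
    # repeat each outer key once per inner key, then flatten
    'person_1': list(chain.from_iterable(repeat(k, len(v)) for k, v in d.items())),
    'person_2': list(chain.from_iterable(d.values())),  # iterating a dict yields its keys
    'degree': list(chain.from_iterable(v.values() for v in d.values())),
})
```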
I tested your method, @jezrael's, @BENY's (now deleted), and mine using timeit, with code like this (replacing the stmt as appropriate):
timeit.timeit(setup='''
import pandas as pd
d = {'A': {'A': 0, 'B': 1, 'C': 1, 'D': 2, 'E': 2, 'F': 2, 'G': 2, 'H': 2, 'I': 3},
     'B': {'B': 0, 'A': 1, 'K': 1, 'O': 1, 'M': 1, 'Q': 1, 'L': 1, 'Z': 2, 'T': 2},
     'C': {'C': 0, 'R': 1, 'A': 1, 'D': 2, 'F': 2, 'J': 2, 'E': 2, 'Y': 2, 'B': 2},
     'D': {'D': 0, 'F': 1, 'H': 1, 'I': 1, 'E': 1, 'A': 2, 'C': 2, 'S': 2, 'U': 3}}
''',
stmt='''
df = pd.DataFrame(d)
df = pd.melt(df).rename(columns={'index': 'hcp_npi',
                                 'variable': 'connected_hcp_npi',
                                 'value': 'degree_of_separation'}).dropna()
''',
number=1000)
For 1000 iterations, the results were:
Nick 0.2878
jezrael 0.3178
BENY 2.2822
TomasCB 2.2774
For reference, I include @BENY's answer here:
pd.concat({x : pd.Series(y) for x , y in d.items()}).reset_index()
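Note that reset_index on that concat result produces generic level_0 / level_1 column names (and 0 for the value column), so matching the target schema needs an extra rename. A minimal sketch on a cut-down dictionary:

```python
import pandas as pd

# cut-down stand-in for the question's dictionary
d = {'A': {'A': 0, 'B': 1},
     'B': {'B': 0, 'A': 1}}

df = (pd.concat({x: pd.Series(y) for x, y in d.items()})  # MultiIndex: (outer, inner)
        .reset_index()
        .rename(columns={'level_0': 'person_1', 'level_1': 'person_2', 0: 'degree'}))
```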
CodePudding user response:
Use a list comprehension to flatten the values into a list of tuples, then pass it to the DataFrame constructor:
df = pd.DataFrame([(k, k1, v1) for k, v in d.items() for k1, v1 in v.items()],
columns=['person_1','person_2','degree'])
print (df)
person_1 person_2 degree
0 A A 0
1 A B 1
2 A C 1
3 A D 2
4 A E 2
5 A F 2
6 A G 2
7 A H 2
8 A I 3
9 B B 0
10 B A 1
11 B K 1
12 B O 1
13 B M 1
14 B Q 1
15 B L 1
16 B Z 2
17 B T 2
18 C C 0
19 C R 1
20 C A 1
21 C D 2
22 C F 2
23 C J 2
24 C E 2
25 C Y 2
26 C B 2
27 D D 0
28 D F 1
29 D H 1
30 D I 1
31 D E 1
32 D A 2
33 D C 2
34 D S 2
35 D U 3
Performance for 50k unique combinations of outer and inner keys:
np.random.seed(20)
from collections import defaultdict

df = pd.DataFrame({'a': np.random.randint(100, size=50000),
                   'b': np.random.randint(500000000, size=50000),
                   'c': np.random.randint(20, size=50000)}).sort_values(['a','b']).drop_duplicates(['a','b'])
df[['a','b']] = 'a' + df[['a','b']].astype(str)
s = df.set_index(['a','b'])['c']
print (s)
a b
a0 a87201 6
a1392645 12
a1428929 9
a2453757 8
a2467544 14
..
a99 a496111083 15
a496689150 6
a497489331 3
a499160652 15
a499359212 0
Name: c, Length: 50000, dtype: int32
d = {level: s.xs(level).to_dict() for level in s.index.levels[0]}
In [210]: %%timeit
     ...: d2 = defaultdict(list)
     ...: for k, v in d.items():
     ...:     d2['person_1'] += [k] * len(v)
     ...:     d2['person_2'] += list(v.keys())
     ...:     d2['degree'] += list(v.values())
     ...:
     ...: pd.DataFrame(d2)
     ...:
     ...:
21.2 ms ± 478 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [211]: %%timeit
...: pd.DataFrame([(k, k1, v1) for k, v in d.items() for k1, v1 in v.items()], columns=['person_1','person_2','degree'])
...:
...:
25.8 ms ± 119 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [212]: %%timeit
...: pd.concat({x : pd.Series(y) for x , y in d.items()}).reset_index()
...:
110 ms ± 3.68 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
In [213]: %%timeit
...: df = pd.DataFrame(d)
...: pd.melt(df).rename(columns = {'index': 'hcp_npi', 'variable':'connected_hcp_npi', 'value': 'degree_of_separation'}).dropna()
...:
1.17 s ± 7.81 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)