Fastest way to map a dict on a df (multiple columns)-CodePudding

I have the following data frame:

df = pd.DataFrame({0: {0: 'EFG',
  1: 'EFG',
  2: 'EFG',
  3: 'EFG',
  4: 'EFG',
  5: 'EFG',
  6: 'EFG',
  7: 'ABC',
  8: 'EFG',
  9: 'EFG',
  10: 'EFG',
  11: 'EFG',
  12: 'EFG',
  13: 'EFG',
  14: 'EFG',
  15: 'EFG',
  16: 'EFG',
  17: 'EFG',
  18: 'EFG',
  19: 'EFG',
  20: 'ABC',
  21: 'EFG',
  22: 'EFG',
  23: 'EFG',
  24: 'EFG',
  25: 'EFG',
  26: 'EFG',
  27: 'EFG',
  28: 'EFG',
  29: 'EFG'},
 1: {0: 'DS',
  1: 'DS',
  2: 'DS',
  3: 'Q',
  4: 'DS',
  5: 'DS',
  6: 'DS',
  7: 'DS',
  8: 'DS',
  9: 'DS',
  10: 'DS',
  11: 'DS',
  12: 'DS',
  13: 'DS',
  14: 'DS',
  15: 'DS',
  16: 'DS',
  17: 'DS',
  18: 'DS',
  19: 'DS',
  20: 'DS',
  21: 'DS',
  22: 'DAS',
  23: 'DAS',
  24: 'DAS',
  25: 'DS',
  26: 'DS',
  27: 'Q',
  28: 'DS',
  29: 'DS'},
 2: {0: '321',
  1: '900',
  2: '900',
  3: '900',
  4: '1000',
  5: '1000',
  6: '1000',
  7: '444',
  8: '900',
  9: '900',
  10: '321',
  11: '900',
  12: '1000',
  13: '900',
  14: '321',
  15: '321',
  16: '1000',
  17: '1000',
  18: '1000',
  19: '1000',
  20: '444',
  21: '900',
  22: '12345',
  23: '12345',
  24: '321',
  25: '321',
  26: '12345',
  27: '1000',
  28: '900',
  29: '321'}})

and the following dict:

{('ABC', 'AS', '1000'): 123,
 ('ABC', 'AS', '444'): 321,
 ('ABC', 'AS', '231341'): 421,
 ('ABC', 'AS', '888'): 412,
 ('ABC', 'AS', '087'): 4215,
 ('ABC', 'DAS', '1000'): 3415,
 ('ABC', 'DAS', '444'): 4215,
 ('ABC', 'DAS', '231341'): 3214,
 ('ABC', 'DAS', '888'): 321,
 ('ABC', 'DAS', '087'): 111,
 ('ABC', 'Q', '1000'): 222,
 ('ABC', 'Q', '444'): 3214,
 ('ABC', 'Q', '231341'): 421,
 ('ABC', 'Q', '888'): 321,
 ('ABC', 'Q', '087'): 41,
 ('ABC', 'DS', '1000'): 421,
 ('ABC', 'DS', '444'): 421,
 ('ABC', 'DS', '231341'): 321,
 ('ABC', 'DS', '888'): 41,
 ('ABC', 'DS', '087'): 41,
 ('EFG', 'AS', '1000'): 213,
 ('EFG', 'AS', '900'): 32,
 ('EFG', 'AS', '12345'): 1,
 ('EFG', 'AS', '321'): 3,
 ('EFG', 'DAS', '1000'): 421,
 ('EFG', 'DAS', '900'): 321,
 ('EFG', 'DAS', '12345'): 123,
 ('EFG', 'DAS', '321'): 31,
 ('EFG', 'Q', '1000'): 41,
 ('EFG', 'Q', '900'): 51,
 ('EFG', 'Q', '12345'): 321,
 ('EFG', 'Q', '321'): 321,
 ('EFG', 'DS', '1000'): 41,
 ('EFG', 'DS', '900'): 51,
 ('EFG', 'DS', '12345'): 321,
 ('EFG', 'DS', '321'): 1}

This is of course only a sample of my df, and I have multiple dicts like this one. I am looking for the fastest way to map this dict to this dataframe, based on the 3 columns. I need to run this multiple times during ,y analysis, so I am looking for the optimal solution in term of running time.

what i tried already:

def map_d(a,b,c):
    return d1[(a,b,c)]
res = [map_d(*a) for a in tuple(zip(df[0], df[1], df[2]))]
23.1 µs ± 335 ns per loop (mean ± std. dev. of 7 runs, 10000 loops each)

performance on real data:
    170 ms ± 5.47 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)



res = df.apply(lambda x: d1[(x[0],x[1],x[2])],axis=1)
742 µs ± 16.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
performance on real data:
    7.27 s ± 201 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

I am looking for fastest solutions (I can build the dict differently if needed) Thanks

CodePudding user response：

For improve performance in larger DataFrames is possible use Series with DataFrame.join:

df = pd.DataFrame(df)

df1 = df.join(pd.Series(d, name='new'), on=[0,1,2])
print (df1.head(10))
     0   1     2  new
0  EFG  DS   321    1
1  EFG  DS   900   51
2  EFG  DS   900   51
3  EFG   Q   900   51
4  EFG  DS  1000   41
5  EFG  DS  1000   41
6  EFG  DS  1000   41
7  ABC  DS   444  421
8  EFG  DS   900   51
9  EFG  DS   900   51

Another idea (similar solution like in question):

 res = [d[(a,b,c)] for a,b,c in zip(df[0], df[1], df[2])]

Or:

 res = [d[(a,b,c)] for a,b,c in df[[0,1,2]].to_numpy()]

CodePudding user response：

Convert your dataframe as MultiIndex and extract values from your dict (Series):

out = pd.Series(d).loc[pd.MultiIndex.from_frame(df)] \
        .rename('values').reset_index()

Output:

>>> out
      0    1      2  values
0   EFG   DS    321       1
1   EFG   DS    900      51
2   EFG   DS    900      51
3   EFG    Q    900      51
4   EFG   DS   1000      41
5   EFG   DS   1000      41
6   EFG   DS   1000      41
7   ABC   DS    444     421
8   EFG   DS    900      51
9   EFG   DS    900      51
10  EFG   DS    321       1
11  EFG   DS    900      51
12  EFG   DS   1000      41
13  EFG   DS    900      51
14  EFG   DS    321       1
15  EFG   DS    321       1
16  EFG   DS   1000      41
17  EFG   DS   1000      41
18  EFG   DS   1000      41
19  EFG   DS   1000      41
20  ABC   DS    444     421
21  EFG   DS    900      51
22  EFG  DAS  12345     123
23  EFG  DAS  12345     123
24  EFG  DAS    321      31
25  EFG   DS    321       1
26  EFG   DS  12345     321
27  EFG    Q   1000      41
28  EFG   DS    900      51
29  EFG   DS    321       1

CodePudding user response：

You can apply tuple constructor for each row and then map the dictionary. Note that your df is not a DataFrame. It needs to be cast into a DataFrame constructor.

df = pd.DataFrame(df)
df['vals'] = df.apply(tuple, axis=1).map(dct)

Output:

      0    1      2  vals
0   EFG   DS    321     1
1   EFG   DS    900    51
2   EFG   DS    900    51
3   EFG    Q    900    51
4   EFG   DS   1000    41
...
25  EFG   DS    321     1
26  EFG   DS  12345   321
27  EFG    Q   1000    41
28  EFG   DS    900    51
29  EFG   DS    321     1