Home > Software engineering >  Fill NaN in second level of multi indexed pandas data frame
Fill NaN in second level of multi indexed pandas data frame

Time:11-20

I do have a multiindexed pandas data frame with sensor data like this:

                        high1       low1       high2        low2         offset
timestamp   channel                 
2021-01-01  A        966.6100   965.0300    967.7900    965.0300      27.307721
            B          1.4105     1.3900      1.4105      1.3900    2078.353670
2021-01-02  A        965.0300   966.4700    966.4800    965.0000      35.402437
            B          1.3900     1.3890      1.4028      1.3890     726.717821
2021-01-03  A        966.4600   966.0100    967.6800    965.4200      19.896296
            B             NaN        NaN         NaN         NaN            NaN
2021-01-04  A        966.6300   967.0000    967.0000    966.0300      12.958161
            B          1.4139     1.4140      1.4140      1.4139     692.804577
2021-01-05  A        967.0000   967.2000    967.2000    967.0000      10.345234
            B             NaN        NaN         NaN         NaN            NaN
2021-01-06  A        967.2000   967.0000    967.2500    967.0000       7.026761
            B          1.4140     1.4182      1.4182      1.4140     604.725766

Now I would like to replace the NaN with the preceding data point in the same column for that index (A, B). I know about pandas.fillna (method='ffill'), but I do not understand the concept of accessing and setting the matching sub table.

  • df.xs ('B', level='channel') does return the correct data, but as a copy and assignment of the corrected data does not seem to be possible that way.
  • df.loc[('01/01/2021',)] can be used to return a reference to a sub table, but this does not work for the second level index, like df.loc[(,'B')].

Is there a way to achieve this is a pandas-like way ?

Thank you in advance !

CodePudding user response:

The short answer is that you are probably looking for

df.loc[(slice(None), 'B'), :] = df.loc[(slice(None), 'B'), :].fillna(method='ffill')

The long answer is as follows.

In a lot of cases when Pandas returns a copy of the original dataset it is possible to "write it back" using the same indexer. df.loc-indexing on hierarchical indexes is done with tuples, e.g.

df.loc[(first_level_slice, second_level_slice, ...), :]

Although it is not possible to use the df.loc[:] shorthand notation inside a tuple, all uses of colon : can be replaced with slice(None):

df.loc[(:, 'B'), :]            # bad: syntax error
df.loc[(slice(None), 'B'), :]  # good

Of course, it requires knowing and remembering the slice(None) notation. There's an alternative notation made possible by pd.IndexSlice helper which converts the shorthand notations into Python objects:

>>> pd.IndexSlice[:, 'B'] == (slice(None), 'B')
True

And the documentation very conveniently recommends introducing a shorter alias, so that the short answer can be made even shorter:

idx = pd.IndexSlice
df.loc[idx[:, 'B'], :] = df.loc[idx[:, 'B'], :].fillna(method='ffill')

And just to verify that it works, let's try it:

In [1]: data = [
   ...: {"timestamp": "2021-01-01", "channel": "A", "high1": 966.6100, "low1": 965.0300, "high2": 967.7900, "low2": 965.0300, "offset": 27.307721},
   ...: {"timestamp": "2021-01-01", "channel": "B", "high1": 1.4105, "low1": 1.3900, "high2": 1.4105, "low2": 1.3900, "offset": 2078.353670},
   ...: {"timestamp": "2021-01-02", "channel": "A", "high1": 965.0300, "low1": 966.4700, "high2": 966.4800, "low2": 965.0000, "offset": 35.402437},
   ...: {"timestamp": "2021-01-02", "channel": "B", "high1": 1.3900, "low1": 1.3890, "high2": 1.4028, "low2": 1.3890, "offset": 726.717821},
   ...: {"timestamp": "2021-01-03", "channel": "A", "high1": 966.4600, "low1": 966.0100, "high2": 967.6800, "low2": 965.4200, "offset": 19.896296},
   ...: {"timestamp": "2021-01-03", "channel": "B"},
   ...: {"timestamp": "2021-01-04", "channel": "A", "high1": 966.6300, "low1": 967.0000, "high2": 967.0000, "low2": 966.0300, "offset": 12.958161},
   ...: {"timestamp": "2021-01-04", "channel": "B", "high1": 1.4139, "low1": 1.4140, "high2": 1.4140, "low2": 1.4139, "offset": 692.804577},
   ...: {"timestamp": "2021-01-05", "channel": "A", "high1": 967.0000, "low1": 967.2000, "high2": 967.2000, "low2": 967.0000, "offset": 10.345234},
   ...: {"timestamp": "2021-01-05", "channel": "B"},
   ...: {"timestamp": "2021-01-06", "channel": "A", "high1": 967.2000, "low1": 967.0000, "high2": 967.2500, "low2": 967.0000, "offset": 7.026761},
   ...: {"timestamp": "2021-01-06", "channel": "B", "high1": 1.4140, "low1": 1.4182, "high2": 1.4182, "low2": 1.4140, "offset": 604.725766},
   ...: ]

In [2]: import pandas as pd

In [3]: df = pd.DataFrame.from_records(data).set_index(keys=['timestamp', 'channel'])

In [4]: df
Out[4]: 
                       high1      low1     high2      low2       offset
timestamp  channel                                                     
2021-01-01 A        966.6100  965.0300  967.7900  965.0300    27.307721
           B          1.4105    1.3900    1.4105    1.3900  2078.353670
2021-01-02 A        965.0300  966.4700  966.4800  965.0000    35.402437
           B          1.3900    1.3890    1.4028    1.3890   726.717821
2021-01-03 A        966.4600  966.0100  967.6800  965.4200    19.896296
           B             NaN       NaN       NaN       NaN          NaN
2021-01-04 A        966.6300  967.0000  967.0000  966.0300    12.958161
           B          1.4139    1.4140    1.4140    1.4139   692.804577
2021-01-05 A        967.0000  967.2000  967.2000  967.0000    10.345234
           B             NaN       NaN       NaN       NaN          NaN
2021-01-06 A        967.2000  967.0000  967.2500  967.0000     7.026761
           B          1.4140    1.4182    1.4182    1.4140   604.725766

In [4]: df.loc[(slice(None), 'B'), :]
Out[4]: 
                     high1    low1   high2    low2       offset
timestamp  channel                                             
2021-01-01 B        1.4105  1.3900  1.4105  1.3900  2078.353670
2021-01-02 B        1.3900  1.3890  1.4028  1.3890   726.717821
2021-01-03 B           NaN     NaN     NaN     NaN          NaN
2021-01-04 B        1.4139  1.4140  1.4140  1.4139   692.804577
2021-01-05 B           NaN     NaN     NaN     NaN          NaN
2021-01-06 B        1.4140  1.4182  1.4182  1.4140   604.725766

In [5]: idx = pd.IndexSlice

In [6]: df.loc[idx[:, 'B'], :]
Out[6]: 
                     high1    low1   high2    low2       offset
timestamp  channel                                             
2021-01-01 B        1.4105  1.3900  1.4105  1.3900  2078.353670
2021-01-02 B        1.3900  1.3890  1.4028  1.3890   726.717821
2021-01-03 B           NaN     NaN     NaN     NaN          NaN
2021-01-04 B        1.4139  1.4140  1.4140  1.4139   692.804577
2021-01-05 B           NaN     NaN     NaN     NaN          NaN
2021-01-06 B        1.4140  1.4182  1.4182  1.4140   604.725766

In [7]: df.loc[idx[:, 'B'], :].fillna(method='ffill')
Out[7]: 
                     high1    low1   high2    low2       offset
timestamp  channel                                             
2021-01-01 B        1.4105  1.3900  1.4105  1.3900  2078.353670
2021-01-02 B        1.3900  1.3890  1.4028  1.3890   726.717821
2021-01-03 B        1.3900  1.3890  1.4028  1.3890   726.717821
2021-01-04 B        1.4139  1.4140  1.4140  1.4139   692.804577
2021-01-05 B        1.4139  1.4140  1.4140  1.4139   692.804577
2021-01-06 B        1.4140  1.4182  1.4182  1.4140   604.725766

In [8]: df.loc[idx[:, 'B'], :] = df.loc[idx[:, 'B'], :].fillna(method='ffill')

In [9]: df
Out[9]: 
                       high1      low1     high2      low2       offset
timestamp  channel                                                     
2021-01-01 A        966.6100  965.0300  967.7900  965.0300    27.307721
           B          1.4105    1.3900    1.4105    1.3900  2078.353670
2021-01-02 A        965.0300  966.4700  966.4800  965.0000    35.402437
           B          1.3900    1.3890    1.4028    1.3890   726.717821
2021-01-03 A        966.4600  966.0100  967.6800  965.4200    19.896296
           B          1.3900    1.3890    1.4028    1.3890   726.717821
2021-01-04 A        966.6300  967.0000  967.0000  966.0300    12.958161
           B          1.4139    1.4140    1.4140    1.4139   692.804577
2021-01-05 A        967.0000  967.2000  967.2000  967.0000    10.345234
           B          1.4139    1.4140    1.4140    1.4139   692.804577
2021-01-06 A        967.2000  967.0000  967.2500  967.0000     7.026761
           B          1.4140    1.4182    1.4182    1.4140   604.725766

  • Related