I do have a multiindexed pandas data frame with sensor data like this:
high1 low1 high2 low2 offset
timestamp channel
2021-01-01 A 966.6100 965.0300 967.7900 965.0300 27.307721
B 1.4105 1.3900 1.4105 1.3900 2078.353670
2021-01-02 A 965.0300 966.4700 966.4800 965.0000 35.402437
B 1.3900 1.3890 1.4028 1.3890 726.717821
2021-01-03 A 966.4600 966.0100 967.6800 965.4200 19.896296
B NaN NaN NaN NaN NaN
2021-01-04 A 966.6300 967.0000 967.0000 966.0300 12.958161
B 1.4139 1.4140 1.4140 1.4139 692.804577
2021-01-05 A 967.0000 967.2000 967.2000 967.0000 10.345234
B NaN NaN NaN NaN NaN
2021-01-06 A 967.2000 967.0000 967.2500 967.0000 7.026761
B 1.4140 1.4182 1.4182 1.4140 604.725766
Now I would like to replace the NaN
with the preceding data point in the same column for that index (A, B). I know about pandas.fillna (method='ffill')
, but I do not understand the concept of accessing and setting the matching sub table.
df.xs ('B', level='channel')
does return the correct data, but as a copy and assignment of the corrected data does not seem to be possible that way.df.loc[('01/01/2021',)]
can be used to return a reference to a sub table, but this does not work for the second level index, likedf.loc[(,'B')]
.
Is there a way to achieve this is a pandas-like way ?
Thank you in advance !
CodePudding user response:
The short answer is that you are probably looking for
df.loc[(slice(None), 'B'), :] = df.loc[(slice(None), 'B'), :].fillna(method='ffill')
The long answer is as follows.
In a lot of cases when Pandas returns a copy of the original dataset it is possible to "write it back" using the same indexer. df.loc
-indexing on hierarchical indexes is done with tuples, e.g.
df.loc[(first_level_slice, second_level_slice, ...), :]
Although it is not possible to use the df.loc[:]
shorthand notation inside a tuple, all uses of colon :
can be replaced with slice(None)
:
df.loc[(:, 'B'), :] # bad: syntax error
df.loc[(slice(None), 'B'), :] # good
Of course, it requires knowing and remembering the slice(None)
notation. There's an alternative notation made possible by pd.IndexSlice
helper which converts the shorthand notations into Python objects:
>>> pd.IndexSlice[:, 'B'] == (slice(None), 'B')
True
And the documentation very conveniently recommends introducing a shorter alias, so that the short answer can be made even shorter:
idx = pd.IndexSlice
df.loc[idx[:, 'B'], :] = df.loc[idx[:, 'B'], :].fillna(method='ffill')
And just to verify that it works, let's try it:
In [1]: data = [
...: {"timestamp": "2021-01-01", "channel": "A", "high1": 966.6100, "low1": 965.0300, "high2": 967.7900, "low2": 965.0300, "offset": 27.307721},
...: {"timestamp": "2021-01-01", "channel": "B", "high1": 1.4105, "low1": 1.3900, "high2": 1.4105, "low2": 1.3900, "offset": 2078.353670},
...: {"timestamp": "2021-01-02", "channel": "A", "high1": 965.0300, "low1": 966.4700, "high2": 966.4800, "low2": 965.0000, "offset": 35.402437},
...: {"timestamp": "2021-01-02", "channel": "B", "high1": 1.3900, "low1": 1.3890, "high2": 1.4028, "low2": 1.3890, "offset": 726.717821},
...: {"timestamp": "2021-01-03", "channel": "A", "high1": 966.4600, "low1": 966.0100, "high2": 967.6800, "low2": 965.4200, "offset": 19.896296},
...: {"timestamp": "2021-01-03", "channel": "B"},
...: {"timestamp": "2021-01-04", "channel": "A", "high1": 966.6300, "low1": 967.0000, "high2": 967.0000, "low2": 966.0300, "offset": 12.958161},
...: {"timestamp": "2021-01-04", "channel": "B", "high1": 1.4139, "low1": 1.4140, "high2": 1.4140, "low2": 1.4139, "offset": 692.804577},
...: {"timestamp": "2021-01-05", "channel": "A", "high1": 967.0000, "low1": 967.2000, "high2": 967.2000, "low2": 967.0000, "offset": 10.345234},
...: {"timestamp": "2021-01-05", "channel": "B"},
...: {"timestamp": "2021-01-06", "channel": "A", "high1": 967.2000, "low1": 967.0000, "high2": 967.2500, "low2": 967.0000, "offset": 7.026761},
...: {"timestamp": "2021-01-06", "channel": "B", "high1": 1.4140, "low1": 1.4182, "high2": 1.4182, "low2": 1.4140, "offset": 604.725766},
...: ]
In [2]: import pandas as pd
In [3]: df = pd.DataFrame.from_records(data).set_index(keys=['timestamp', 'channel'])
In [4]: df
Out[4]:
high1 low1 high2 low2 offset
timestamp channel
2021-01-01 A 966.6100 965.0300 967.7900 965.0300 27.307721
B 1.4105 1.3900 1.4105 1.3900 2078.353670
2021-01-02 A 965.0300 966.4700 966.4800 965.0000 35.402437
B 1.3900 1.3890 1.4028 1.3890 726.717821
2021-01-03 A 966.4600 966.0100 967.6800 965.4200 19.896296
B NaN NaN NaN NaN NaN
2021-01-04 A 966.6300 967.0000 967.0000 966.0300 12.958161
B 1.4139 1.4140 1.4140 1.4139 692.804577
2021-01-05 A 967.0000 967.2000 967.2000 967.0000 10.345234
B NaN NaN NaN NaN NaN
2021-01-06 A 967.2000 967.0000 967.2500 967.0000 7.026761
B 1.4140 1.4182 1.4182 1.4140 604.725766
In [4]: df.loc[(slice(None), 'B'), :]
Out[4]:
high1 low1 high2 low2 offset
timestamp channel
2021-01-01 B 1.4105 1.3900 1.4105 1.3900 2078.353670
2021-01-02 B 1.3900 1.3890 1.4028 1.3890 726.717821
2021-01-03 B NaN NaN NaN NaN NaN
2021-01-04 B 1.4139 1.4140 1.4140 1.4139 692.804577
2021-01-05 B NaN NaN NaN NaN NaN
2021-01-06 B 1.4140 1.4182 1.4182 1.4140 604.725766
In [5]: idx = pd.IndexSlice
In [6]: df.loc[idx[:, 'B'], :]
Out[6]:
high1 low1 high2 low2 offset
timestamp channel
2021-01-01 B 1.4105 1.3900 1.4105 1.3900 2078.353670
2021-01-02 B 1.3900 1.3890 1.4028 1.3890 726.717821
2021-01-03 B NaN NaN NaN NaN NaN
2021-01-04 B 1.4139 1.4140 1.4140 1.4139 692.804577
2021-01-05 B NaN NaN NaN NaN NaN
2021-01-06 B 1.4140 1.4182 1.4182 1.4140 604.725766
In [7]: df.loc[idx[:, 'B'], :].fillna(method='ffill')
Out[7]:
high1 low1 high2 low2 offset
timestamp channel
2021-01-01 B 1.4105 1.3900 1.4105 1.3900 2078.353670
2021-01-02 B 1.3900 1.3890 1.4028 1.3890 726.717821
2021-01-03 B 1.3900 1.3890 1.4028 1.3890 726.717821
2021-01-04 B 1.4139 1.4140 1.4140 1.4139 692.804577
2021-01-05 B 1.4139 1.4140 1.4140 1.4139 692.804577
2021-01-06 B 1.4140 1.4182 1.4182 1.4140 604.725766
In [8]: df.loc[idx[:, 'B'], :] = df.loc[idx[:, 'B'], :].fillna(method='ffill')
In [9]: df
Out[9]:
high1 low1 high2 low2 offset
timestamp channel
2021-01-01 A 966.6100 965.0300 967.7900 965.0300 27.307721
B 1.4105 1.3900 1.4105 1.3900 2078.353670
2021-01-02 A 965.0300 966.4700 966.4800 965.0000 35.402437
B 1.3900 1.3890 1.4028 1.3890 726.717821
2021-01-03 A 966.4600 966.0100 967.6800 965.4200 19.896296
B 1.3900 1.3890 1.4028 1.3890 726.717821
2021-01-04 A 966.6300 967.0000 967.0000 966.0300 12.958161
B 1.4139 1.4140 1.4140 1.4139 692.804577
2021-01-05 A 967.0000 967.2000 967.2000 967.0000 10.345234
B 1.4139 1.4140 1.4140 1.4139 692.804577
2021-01-06 A 967.2000 967.0000 967.2500 967.0000 7.026761
B 1.4140 1.4182 1.4182 1.4140 604.725766