Best way to downsample (reduce sample rate) non time series data in Pandas-CodePudding

I have a dataframe that contains data collected every 0.01m down into the earth. Due to its high resolution the resulting size of the dataset is very large. Is there a way in pandas to downsample to 5m intervals thus reducing the size of the dataset?

RESULT (Every 0.01m)

Depth_m	value
1.34	31.7
1.35	31.7
1.36	31.7
1.37	31.9
1.38	31.9
1.39	31.9
1.40	31.9
....	.....
44.35	32.9
44.36	32.9
44.37	32.9

OUTCOME I WANT (Every 5m)

Depth_m	value
5.47	31.7
10.49	31.7
15.51	31.7
20.53	31.9
25.55	31.9
30.57	31.9
35.59	31.9
40.61	31.9
45.63	31.9

I have tried to use pandas.resample but that seems to only work with timeseries data. I think I understand what I must do but not sure how to do it in pandas. Basically I am thinking I need to calculate what the current sampling rate is, in this case 0.01m. Then how many observations are there every 5m. Then I can average the values based on the number of observations and drop the rows. Loop through this process every 5m.

CodePudding user response：

There is no resample for integer values. As a workaround, you could round the Depth to the nearest 5 and use groupby to get the average Value for every 5m depth:

>>> df.groupby(df["Depth_m"].apply(lambda x: 5*round(x/5)))["Value"].mean()
Depth_m
0     34.256410
5     34.274549
10    34.564870
15    34.653307
20    34.630739
25    34.517034
30    34.584830
35    34.581162
40    34.620758
45    34.390374
Name: Value, dtype: float64

Input df:

import numpy as np
np.random.seed(100)

df = pd.DataFrame({"Depth_m": [i/100 for i in range(134, 4438)],
                   "Value": np.random.randint(30, 40, size=4304)})

CodePudding user response：

You can use Panda's .iloc for selection by position coupled with a slice object to downsample. Some care must be taken to ensure you have integer step sizes and not floats when converting from non-integer sample intervals (hence the use of astype("int")).

import numpy as np
import pandas as pd

sequence_interval = 0.01
downsampled_interval = 5
step_size = np.round(downsampled_interval / sequence_interval).astype("int")

df = pd.DataFrame(
    {
        "Depth_m": np.arange(131, 4438) / 100,
        "value": np.random.random(size=4307),
    }
)

downsampled_df = df.iloc[::step_size, :]
print(downsampled_df)

The result is

      Depth_m     value
0        1.31  0.357536
500      6.31  0.384327
1000    11.31  0.302109
1500    16.31  0.200971
2000    21.31  0.689973
2500    26.31  0.712869
3000    31.31  0.776306
3500    36.31  0.221901
4000    41.31  0.661378