Home > Enterprise >  Best way to downsample (reduce sample rate) non time series data in Pandas
Best way to downsample (reduce sample rate) non time series data in Pandas

Time:09-30

I have a dataframe that contains data collected every 0.01m down into the earth. Due to its high resolution the resulting size of the dataset is very large. Is there a way in pandas to downsample to 5m intervals thus reducing the size of the dataset?

RESULT (Every 0.01m)

Depth_m value
1.34 31.7
1.35 31.7
1.36 31.7
1.37 31.9
1.38 31.9
1.39 31.9
1.40 31.9
.... .....
44.35 32.9
44.36 32.9
44.37 32.9

OUTCOME I WANT (Every 5m)

Depth_m value
5.47 31.7
10.49 31.7
15.51 31.7
20.53 31.9
25.55 31.9
30.57 31.9
35.59 31.9
40.61 31.9
45.63 31.9

I have tried to use pandas.resample but that seems to only work with timeseries data. I think I understand what I must do but not sure how to do it in pandas. Basically I am thinking I need to calculate what the current sampling rate is, in this case 0.01m. Then how many observations are there every 5m. Then I can average the values based on the number of observations and drop the rows. Loop through this process every 5m.

CodePudding user response:

There is no resample for integer values. As a workaround, you could round the Depth to the nearest 5 and use groupby to get the average Value for every 5m depth:

>>> df.groupby(df["Depth_m"].apply(lambda x: 5*round(x/5)))["Value"].mean()
Depth_m
0     34.256410
5     34.274549
10    34.564870
15    34.653307
20    34.630739
25    34.517034
30    34.584830
35    34.581162
40    34.620758
45    34.390374
Name: Value, dtype: float64
Input df:
import numpy as np
np.random.seed(100)

df = pd.DataFrame({"Depth_m": [i/100 for i in range(134, 4438)],
                   "Value": np.random.randint(30, 40, size=4304)})

CodePudding user response:

You can use Panda's .iloc for selection by position coupled with a slice object to downsample. Some care must be taken to ensure you have integer step sizes and not floats when converting from non-integer sample intervals (hence the use of astype("int")).

import numpy as np
import pandas as pd

sequence_interval = 0.01
downsampled_interval = 5
step_size = np.round(downsampled_interval / sequence_interval).astype("int")

df = pd.DataFrame(
    {
        "Depth_m": np.arange(131, 4438) / 100,
        "value": np.random.random(size=4307),
    }
)

downsampled_df = df.iloc[::step_size, :]
print(downsampled_df)

The result is

      Depth_m     value
0        1.31  0.357536
500      6.31  0.384327
1000    11.31  0.302109
1500    16.31  0.200971
2000    21.31  0.689973
2500    26.31  0.712869
3000    31.31  0.776306
3500    36.31  0.221901
4000    41.31  0.661378
  • Related