I have a dataframe that contains data collected every 0.01m down into the earth. Due to its high resolution the resulting size of the dataset is very large. Is there a way in pandas to downsample to 5m intervals thus reducing the size of the dataset?
RESULT (Every 0.01m)
Depth_m | value |
---|---|
1.34 | 31.7 |
1.35 | 31.7 |
1.36 | 31.7 |
1.37 | 31.9 |
1.38 | 31.9 |
1.39 | 31.9 |
1.40 | 31.9 |
.... | ..... |
44.35 | 32.9 |
44.36 | 32.9 |
44.37 | 32.9 |
OUTCOME I WANT (Every 5m)
Depth_m | value |
---|---|
5.47 | 31.7 |
10.49 | 31.7 |
15.51 | 31.7 |
20.53 | 31.9 |
25.55 | 31.9 |
30.57 | 31.9 |
35.59 | 31.9 |
40.61 | 31.9 |
45.63 | 31.9 |
I have tried to use pandas.resample
but that seems to only work with timeseries data. I think I understand what I must do but not sure how to do it in pandas. Basically I am thinking I need to calculate what the current sampling rate is, in this case 0.01m. Then how many observations are there every 5m. Then I can average the values based on the number of observations and drop the rows. Loop through this process every 5m.
CodePudding user response:
There is no resample
for integer values. As a workaround, you could round the Depth to the nearest 5 and use groupby
to get the average Value for every 5m depth:
>>> df.groupby(df["Depth_m"].apply(lambda x: 5*round(x/5)))["Value"].mean()
Depth_m
0 34.256410
5 34.274549
10 34.564870
15 34.653307
20 34.630739
25 34.517034
30 34.584830
35 34.581162
40 34.620758
45 34.390374
Name: Value, dtype: float64
Input df:
import numpy as np
np.random.seed(100)
df = pd.DataFrame({"Depth_m": [i/100 for i in range(134, 4438)],
"Value": np.random.randint(30, 40, size=4304)})
CodePudding user response:
You can use Panda's .iloc
for selection by position coupled with a slice object to downsample. Some care must be taken to ensure you have integer step sizes and not floats when converting from non-integer sample intervals (hence the use of astype("int")
).
import numpy as np
import pandas as pd
sequence_interval = 0.01
downsampled_interval = 5
step_size = np.round(downsampled_interval / sequence_interval).astype("int")
df = pd.DataFrame(
{
"Depth_m": np.arange(131, 4438) / 100,
"value": np.random.random(size=4307),
}
)
downsampled_df = df.iloc[::step_size, :]
print(downsampled_df)
The result is
Depth_m value
0 1.31 0.357536
500 6.31 0.384327
1000 11.31 0.302109
1500 16.31 0.200971
2000 21.31 0.689973
2500 26.31 0.712869
3000 31.31 0.776306
3500 36.31 0.221901
4000 41.31 0.661378