'Oversampling' cartesian data in a dataframe without for loop?

Time:09-09

I have 3D data in a pandas dataframe that I would like to 'oversample'/smooth by replacing the value at each x,y point with the average value of all the points that are within 5 units of that point. I can do it with a for loop like this (starting with a dataframe with three columns X, Y, Z):

import pandas as pd

Z_OS = []
X_OS = []
Y_OS = []
for index, row in df.iterrows():
    # mean Z of all points within 5 units of this row's X and Y
    Z_OS += [df[(df['X'] > row['X']-5) & (df['X'] < row['X']+5) & (df['Y'] > row['Y']-5) & (df['Y'] < row['Y']+5)]['Z'].mean()]
    X_OS += [row['X']]
    Y_OS += [row['Y']]

# named 'data' so the built-in dict is not shadowed
data = {
    'X': X_OS,
    'Y': Y_OS,
    'Z': Z_OS
}
OSdf = pd.DataFrame.from_dict(data)

but this method is very slow for large datasets and feels very 'unpythonic'. How could I do this without for loops? Is it possible via complex use of the groupby function?

CodePudding user response:

df['column_name'].rolling(rolling_window).mean()

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.rolling.html
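
For reference, a minimal sketch of what that call does on a toy Series (the values and the window of 3 are made up for illustration). Note that an integer window counts rows in their current order, not distance in x,y, so by itself this only matches the question when the data is effectively one-dimensional, sorted and evenly spaced:

```python
import pandas as pd

# toy data, made up for illustration
s = pd.Series([1.0, 2.0, 3.0, 4.0, 5.0])

# centered window of 3 rows; min_periods=1 lets the edge rows
# average whatever neighbours are available instead of giving NaN
smoothed = s.rolling(window=3, center=True, min_periods=1).mean()
print(smoothed.tolist())
```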

CodePudding user response:

# column names are lowercase here; the question uses 'X', 'Y', 'Z'
xy = df[['x','y']]
df['smoothed z'] = df[['z']].apply(
    lambda row: df['z'][(xy - xy.loc[row.name]).abs().lt(5).all(axis=1)].mean(),
    axis=1
)
  • Here I used df[['z']] to get the column 'z' as a data frame. We need the index of a row, i.e. row.name, when we apply a function to this column.
  • .abs().lt(5).all(axis=1) reads as "absolute values that are all less than 5 along the row", i.e. both the x and the y difference are below 5.

Update

The code below does the same thing but seems more consistent, as it works directly on the index:

df.index.to_series().apply(lambda i: df.loc[(xy - xy.loc[i]).abs().lt(5).all(axis=1), 'z'].mean())
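
Both versions above still call a Python function once per row. If an n-by-n boolean array fits in memory, the same box average could be sketched with NumPy broadcasting instead. This is only a sketch and not part of the original answer: smooth_box and half_width are made-up names, and it assumes lowercase columns x, y, z with no NaN in z.

```python
import numpy as np
import pandas as pd

def smooth_box(df, half_width=5.0):
    """Mean of z over all points whose x and y are both within half_width."""
    x = df["x"].to_numpy(dtype=float)
    y = df["y"].to_numpy(dtype=float)
    z = df["z"].to_numpy(dtype=float)
    # near[i, j] is True when point j lies inside the box around point i
    near = (np.abs(x[:, None] - x[None, :]) < half_width) & \
           (np.abs(y[:, None] - y[None, :]) < half_width)
    # every row contains the point itself, so the counts are never zero
    return pd.Series(near @ z / near.sum(axis=1), index=df.index)

# toy data, made up for illustration
demo = pd.DataFrame({"x": [0, 1, 10], "y": [0, 1, 10], "z": [1.0, 3.0, 5.0]})
demo["smoothed z"] = smooth_box(demo)
```

Memory grows with the square of the number of rows, so for very large datasets this would need to be processed in chunks or replaced by a spatial index.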