How to forward fill (ffill) specific values with neighboring values

Time:01-04

Say I have a pandas Series like the one below (an empty entry means a missing value). For simplicity I use integer indices here, but in reality it has a DatetimeIndex.

0,
1,5
2,3
3,
4,5
5,
6,30
7,5
8,5
9,31
10,31
11,
12,5
13,5

I want to ffill the value of 5, only if the previous neighbor falls within a specific list of values, e.g., [30, 31, 32]. The output for the above example should be:

0,
1,5
2,3
3,
4,5
5,
6,30
7,30
8,30
9,31
10,31
11,
12,5
13,5

How can I achieve this?

This is a data cleaning task I am struggling with. The goal is to correct weather condition codes that were recorded wrongly because of the preceding events.

CodePudding user response:

Using a mask with a shift, you can achieve this:

# Mask for the value of 5 and if the previous neighbor falls within a specific list of values
mask = (s == 5) & (s.shift().isin([30, 31, 32]))

# Replace each flagged 5 with the value just before it (or with whatever you like)
s = s.mask(mask, s.shift())
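One thing to watch: the shift only looks one position back, so in a run of consecutive 5s only the first one is flagged. A minimal sketch of one way around that, assuming 5 itself is never in the list of neighbor values, is to repeat the same shift-and-mask step until nothing changes. The sample data is copied from the question; the name `allowed` is just for illustration.

```python
import pandas as pd

# Sample data from the question (None marks a missing value)
s = pd.Series([None, 5, 3, None, 5, None, 30, 5, 5, 31, 31, None, 5, 5])
allowed = [30, 31, 32]  # illustrative name for the list of neighbor values

# Each pass fills the first remaining 5 of every qualifying run,
# so the loop runs at most as many times as the longest run of 5s.
while True:
    prev = s.shift()
    mask = (s == 5) & prev.isin(allowed)
    if not mask.any():
        break
    s = s.mask(mask, prev)

print(s)
```

The loop is not fully vectorized, but each pass is, and it stops as soon as no 5 has a listed value directly before it.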

CodePudding user response:

If I've understood you correctly, this could work:

import numpy as np
import pandas as pd

# Create the original Series with missing values represented as None
s = pd.Series([None, 5, 3, None, 5, None, 30, 5, 5, 31, 31, None, 5, 5],
            index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13])

# Define the list of allowed preceding values
allowed_values = [30, 31, 32]

# Create an array of zeros with the same shape as s.values
modified_series = np.zeros_like(s.values)

# Replace all 5s in modified_series with np.nan
modified_series = np.where(s.values == 5, np.nan, modified_series)

# Replace all values in modified_series that are in allowed_values with the corresponding value in s
modified_series = np.where(s.isin(allowed_values), s, modified_series)

# Convert modified_series to a Pandas Series, preserve the original index, and forward fill np.nan values
modified_series = pd.Series(modified_series, index=s.index).ffill()

# Replace a 5 in s with the forward-filled value, but only where that value is in allowed_values
modified_series = np.where((s == 5) & modified_series.isin(allowed_values), modified_series.values, s)

# Convert modified_series to a Pandas Series and preserve the original index
modified_series = pd.Series(modified_series, index=s.index)

# Print the modified Series
print(modified_series)

This should return:

0      NaN
1      5.0
2      3.0
3      NaN
4      5.0
5      NaN
6     30.0
7     30.0
8     30.0
9     31.0
10    31.0
11     NaN
12     5.0
13     5.0

EDIT: Changed to remove for loop and use vectorization instead.
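Since the question mentions that the real data has a DatetimeIndex, here is a rough sketch of the same placeholder-and-ffill idea wrapped in a function and applied to a date-indexed Series. The function name `ffill_value_after`, the `placeholder` argument and the daily timestamps are all made up for illustration, and it assumes the placeholder is not one of the allowed values.

```python
import pandas as pd

def ffill_value_after(s, value, allowed_values, placeholder=0):
    """Replace `value` with the last preceding non-`value` entry,
    but only when that entry is in `allowed_values`.

    Hypothetical helper; `placeholder` must not be in `allowed_values`.
    """
    # Keep allowed values, turn everything else (including NaN) into the placeholder
    carrier = s.where(s.isin(allowed_values), placeholder)
    # Blank out the positions holding `value`, then let the previous entry flow into them
    carrier = carrier.mask(s == value).ffill()
    # Only overwrite where an allowed value actually reached the position
    return s.mask((s == value) & carrier.isin(allowed_values), carrier)

# Hypothetical daily timestamps standing in for the real DatetimeIndex
idx = pd.date_range("2023-01-01", periods=14, freq="D")
s = pd.Series([None, 5, 3, None, 5, None, 30, 5, 5, 31, 31, None, 5, 5], index=idx)

out = ffill_value_after(s, 5, [30, 31, 32])
print(out)
```

Everything here works by position, so the index type should not matter as long as the rows are in chronological order.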

CodePudding user response:

Here's my cumbersome solution without looping (it generates several intermediate columns):

import pandas as pd 
df = pd.DataFrame([
    None, 5, 3, None, 5, None, 30, 5, 5, 31, 
    31, None, 5, 5], columns=['val'])

target_values = [30, 31, 32]
df['target'] = df.val.isin(target_values) # create bool mask

# index each number w/o considering 5
df['seq_idx'] = (df.val!=5).cumsum() 

# tag indexes that contain values that will be replaced
df['to_replace'] = df.groupby(
    'seq_idx')['target'].transform('first')

# get replacement values (first of each 'sequence')
df['replace_val'] = df.groupby(
    'seq_idx')['val'].transform('first') 

# actually replace them
df.loc[df.to_replace, 'val'] = df.loc[df.to_replace, 'replace_val']

And you just need to drop some columns :)
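If you would rather not create and drop helper columns at all, the same grouping idea can probably be written with standalone Series. This is only a sketch of that variant; the names `seq_idx` and `group_head` are illustrative, and the data is the sample from the question.

```python
import pandas as pd

df = pd.DataFrame({'val': [None, 5, 3, None, 5, None, 30, 5, 5, 31, 31, None, 5, 5]})
target_values = [30, 31, 32]

# A new group starts at every entry that is not 5, so each run of 5s
# shares a group with the entry just before it
seq_idx = (df['val'] != 5).cumsum()

# First non-missing value of each group, broadcast to every row of that group
group_head = df['val'].groupby(seq_idx).transform('first')

# Overwrite only the groups whose head is one of the target values
df['val'] = df['val'].mask(group_head.isin(target_values), group_head)

print(df)
```

Because the intermediate results live in local variables, `df` keeps only its `val` column.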
