Say I have a pandas Series like the one below (an empty value means missing). For simplicity I use integer indices here, but in reality the index is a DatetimeIndex.
0,
1,5
2,3
3,
4,5
5,
6,30
7,5
8,5
9,31
10,31
11,
12,5
13,5
I want to overwrite each 5 with the forward-filled preceding value, but only if that preceding neighbor is in a specific list of values, e.g., [30, 31, 32]. The output for the above example should be:
0,
1,5
2,3
3,
4,5
5,
6,30
7,30
8,30
9,31
10,31
11,
12,5
13,5
How can I achieve this?
This is a data cleaning task I am struggling with. The goal is to correct weather condition codes that were recorded wrongly because of the preceding events.
CodePudding user response:
Using a mask with a shift you can handle a 5 that comes directly after one of the listed values (a 5 later in the same run, like index 8 in the example, is not caught by a single shift):
# Mask the 5s whose immediate predecessor is one of the listed values
mask = (s == 5) & (s.shift().isin([30, 31, 32]))
# Replace the masked 5s with the preceding value
s = s.where(~mask, s.shift())
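To cover a whole run of consecutive 5s, the value has to be carried forward from the entry that opens the run. Below is a minimal sketch of one way to do that with a forward fill. It assumes that a missing entry should block the fill (as at indices 11 to 13 in the expected output) and that -1 never occurs as a real code; the placeholder and the names is_five, prev and result are illustrative, not part of the answer above.

```python
import pandas as pd

s = pd.Series([None, 5, 3, None, 5, None, 30, 5, 5, 31, 31, None, 5, 5])
allowed = [30, 31, 32]

is_five = s == 5

# Hypothetical placeholder for missing entries so that they stop the
# forward fill; it must not collide with any real code in the data.
PLACEHOLDER = -1

# For every position: the nearest preceding entry that is not a 5
prev = s.fillna(PLACEHOLDER).mask(is_five).ffill()

# Overwrite a 5 only when the entry that opens its run is an allowed value
result = s.mask(is_five & prev.isin(allowed), prev)
print(result)
```

Because the check is done on the filled values, a 5 that follows a 3 or a missing entry is left alone.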
CodePudding user response:
If I've understood you correctly, this could work:
import numpy as np
import pandas as pd
# Create the original Series with missing values represented as None
s = pd.Series([None, 5, 3, None, 5, None, 30, 5, 5, 31, 31, None, 5, 5],
index=[0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13])
# Define the list of allowed preceding values
allowed_values = [30, 31, 32]
# Create an array of zeros with the same shape as s.values
modified_series = np.zeros_like(s.values)
# Replace all 5s in modified_series with np.nan
modified_series = np.where(s.values == 5, np.nan, modified_series)
# Replace all values in modified_series that are in allowed_values with the corresponding value in s
modified_series = np.where(s.isin(allowed_values), s, modified_series)
# Convert modified_series to a Pandas Series, preserve the original index, and forward fill np.nan values
modified_series = pd.Series(modified_series, index=s.index).ffill()
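# Replace a 5 in s only where the forward-filled value is one of the allowed values
# (without this check, 5s that follow other values would be overwritten with 0)
modified_series = np.where((s == 5) & modified_series.isin(allowed_values), modified_series.values, s)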
# Convert modified_series to a Pandas Series and preserve the original index
modified_series = pd.Series(modified_series, index=s.index)
# Print the modified Series
print(modified_series)
This should return:
0 NaN
1 5.0
2 3.0
3 NaN
4 5.0
5 NaN
6 30.0
7 30.0
8 30.0
9 31.0
10 31.0
11 NaN
12 5.0
13 5.0
EDIT: Changed to remove for loop and use vectorization instead.
CodePudding user response:
Here's my cumbersome solution without looping (it generates several intermediate columns, though):
import pandas as pd
df = pd.DataFrame([
None, 5, 3, None, 5, None, 30, 5, 5, 31,
31, None, 5, 5], columns=['val'])
target_values = [30, 31, 32]
df['target'] = df.val.isin(target_values) # create bool mask
# index each number w/o considering 5
df['seq_idx'] = (df.val!=5).cumsum()
# tag indexes that contain values that will be replaced
df['to_replace'] = df.groupby(
'seq_idx')['target'].transform('first')
# get replacement values (first of each 'sequence')
df['replace_val'] = df.groupby(
'seq_idx')['val'].transform('first')
# actually replace them
df.loc[df.to_replace,'val'] = df.loc[df.to_replace, 'replace_val']
After that you just need to drop the helper columns (target, seq_idx, to_replace, replace_val) :)
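The same grouping idea can be applied directly to a Series, which avoids the helper columns and keeps a DatetimeIndex intact. This is only a sketch: the function name fill_fives, its parameters, and the daily timestamps are made up for illustration.

```python
import pandas as pd

def fill_fives(s, allowed, code=5):
    """Replace runs of `code` that directly follow an allowed value with that value."""
    is_code = s == code
    # Every entry that is not `code` (missing values included) opens a new run
    run_id = (~is_code).cumsum()
    # Value that opens each run, broadcast over the run (NaN if the run opens with a gap)
    run_start = s.where(~is_code).groupby(run_id).transform("first")
    # Overwrite only the runs that open with an allowed value
    return s.mask(is_code & run_start.isin(allowed), run_start)

# Hypothetical daily timestamps standing in for the real DatetimeIndex
idx = pd.date_range("2023-01-01", periods=14, freq="D")
s = pd.Series([None, 5, 3, None, 5, None, 30, 5, 5, 31, 31, None, 5, 5], index=idx)

cleaned = fill_fives(s, [30, 31, 32])
print(cleaned)
```

The original index is preserved because every intermediate Series is aligned on it.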