I'd like to find the start and stop points of a column and flag them as shown below:
value | flag |
---|---|
NaN | NaN |
NaN | NaN |
1 | start |
2 | NaN |
1 | NaN |
3 | NaN |
2 | stop |
NaN | NaN |
1 | start |
2 | stop |
CodePudding user response:
start
occurs when the current value isnotnull
and the previous valueisnull
stop
occurs when the current value isnotnull
and the next valueisnull
Generate these conditions using shift
and assign using loc
:
start = df.value.notnull() & df.value.shift().isnull()
stop = df.value.notnull() & df.value.shift(-1).isnull()
df.loc[start, 'flag'] = 'start'
df.loc[stop, 'flag'] = 'stop'
# value flag
# 0 NaN NaN
# 1 NaN NaN
# 2 1.0 start
# 3 2.0 NaN
# 4 1.0 NaN
# 5 3.0 NaN
# 6 2.0 stop
# 7 NaN NaN
# 8 1.0 start
# 9 2.0 stop
Alternatively assign using mask
:
df['flag'] = df['flag'].mask(start, 'start')
df['flag'] = df['flag'].mask(stop, 'stop')
CodePudding user response:
Here I iterated through the rows and used a flag to indicate if we were starting or not.
start_flag = 0
for index, row in df.iterrows():
if row['val'].isnull():
df.loc[index, 'flag'] = "NaN"
start_flag = 0
else:
if start_flag == 0:
df.loc[index, 'flag'] = "start"
start_flag = 1
else:
if (index < df.shape[0]-1 and df.loc[index 1, 'val'].isnull()) or index == df.shape[0]-1:
df.loc[index, 'flag'] = "stop"
CodePudding user response:
Here's what you need:
# Auxiliar columns to detect start and end
df['Past'] = df['Value'].shift(-1)
df['Future'] = df['Value'].shift(1)
# Auxiliar function to complete new column
def Find_start_stop_Null(row):
flag = np.nan
if ((not pd.isnull(row['Value'])) and (pd.isnull(row['Future']))):
flag = 'start'
elif ((not pd.isnull(row['Value'])) and (pd.isnull(row['Past']))):
flag = 'stop'
return flag
df['flag'] = df.apply(lambda row: Find_start_stop_Null(row), axis=1)
# Drop unnecessary columns
df = df.drop('Past', axis=1)
df = df.drop('Future', axis=1)
CodePudding user response:
df = pd.DataFrame({'value': [np.nan, np.nan, 1, 2, 1, 3, 2, np.nan, 1, 2]}, dtype='object')
df['flag'] = df.value.groupby(df.value.notna().diff().ne(0).cumsum()). \
apply(lambda se: pd.Series(['start'] [np.nan] * (se.size - 2) ['stop'], index=se.index) if se.notna().sum() else se)
print(df)
value flag
0 NaN NaN
1 NaN NaN
2 1 start
3 2 NaN
4 1 NaN
5 3 NaN
6 2 stop
7 NaN NaN
8 1 start
9 2 stop