pandas find start and stop point of non-null values-CodePudding

I'd like to find the start and stop points of a column and flag them as shown below:

value	flag
NaN	NaN
NaN	NaN
1	start
2	NaN
1	NaN
3	NaN
2	stop
NaN	NaN
1	start
2	stop

CodePudding user response：

start occurs when the current value is notnull and the previous value isnull
stop occurs when the current value is notnull and the next value isnull

Generate these conditions using shift and assign using loc:

start = df.value.notnull() & df.value.shift().isnull()
stop = df.value.notnull() & df.value.shift(-1).isnull()

df.loc[start, 'flag'] = 'start'
df.loc[stop, 'flag'] = 'stop'

#    value   flag
# 0    NaN    NaN
# 1    NaN    NaN
# 2    1.0  start
# 3    2.0    NaN
# 4    1.0    NaN
# 5    3.0    NaN
# 6    2.0   stop
# 7    NaN    NaN
# 8    1.0  start
# 9    2.0   stop

Alternatively assign using mask:

df['flag'] = df['flag'].mask(start, 'start')
df['flag'] = df['flag'].mask(stop, 'stop')

CodePudding user response：

Here I iterated through the rows and used a flag to indicate if we were starting or not.

start_flag = 0
for index, row in df.iterrows():
  if row['val'].isnull():
    df.loc[index, 'flag'] = "NaN"
    start_flag = 0
  else:
    if start_flag == 0:
      df.loc[index, 'flag'] = "start"
      start_flag = 1
    else:
      if (index < df.shape[0]-1 and df.loc[index 1, 'val'].isnull()) or index == df.shape[0]-1:
         df.loc[index, 'flag'] = "stop"

CodePudding user response：

Here's what you need:

# Auxiliar columns to detect start and end
df['Past'] = df['Value'].shift(-1)
df['Future'] = df['Value'].shift(1)

# Auxiliar function to complete new column
def Find_start_stop_Null(row):
    flag = np.nan
    if ((not pd.isnull(row['Value'])) and (pd.isnull(row['Future']))):
        flag = 'start'
    elif ((not pd.isnull(row['Value'])) and (pd.isnull(row['Past']))):
        flag = 'stop'
    return flag

df['flag'] = df.apply(lambda row: Find_start_stop_Null(row), axis=1)
# Drop unnecessary columns
df = df.drop('Past', axis=1)
df = df.drop('Future', axis=1)

CodePudding user response：

df = pd.DataFrame({'value': [np.nan, np.nan, 1, 2, 1, 3, 2, np.nan, 1, 2]}, dtype='object')
df['flag'] = df.value.groupby(df.value.notna().diff().ne(0).cumsum()). \
    apply(lambda se: pd.Series(['start']   [np.nan] * (se.size - 2)   ['stop'], index=se.index) if se.notna().sum() else se)
print(df)

  value   flag
0   NaN    NaN
1   NaN    NaN
2     1  start
3     2    NaN
4     1    NaN
5     3    NaN
6     2   stop
7   NaN    NaN
8     1  start
9     2   stop