I have a dataframe with random repeating sequences. I need to set a flag of sorts so that every time I come across the term maintenance, I store the rows between each instance. " store everything between the two maintenance instances." The input is:
NAME PROCESS STATUS
name 0 process 0
name 1 maintenance start
name 2 process 2
name 3 process 3
name 4 process 4
name 5 process 5
name 6 maintenance stop
name 7 process 7
name 8 process 8
name 9 process 9
name 10 maintenance start
name 11 process 11
name 12 process 12
name 13 process 13
name 14 process 14
name 15 maintenance stop
I was thinking about doing a set of logical conditions:
for i in np.arange(len(df)-1):
if df['process'][i] = 'maintenance':
df['new'] = df[i]
but, I am hoping there is a way that pandas can handle this (that I cant find) as I cant seem to figure out the stopping condition.
desired output:
name 2 process 2
name 3 process 3
name 4 process 4
name 5 process 5
name 11 process 11
name 12 process 12
name 13 process 13
name 14 process 14
CodePudding user response:
You can iterate over the rows in data frame and start appending rows to new frame when process="maintenance" and status="start" and stop adding them when status="stop".
import pandas as pd
from io import StringIO
data = """name,process,status
name 0,process 0,
name 1,maintenance,start
name 2,process 2,
name 3,process 3,
name 4,process 4
name 5,process 5
name 6,maintenance,stop
name 7,process 7
name 8,process 8
name 9,process 9
name 10,maintenance,start
name 11,process 11
name 12,process 12
name 13,process 13
name 14,process 14
name 15,maintenance,stop
"""
df = pd.read_csv(StringIO(data)).fillna('')
row_list = []
collect = False
for row in df.itertuples(index=False):
if row.process == "maintenance":
collect = row.status == "start"
elif collect:
row_list.append(row)
df2 = pd.DataFrame.from_records(row_list, columns=df.columns)
print(df2)
Output:
name process status
0 name 2 process 2
1 name 3 process 3
2 name 4 process 4
3 name 5 process 5
4 name 11 process 11
5 name 12 process 12
6 name 13 process 13
7 name 14 process 14
If want to drop the "status" column then call this:
df2 = df2.drop('status', axis=1)