Home > OS >  Parsing Data From Pandas Dataframe
Parsing Data From Pandas Dataframe

Time:08-11

I have a dataframe with random repeating sequences. I need to set a flag of sorts so that every time I come across the term maintenance, I store the rows between each instance. " store everything between the two maintenance instances." The input is:

NAME     PROCESS     STATUS
name 0   process 0  
name 1   maintenance start
name 2   process 2
name 3   process 3
name 4   process 4
name 5   process 5
name 6   maintenance stop
name 7   process 7
name 8   process 8
name 9   process 9 
name 10  maintenance start
name 11  process 11
name 12  process 12
name 13  process 13
name 14  process 14
name 15  maintenance stop

I was thinking about doing a set of logical conditions:

for i in np.arange(len(df)-1):
  if df['process'][i] = 'maintenance':
     df['new'] = df[i]

but, I am hoping there is a way that pandas can handle this (that I cant find) as I cant seem to figure out the stopping condition.

desired output:

name 2   process 2
name 3   process 3
name 4   process 4
name 5   process 5
name 11  process 11
name 12  process 12
name 13  process 13
name 14  process 14

CodePudding user response:

You can iterate over the rows in data frame and start appending rows to new frame when process="maintenance" and status="start" and stop adding them when status="stop".

import pandas as pd
from io import StringIO

data = """name,process,status
name 0,process 0,
name 1,maintenance,start
name 2,process 2,
name 3,process 3,
name 4,process 4
name 5,process 5
name 6,maintenance,stop
name 7,process 7
name 8,process 8
name 9,process 9 
name 10,maintenance,start
name 11,process 11
name 12,process 12
name 13,process 13
name 14,process 14
name 15,maintenance,stop
"""

df = pd.read_csv(StringIO(data)).fillna('')

row_list = []
collect = False
for row in df.itertuples(index=False):
    if row.process == "maintenance":
        collect = row.status == "start"
    elif collect:
        row_list.append(row)

df2 = pd.DataFrame.from_records(row_list, columns=df.columns)
print(df2)

Output:

      name     process status
0   name 2   process 2
1   name 3   process 3
2   name 4   process 4
3   name 5   process 5
4  name 11  process 11
5  name 12  process 12
6  name 13  process 13
7  name 14  process 14

If want to drop the "status" column then call this:

df2 = df2.drop('status', axis=1)
  • Related