detect number of lines in a text file between specific string pattern and categorize based on the ou

I have a text file that I am trying to categorize based on the number of lines between the lines containing 'START' and 'END /'. The input file is structured like this:

  START               
  Action1
  Action2 
  Action3
  END /

  START
  Action1 
  END /

  START                  
  Action1
  Action2
  END /

  START  
  Action0              
  Action1
  Action2 
  Action3
  END /

  START
  Action1 
  END /

The code should count the lines between 'START' and 'END /' and categorize each block as follows: 'P1' if there is only one action line, 'P2' if there is more than one action line.

So the output for the input file shown above would be:

['P2', 'P1', 'P2', 'P2', 'P1']

The end goal is to export this output list into an Excel column (as shown below). I believe this can be done with the pandas library, but any suggestions are appreciated.

Category
P2
P1
P2
P2
P1

Initially I was able to print the line number of every line in the file, so I was thinking of extracting the line numbers. However, that idea has a major flaw, since the number of action lines varies from block to block.

with open('filepath.txt') as f:
    for index, line in enumerate(f):
        print("Line {}: {}".format(index, line.strip()))

Output of the initial, flawed idea:

Line 0: 
Line 1: A
Line 2: Action1
Line 3: Action2
Line 4: Action3
Line 5: B
Line 6: 
Line 7: A
Line 8: Action1
Line 9: B
Line 10: 
Line 11: A
Line 12: Action1
Line 13: Action1
Line 14: B
Line 15: 
Line 16: A
Line 17: Action0
Line 18: Action1
Line 19: Action2
Line 20: Action3
Line 21: B

Then I came up with the idea of detecting the initial (START) and final (END /) patterns, counting the lines in between, and assigning the P1 or P2 category with an if/else statement. I am currently stuck on how to count the lines within the pattern.

Any help with the code would be appreciated, thank you!

CodePudding user response：

If the file data is exactly what you mentioned in your question then the following code should work.

import pandas as pd

result = []
fp = 'your_file.txt'                       # change this

with open(fp) as file:
    file_content = file.read().splitlines()
    count = 0

    # this is the logic you were after:
    for item in file_content:
        if item.strip() == 'START':
            count = 0
        elif item.strip() == 'END /':
            if count <= 1:
                result.append('P1')
            else:
                result.append('P2')
        else:
            count += 1

print(result)

dataframe = pd.DataFrame(result, columns=['Category'])

# Note: pandas needs the openpyxl package installed to write .xlsx files
dataframe.to_excel('excel.xlsx', index=False)
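
If your real file is less tidy than the sample (blank lines inside a block, or stray text before the first START), a slightly more defensive variant may help. This is only a sketch: the function name categorize_blocks and its start/end parameters are made up for illustration, and it assumes a block with zero or one action line should be 'P1', the same as the code above.

```python
def categorize_blocks(lines, start="START", end="END /"):
    """Return 'P1' for blocks with at most one action line, else 'P2'."""
    categories = []
    count = None  # None means we are currently outside a START ... END / block
    for raw in lines:
        line = raw.strip()
        if line == start:
            count = 0                      # a new block begins
        elif line == end and count is not None:
            categories.append("P1" if count <= 1 else "P2")
            count = None                   # block closed
        elif line and count is not None:
            count += 1                     # non-blank line inside a block
    return categories


# Small in-memory sample so the logic can be checked without a file
sample = """
START
Action1
Action2
Action3
END /

START
Action1
END /
""".splitlines()

print(categorize_blocks(sample))  # ['P2', 'P1']
```

For a real file you could pass the open file object directly, e.g. categorize_blocks(open(fp)), and feed the returned list into the same pd.DataFrame(...) call as above.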