detect number of lines in a text file between specific string pattern and categorize based on the ou

Time:06-03

I have a text file that I am trying to categorize based on the number of lines between the lines containing 'START' and 'END /'. Input file structure:

  START               
  Action1
  Action2 
  Action3
  END /

  START
  Action1 
  END /

  START                  
  Action1
  Action2
  END /

  START  
  Action0              
  Action1
  Action2 
  Action3
  END /

  START
  Action1 
  END /
 

The code should count the lines between 'START' and 'END /' and categorize each block as follows: exactly one action line gives 'P1'; more than one action line gives 'P2'.

So the output for the input file shown above would be:

['P2', 'P1', 'P2', 'P2', 'P1']

The end goal is to export this output list into an Excel column (as shown below). I believe this can be done with the pandas library; however, any suggestions will be appreciated.

Category
P2
P1
P2
P2
P1

Initially I was able to print the line number of every line in the file, so I was also thinking of extracting the line numbers. However, this idea has a major flaw, since the number of action lines varies from block to block.

with open('filepath.txt') as f:
    for index, line in enumerate(f):
        print("Line {}: {}".format(index, line.strip()))
            

Output of the initial flawed idea:

Line 0: 
Line 1: A
Line 2: Action1
Line 3: Action2
Line 4: Action3
Line 5: B
Line 6: 
Line 7: A
Line 8: Action1
Line 9: B
Line 10: 
Line 11: A
Line 12: Action1
Line 13: Action1
Line 14: B
Line 15: 
Line 16: A
Line 17: Action0
Line 18: Action1
Line 19: Action2
Line 20: Action3
Line 21: B

Then I came up with the idea of detecting the initial (START) and final (END /) patterns, counting the lines in between, and assigning the P1 or P2 category with an if/else statement. I am currently stuck on implementing a way to count the lines within the pattern.

Any help with the code will be helpful, thank you!

CodePudding user response:

If the file data is exactly as shown in your question, then the following code should work.

import pandas as pd

result = []
fp = 'your_file.txt'                       # change this

with open(fp) as file:
    file_content = file.read().splitlines()
    count = 0

    # this is the logic you were after:
    for item in file_content:
        if item.strip() == 'START':
            count = 0
        elif item.strip() == 'END /':
            if count <= 1:
                result.append('P1')
            else:
                result.append('P2')
        else:
            count += 1                     # one more line seen since the last START

print(result)

dataframe = pd.DataFrame(result, columns=['Category'])

# Note: Pandas module needs openpyxl module installed for this next step
dataframe.to_excel('excel.xlsx', index=False)
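
The same counting logic can also be wrapped in a small function, which makes it easier to try on sample data without touching a file. This is only a sketch: the name categorize_blocks is made up for illustration, and it assumes that blank lines and any lines outside a START ... END / pair should not be counted, which the question does not state explicitly. As in the code above, a block with zero action lines is treated as 'P1'.

```python
def categorize_blocks(lines):
    """Return 'P1' or 'P2' for each START ... END / block in `lines`."""
    categories = []
    count = None  # None means we are currently outside a block

    for raw in lines:
        line = raw.strip()
        if line == 'START':
            count = 0                      # a new block begins
        elif line == 'END /':
            if count is not None:          # ignore an END / without a START
                categories.append('P1' if count <= 1 else 'P2')
            count = None                   # the block is closed
        elif line and count is not None:
            count += 1                     # non-blank line inside a block

    return categories


# Small in-memory sample mirroring the first two blocks of the question
sample = """
  START
  Action1
  Action2
  Action3
  END /

  START
  Action1
  END /
""".splitlines()

print(categorize_blocks(sample))
```

The returned list can be passed to pd.DataFrame(..., columns=['Category']) exactly as in the answer above.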