I have a text file that I am trying to categorize based on the number of lines between each 'START' line and the 'END /' line that follows it. The input file's structure:
START
Action1
Action2
Action3
END /
START
Action1
END /
START
Action1
Action2
END /
START
Action0
Action1
Action2
Action3
END /
START
Action1
END /
The code should count the lines between 'START' and 'END /' and categorize each block as follows: 'P1' if there is only one action line, 'P2' if there is more than one action line.
So the output for the input file shown above would be:
['P2', 'P1', 'P2', 'P2', 'P1']
The end goal is to export this output list into an Excel column (as shown). I believe this can be done with the pandas library, but any suggestions are appreciated.
Category
P2
P1
P2
P2
P1
Initially I was able to print the line number of every line in the file, so I was also thinking of extracting the line numbers. However, that idea has a major flaw, since the number of action lines varies.
with open('filepath.txt') as f:
    for index, line in enumerate(f):
        print("Line {}: {}".format(index, line.strip()))
Output of the initial, flawed idea:
Line 0:
Line 1: A
Line 2: Action1
Line 3: Action2
Line 4: Action3
Line 5: B
Line 6:
Line 7: A
Line 8: Action1
Line 9: B
Line 10:
Line 11: A
Line 12: Action1
Line 13: Action1
Line 14: B
Line 15:
Line 16: A
Line 17: Action0
Line 18: Action1
Line 19: Action2
Line 20: Action3
Line 21: B
Then I came up with the idea of detecting the opening (START) and closing (END /) patterns, counting the lines in between, and assigning the P1 or P2 category with an if/else statement. I am currently stuck on how to count the lines within the pattern.
Any help with the code would be appreciated, thank you!
CodePudding user response:
If the file data is exactly what you mentioned in your question then the following code should work.
import pandas as pd

result = []
fp = 'your_file.txt'  # change this
with open(fp) as file:
    file_content = file.read().splitlines()

count = 0
# this is the logic you were after:
for item in file_content:
    if item.strip() == 'START':
        count = 0  # a new block begins, so reset the counter
    elif item.strip() == 'END /':
        if count <= 1:
            result.append('P1')
        else:
            result.append('P2')
    else:
        count += 1  # count each line between START and END /
print(result)

dataframe = pd.DataFrame(result, columns=['Category'])
# Note: Pandas module needs openpyxl module installed for this next step
dataframe.to_excel('excel.xlsx', index=False)
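If you want to check the counting logic on its own before writing to Excel, it can be pulled into a small function. This is only a sketch: the name `categorize_blocks` is made up for illustration, and it assumes that blank lines and any lines outside a START ... END / pair should be ignored, and that a block with zero action lines counts as 'P1' (the question does not say what should happen in that case).

```python
def categorize_blocks(lines):
    """Return one category per START ... END / block.

    Assumptions (not stated in the question): blank lines and lines
    outside a block are ignored, and an empty block is treated as 'P1'.
    """
    categories = []
    count = None  # None means we are currently outside a block
    for raw in lines:
        line = raw.strip()
        if line == 'START':
            count = 0  # open a new block
        elif line == 'END /':
            if count is not None:
                categories.append('P1' if count <= 1 else 'P2')
            count = None  # close the block
        elif line and count is not None:
            count += 1  # a non-blank action line inside a block
    return categories


# The sample data from the question, as a list of lines
sample = """START
Action1
Action2
Action3
END /
START
Action1
END /
START
Action1
Action2
END /
START
Action0
Action1
Action2
Action3
END /
START
Action1
END /""".splitlines()

print(categorize_blocks(sample))  # ['P2', 'P1', 'P2', 'P2', 'P1']
```

The returned list can be passed to `pd.DataFrame(..., columns=['Category'])` exactly as in the code above.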