I have an unusual use case. I have a txt file in the following format, where every line starts with "APA" and ends with "||" (the lines vary in length and content, which does not matter):
APA lEDGER|5023|124223|STAFF NAME|XYZ|123||
APA lEDGER|5023|124223|STAFF NAME|XYZ|131|12r2gw||
APA lEDGER|5023|124223|STAFF NAME|XYZ|43s|12|123sdfq|prime||
In some cases, however, for unknown reasons, a line is split across several lines like so:
APA lEDGER|5023|hello|
40937 / 903.01
for period: 2021|8|332.48||
Technically, this line should have been:
APA lEDGER|5023|hello|40937 / 903.01 for period: 2021|8|332.48||
The file is really large (16MB), and this is the logic I have come up with.
I read each line into a list of strings and apply the following algorithm:
import re

pattern = re.compile(r"\s*[^APA]")
patternOK = re.compile(r"\s*[APA]")
final_list = []  # list to store the cleaned strings
ptr = 0
for elem in string_list:
    if elem.startswith("APA") and elem.endswith("||"):
        final_list.append(elem)  # add each string with the proper format to the final list
    # Maintain a pointer to the current string and one to the previous string.
    # If the current string does not start with APA, append all following
    # fragments in a while loop and attach them to the previous proper string.
    if ptr < len(string_list):
        if ptr - 1 >= 0:
            prev_el = str(string_list[ptr - 1])
            curr_el = str(string_list[ptr])
            if pattern.match(curr_el.strip()):
                while pattern.match(string_list[ptr]):
                    prev_el = prev_el + '|' + string_list[ptr]
                    ptr = ptr + 1
                    if ptr + 1 > len(string_list):
                        break
                final_list.append(prev_el)  # append the cleaned string to the final list
    ptr = ptr + 1
It mostly works. However, I noticed that some of the results were omitted or were not in insertion order. Please feel free to provide your own logic as well.
In summary, I need a list of strings with the correct format mentioned above.
Thanks
CodePudding user response:
file = open("file.txt", "r")
file_out = open("file_out.txt", "w")
lines = file.readlines()
file.close()
my_line = ""
for line in lines:
    line = line.replace("\n", "")  # drop the newline before testing the ending
    if not line.endswith("||"):
        my_line += line  # incomplete record: keep accumulating
    else:
        my_line += line
        file_out.write(my_line + "\n")  # complete record: write it out
        my_line = ""
file_out.close()
I think this should work.
It checks whether the line ends with "||". If it doesn't, it stores the line and moves on to the next one. If the next one ends with "||", it appends it to the stored text and writes the joined line to the output file; otherwise it keeps accumulating.
I used .replace("\n", "") to remove the potential newline character.
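For a larger file, the same idea can be written as a generator so that the whole file never has to sit in a list. This is only a sketch: the name `merge_records` is made up for illustration, and I am assuming that fragments should be glued together with no separator, as in the code above.

```python
import io

def merge_records(lines, terminator="||"):
    """Yield complete records, gluing lines together until the text ends with the terminator."""
    buffer = ""
    for raw in lines:
        piece = raw.rstrip("\r\n")
        if not piece:
            continue  # skip blank lines
        buffer += piece
        if buffer.endswith(terminator):
            yield buffer
            buffer = ""
    if buffer:
        yield buffer  # trailing incomplete record, emitted unchanged

# io.StringIO stands in for an open file here; any iterable of lines works.
sample = io.StringIO(
    "APA lEDGER|5023|124223|STAFF NAME|XYZ|123||\n"
    "APA lEDGER|5023|hello|\n"
    "40937 / 903.01\n"
    "for period: 2021|8|332.48||\n"
)
records = list(merge_records(sample))
print(records)
```

With a real file you would pass the open file object instead of `sample`, for example `merge_records(open("file.txt"))`.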
CodePudding user response:
It appears that you don't only have || at the end of the line, but also in between columns. If this is a typo, the normal .split() function would be sufficient; however, if it's not, you can use re.split() to split only on || at the end of lines. Afterwards, replace the line breaks in the resulting list elements with spaces and finally join everything back together:
import re
data = """APA lEDGER|5023|124223|STAFF NAME|XYZ|123||
APA lEDGER|5023|124223|STAFF NAME|XYZ|131|12r2gw||
APA lEDGER|5023|124223|STAFF NAME|XYZ|43s|12|123sdfq||prime||
APA lEDGER|5023|hello|
40937 / 903.01
for period: 2021|8|332.48||
"""
r = re.compile(r'\|\|$', re.M)
splits = re.split(r, data)
clean_lines = (line.replace('\n', ' ').strip() for line in splits)
clean_file = '||\n'.join(clean_lines)
print(clean_file.splitlines())
Output:
['APA lEDGER|5023|124223|STAFF NAME|XYZ|123||',
'APA lEDGER|5023|124223|STAFF NAME|XYZ|131|12r2gw||',
'APA lEDGER|5023|124223|STAFF NAME|XYZ|43s|12|123sdfq||prime||',
'APA lEDGER|5023|hello| 40937 / 903.01 for period: 2021|8|332.48||']
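If you want to reuse this, the steps can be wrapped in a small function. This is a rough sketch rather than a finished solution: `clean_records` is a hypothetical name, I am assuming the whole file fits in memory (which 16MB should), and a final record with no closing || would still get || appended.

```python
import re

# Matches "||" only where it sits directly before a line end (or the end of the text).
RECORD_END = re.compile(r'\|\|$', re.M)

def clean_records(text):
    """Split on end-of-line '||' and turn line breaks inside a record into spaces."""
    parts = RECORD_END.split(text)
    cleaned = (part.replace('\n', ' ').strip() for part in parts)
    # Put back the terminator that the split consumed and drop empty leftovers.
    return [part + '||' for part in cleaned if part]

data = (
    "APA lEDGER|1|a||b||\n"
    "APA lEDGER|2|hello|\n"
    "40937 / 903.01\n"
    "for period: 2021|8||\n"
)
result = clean_records(data)
print(result)
```

The || between `a` and `b` is left alone because it is not followed by a line end.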
CodePudding user response:
You can use some tricks, like this.
[Input]
text = """APA lEDGER|5023|124223|STAFF NAME|XYZ|123||
APA lEDGER|1178|124223|STAFF NAME|XYZ|131|12r2gw||
APA lEDGER|9845|hello|
40937 / 903.01
for period: 2021|8|332.48||
APA lEDGER|9954|124223|STAFF
NAME|XYZ|43s|12|
123sdfq|
prime||
APA lEDGER|3649|124223|STAFF NAME|XYZ|123||
"""
[Code]
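# One line to remove the line breaks, mark each record end, and split.
# Assumes '#' never appears in the data; the last list item is an empty string.
output = text.replace('\n', '').replace('||', '||#').split('#')
# Print the output
print(output)
# Save the output to a file (the with block closes it automatically)
with open('Output.txt', 'w') as f:
    for item in output:
        f.write(f"{item}\n")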
[Output]
APA lEDGER|5023|124223|STAFF NAME|XYZ|123||
APA lEDGER|1178|124223|STAFF NAME|XYZ|131|12r2gw||
APA lEDGER|9845|hello|40937 / 903.01for period: 2021|8|332.48||
APA lEDGER|9954|124223|STAFFNAME|XYZ|43s|12|123sdfq|prime||
APA lEDGER|3649|124223|STAFF NAME|XYZ|123||