I am trying to read a text file that also contains binary data and keep only the text between the d, d, c markers of the strings. Once I have those values, I need to put them together in a list of lists. I am having a difficult time, though: when I print a (see the code below), each string ends up on its own line, and when I print micr_ocr_dat I get a list of elements, but duplicated. See the output examples below.
input.txt
d01234567d1929739798c02798 x\00x\00 x\00x\00 d01234567d1929739798c02798
d7827688d389137c3311 x\00x\00 x\00x\00 d7827688d12233333c3311
Notice that this output duplicates each string.
Output when printing micr_ocr_dat:
['d01234567d1929739798c02798', 'd01234567d1929739798c02798']
['d01234567d1929739798c02798', 'd01234567d1929739798c02798']
['d7827688d389137c3311', 'd7827688d389137c3311']
['d7827688d12233333c3311','d7827688d12233333c3311']
Take a close look at the second list in the expected output: it takes the two strings from the second line of my text file and puts them together in one list, instead of duplicating each one as in the output above.
Expected output:
['d01234567d1929739798c02798', 'd01234567d1929739798c02798']
['d7827688d389137c3311', 'd7827688d12233333c3311']
code:
import re

micr_ocr_dat_l = []

# fp is the path to the input file (defined elsewhere)
with open(fp, 'r', encoding="ISO-8859-1") as in_file:
    data = in_file.readlines()
    print(fp)
    for row in data:
        micr_ocr_line = re.findall(r'd[^d]*d[^d]*c[0-9]+|d[^d]*d[^d]*c\s+[0-9]+', row)
        micr_ocr_dat_l.append(micr_ocr_line)
        for r in micr_ocr_line:
            rmve_spcl_char = re.sub(r'([^a-zA-Z-0-9]+?)', '', r)
            rmve_spcl_char = re.sub(r'(c\d{4,}).*', r'\1', rmve_spcl_char).strip()
            a = [l for l in rmve_spcl_char.split('\n')]
            for previous, current in zip(a, a[::1]):
                micr_ocr_dat = [previous, current]
                print(micr_ocr_dat)
CodePudding user response:
Another approach could be to split the text into words, keep only those matching a certain pattern, and then split those words using re.split():
import re

pattern = re.compile(r"^([dc][0-9]+)+$")

with open('filename', 'r', encoding="ISO-8859-1") as in_file:
    data = in_file.read()

# Split text into words
data = data.split()

final_data = []
for word in data:
    # Check if the word only contains numbers or the letters c and d
    if pattern.match(word):
        # Cut the word at each letter c or d
        values = re.split('[cd]', word)
        final_data.append(values[1:])

print(final_data)
Output:
[
['01234567', '1929739798', '02798'],
['01234567', '1929739798', '02798'],
['7827688', '389137', '3311'],
['7827688', '12233333', '3311']
]
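If the goal is the question's expected output (one list per input line, with the full strings kept intact), a per-line variant could look like the sketch below. It assumes every wanted token has the shape d<digits>d<digits>c<digits>; the function name extract_micr_lines and the in-memory sample rows are made up for illustration and stand in for the lines read from the file.

```python
import re

# Assumed token shape: d<digits>d<digits>c<digits>
TOKEN = re.compile(r"d\d+d\d+c\d+")


def extract_micr_lines(lines):
    """Return one list of matching tokens per input line that has any."""
    result = []
    for line in lines:
        tokens = TOKEN.findall(line)
        if tokens:
            result.append(tokens)
    return result


# Sample rows copied from input.txt in the question
sample = [
    r"d01234567d1929739798c02798 x\00x\00 x\00x\00 d01234567d1929739798c02798",
    r"d7827688d389137c3311 x\00x\00 x\00x\00 d7827688d12233333c3311",
]

rows = extract_micr_lines(sample)
for row in rows:
    print(row)
```

With a real file, the same function could be called on the result of in_file.readlines().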
CodePudding user response:
Apply a regex split to each relevant string; this is short, simple, and efficient.
import re

with open('data.dat', 'r') as f:
    # Skip tokens containing a backslash, then split the rest at each d or c
    data = [re.split(r'd|c', x)[1:] for x in f.read().split() if '\\' not in x]

print(data)
Output:
[['01234567', '1929739798', '02798'], ['01234567', '1929739798', '02798'], ['7827688', '389137', '3311'], ['7827688', '12233333', '3311']]
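As for the duplicated pairs in the question's output: a[::1] is simply a full copy of a, so zip(a, a[::1]) pairs every element with itself. The small illustration below uses made-up list values to show the difference from a shifted slice.

```python
a = ["first", "second"]

# a[::1] copies the whole list, so each element is zipped with itself.
self_pairs = list(zip(a, a[::1]))
print(self_pairs)  # [('first', 'first'), ('second', 'second')]

# a[1:] is shifted by one, so each element is zipped with its successor.
neighbour_pairs = list(zip(a, a[1:]))
print(neighbour_pairs)  # [('first', 'second')]
```

Note that in the question's loop a holds only a single string at that point, so a shifted slice alone would not pair the two matches from one row; the list returned by re.findall for each row already groups them.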