Reading a Text File, Extracting Strings, and Joining Them in a List

I am trying to read a text file that contains binary data and get only the strings delimited by the letters d, d, and c. Once I have those values I need to put them together in a list of lists. However, I am having a difficult time: when I print a (see the code below), the strings end up on separate lines, and when I print micr_ocr_dat I get a list of elements, but duplicated. See the output examples below.

input.txt

d01234567d1929739798c02798 x\00x\00 x\00x\00           d01234567d1929739798c02798 
d7827688d389137c3311    x\00x\00 x\00x\00                   d7827688d12233333c3311

Notice that this output duplicates each string.

Output when printing micr_ocr_dat:

['d01234567d1929739798c02798' ,'d01234567d1929739798c02798']
['d01234567d1929739798c02798' ,'d01234567d1929739798c02798']
['d7827688d389137c3311', 'd7827688d389137c3311']
['d7827688d12233333c3311','d7827688d12233333c3311']

Take a close look at the second list in the expected output: it takes the two strings from the second line of my text file and puts them together in one list, instead of duplicating each string as in the output above.

Expected output:

['d01234567d1929739798c02798' ,'d01234567d1929739798c02798']
['d7827688d389137c3311','d7827688d12233333c3311']

code:

            with open(fp, 'r', encoding="ISO-8859-1") as in_file:
                data = in_file.readlines()
                print(fp)
                for row in data:
                    micr_ocr_line = re.findall(r'd[^d]*d[^d]*c[0-9]+|d[^d]*d[^d]*c\s+[0-9]+', row)
                    micr_ocr_dat_l.append(micr_ocr_line)
                    for r in micr_ocr_line:
                        rmve_spcl_char = re.sub(r'([^a-zA-Z-0-9]+?)', '', r)
                        rmve_spcl_char = re.sub(r'(c\d{4,}).*', r'\1', rmve_spcl_char).strip()
                        a = [l for l in rmve_spcl_char.split('\n')]
                        for previous, current in zip(a, a[::1]):
                            micr_ocr_dat = [previous, current]
                            print(micr_ocr_dat)

CodePudding user response：

Another approach is to split the text into words, keep only the words that match a certain pattern, and then split those words using re.split().

import re

pattern = re.compile(r"^([dc][0-9]+)+$")

with open('filename', 'r', encoding="ISO-8859-1") as in_file:
    data = in_file.read()

# Split text into words
data = data.split()

final_data = []
for word in data:
    # Check if the word only contains numbers or the letters c and d
    if pattern.match(word):
        # Cut the word at each letter c or d
        values = re.split('[cd]', word)
        final_data.append(values[1:])

print(final_data)

Output:

[
    ['01234567', '1929739798', '02798'], 
    ['01234567', '1929739798', '02798'], 
    ['7827688', '389137', '3311'], 
    ['7827688', '12233333', '3311']
]
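
If you want the shape from the question instead (one list per line of the file, with the full strings kept intact), a per-line variant might look like the sketch below. The duplicates in the question appear to come from zip(a, a[::1]): a[::1] is just a copy of a, so every string gets paired with itself. The helper name extract_pairs and the token pattern are assumptions based on the two sample lines, not something from the original code.

```python
import re

# Assumed token shape, taken from the sample input: d<digits>d<digits>c<digits>
TOKEN = re.compile(r"d[0-9]+d[0-9]+c[0-9]+")

def extract_pairs(lines):
    """Return one list of matched strings per input line (hypothetical helper)."""
    result = []
    for line in lines:
        found = TOKEN.findall(line)  # no groups, so full matches are returned
        if found:                    # skip lines without any match
            result.append(found)
    return result

# Sample lines copied from input.txt in the question
sample = [
    r"d01234567d1929739798c02798 x\00x\00 x\00x\00           d01234567d1929739798c02798 ",
    r"d7827688d389137c3311    x\00x\00 x\00x\00                   d7827688d12233333c3311",
]

pairs = extract_pairs(sample)
for pair in pairs:
    print(pair)
```

With a real file you would pass the open file object (or the result of readlines()) to extract_pairs instead of the sample list.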

CodePudding user response：

Apply a regex split to each relevant string; this is short and reasonably efficient. Words containing a backslash (the x\00 filler in the sample) are skipped.

import re

with open('data.dat', 'r') as f:
    data = [re.split(r'd|c', x)[1:] for x in f.read().split() if '\\' not in x] 
    
print(data)

Output:

[['01234567', '1929739798', '02798'], ['01234567', '1929739798', '02798'], ['7827688', '389137', '3311'], ['7827688', '12233333', '3311']]
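
To see why both answers drop the first element with [1:], here is what re.split returns for a single word from the sample. The second half is only a sketch of a stricter filter, assuming every wanted word consists solely of d/c markers followed by digits; the backslash check above works for the sample as shown but may not hold for other filler bytes.

```python
import re

word = "d7827688d389137c3311"  # one word from the sample input

parts = re.split(r"[cd]", word)
# The word starts with 'd', so the first piece is an empty string
print(parts)      # ['', '7827688', '389137', '3311']
print(parts[1:])  # ['7827688', '389137', '3311']

# Assumed stricter filter: the whole word must be marker+digits groups
token = re.compile(r"(?:[dc][0-9]+)+")
is_token = token.fullmatch(word) is not None
is_filler_token = token.fullmatch(r"x\00x\00") is not None
print(is_token)         # True
print(is_filler_token)  # False
```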