Extracting text from a text file with a recurring nested pattern

I am struggling to extract text from a file. The text is in the following format, with [] marking a delimiter.

File Text:

[Dataset 1] "text" [Filename 1] "text" [Filename 2] "text" [Key Data Delimiter] !key data! [Key Data Delimiter] "text" [Filename 3] "text" [Dataset 2] "text" [Filename 1] [Key Data Delimiter] key data [Key Data Delimeter] "text" [Filename 2] [Dataset 3]...

Desired Output:

[Dataset 1], [Filename 2], !key data!. [Dataset 2], [Filename 1], !key data!.

The filename to report is the one that immediately precedes the key data delimiter within the current dataset (that is, before the next [Dataset ...] marker). Only one file per dataset contains key data.

import re

with open('file.txt') as f:
    # non-greedy capture of whatever sits between two key data delimiters
    text_between_key_delimiters = re.findall(r'\[Key Data Delimiter\](.*?)\[Key Data Delimiter\]', f.read(), re.DOTALL)

I'm thinking of nested for loops with if/else statements, but that seems quite messy. Can someone please point me to docs I should read?

CodePudding user response：

Here's an option without regex, just some string and list manipulations. Somewhat convoluted, but it works:

kds = """[Dataset 1] "text1" [Filename 1] "text2" [Filename 2] "text3" [Key Data Delimiter] !key data1![Key Data Delimiter] "text4" [Filename 3] "text5" [Dataset 2] "text6" [Filename 1] [Key Data Delimiter] key data2 [Key Data Delimeter] "text7" [Filename 2]"""

# split the text file into datasets
nkds = kds.replace('[Dataset','xxx[Dataset').split('xxx')

for ds in nkds[1:]:
    # split each dataset into components, one per '[...]' marker
    nk = ds.replace('[','xxx[').split('xxx')[1:]
    # start the entry with the name of the dataset
    entry = nk[0].replace(']',']xxx').split('xxx')[0]
    for part in nk:
        # find the first component that starts with the key data delimiter
        if '[Key Data Delimiter]' in part:
            # the component just before it holds the file name
            file_ind = nk.index(part)-1
            entry += nk[file_ind].replace(']',']xxx').split('xxx')[0]
            # the key data is the text after the delimiter's closing bracket
            entry += part.split(']')[1].strip()
            break
    print(entry)

Output:

[Dataset 1][Filename 2]!key data1!
[Dataset 2][Filename 1]key data2
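
The same single-pass idea (remember the current dataset and the most recent filename, then grab the text after the first key data delimiter) can also be sketched by tokenizing on the bracketed markers. This is only a rough sketch, not part of the answer above: it uses re.split for the tokenizing, and the names extract_key_data, sample and rows are made up for illustration. It assumes markers never contain a nested ']'.

```python
import re

def extract_key_data(text, delimiter='[Key Data Delimiter]'):
    """Return one (dataset, filename, key_data) tuple per dataset that has key data."""
    # The capture group makes re.split keep the markers, so the result
    # alternates: text, marker, text, marker, ..., text.
    parts = re.split(r'(\[[^\]]*\])', text)
    results = []
    dataset = filename = None
    done = False  # becomes True once the current dataset's key data is recorded
    for i in range(1, len(parts), 2):  # odd indices hold the markers
        marker = parts[i]
        if marker.startswith('[Dataset'):
            dataset, filename, done = marker, None, False
        elif marker.startswith('[Filename'):
            filename = marker
        elif marker == delimiter and not done:
            # the text right after the opening delimiter is the key data
            results.append((dataset, filename, parts[i + 1].strip()))
            done = True
    return results

sample = ('[Dataset 1] "text1" [Filename 1] "text2" [Filename 2] "text3" '
          '[Key Data Delimiter] !key data1![Key Data Delimiter] "text4" '
          '[Filename 3] "text5" [Dataset 2] "text6" [Filename 1] '
          '[Key Data Delimiter] key data2 [Key Data Delimeter] "text7" [Filename 2]')

rows = extract_key_data(sample)
for row in rows:
    print(''.join(row))
# [Dataset 1][Filename 2]!key data1!
# [Dataset 2][Filename 1]key data2
```

Because only the opening delimiter is matched, the misspelled closing delimiter in the second dataset does not matter here.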

CodePudding user response：

With the re.findall function, you could try:

import re

with open('file.txt') as f:
    for line in f:
        m = re.findall(r'(\[Dataset.*?\]).*?(\[Filename[^]]*\])[^[]*\[Key Data Delimiter\](.*?)\[Key Data Delimiter\]', line)
        print([x for i in m for x in i])        # flatten list of tuples

Output:

['[Dataset 1]', '[Filename 2]', ' !key data! ', '[Dataset 2]', '[Filename 1]', ' key data ']

The regex captures three things: the dataset marker, the filename marker that immediately precedes the key data delimiter, and the key data enclosed between the two key data delimiters.

The result is purposely flattened to match the desired output, but depending on the usage it might be better to keep the original 2-D structure (a list of tuples, one per dataset).
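
For instance, keeping one tuple per dataset makes it easy to print each record in the "dataset, filename, key data." form from the question. A minimal sketch, assuming a pattern of the same shape as the one above and a sample line with the delimiter spelled consistently; PATTERN, line and records are illustrative names only:

```python
import re

# Same shape as the pattern above, compiled once for reuse.
PATTERN = re.compile(
    r'(\[Dataset[^]]*\]).*?(\[Filename[^]]*\])[^[]*'
    r'\[Key Data Delimiter\](.*?)\[Key Data Delimiter\]'
)

# Sample line from the question, with the closing delimiter typo corrected.
line = ('[Dataset 1] "text" [Filename 1] "text" [Filename 2] "text" '
        '[Key Data Delimiter] !key data! [Key Data Delimiter] "text" '
        '[Filename 3] "text" [Dataset 2] "text" [Filename 1] '
        '[Key Data Delimiter] key data [Key Data Delimiter] "text" '
        '[Filename 2] [Dataset 3]')

# findall returns one (dataset, filename, key data) tuple per match;
# strip the whitespace that surrounds the key data.
records = [(ds, fn, data.strip()) for ds, fn, data in PATTERN.findall(line)]

for ds, fn, data in records:
    print(f'{ds}, {fn}, {data}.')
# [Dataset 1], [Filename 2], !key data!.
# [Dataset 2], [Filename 1], key data.
```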

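BTW, your file.txt has a typo in the 2nd dataset: the closing delimiter is spelled Key Data Delimeter, so the regex will not match that dataset until the spelling is corrected.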
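
If correcting the file is not an option, one possible workaround is to let the pattern accept both spellings. This is only a sketch under that assumption; DELIM, PATTERN, line and matches are illustrative names, and the character class [ie] covers just this one misspelling:

```python
import re

# Accept both "Delimiter" and the misspelled "Delimeter".
DELIM = r'\[Key Data Delim[ie]ter\]'
PATTERN = re.compile(
    r'(\[Dataset[^]]*\]).*?(\[Filename[^]]*\])[^[]*' + DELIM + r'(.*?)' + DELIM
)

# Sample line from the question, typo left in place.
line = ('[Dataset 1] "text" [Filename 1] "text" [Filename 2] "text" '
        '[Key Data Delimiter] !key data! [Key Data Delimiter] "text" '
        '[Filename 3] "text" [Dataset 2] "text" [Filename 1] '
        '[Key Data Delimiter] key data [Key Data Delimeter] "text" '
        '[Filename 2] [Dataset 3]')

matches = [(ds, fn, data.strip()) for ds, fn, data in PATTERN.findall(line)]
print(matches)
# [('[Dataset 1]', '[Filename 2]', '!key data!'), ('[Dataset 2]', '[Filename 1]', 'key data')]
```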