I am struggling to extract text from a file. The text is in the following format, with [] signifying a delimiter.
File Text:
[Dataset 1] "text" [Filename 1] "text" [Filename 2] "text" [Key Data Delimiter] !key data! [Key Data Delimiter] "text" [Filename 3] "text" [Dataset 2] "text" [Filename 1] [Key Data Delimiter] key data [Key Data Delimeter] "text" [Filename 2] [Dataset 3]...
Desired Output:
[Dataset 1], [Filename 2], !key data!. [Dataset 2], [Filename 1], !key data!.
The filename to report is the one immediately before the key data delimiter, within the same Dataset (that is, before the next Dataset begins). Only one file per Dataset contains key data.
import re
f = open('file.txt', 'r')
TextBetween_KeyDataDelimeter = re.findall('KeyDataDelimiter(.*?)KeyDataDelimiter', f.read(), re.DOTALL)
I'm thinking of nested for loops with if/else statements, but that seems quite messy. Can someone please point me to docs I should read to help me out?
Answer:
Here's an option without regex, just some string and list manipulations. Somewhat convoluted, but it works:
kds = """[Dataset 1] "text1" [Filename 1] "text2" [Filename 2] "text3" [Key Data Delimiter] !key data1![Key Data Delimiter] "text4" [Filename 3] "text5" [Dataset 2] "text6" [Filename 1] [Key Data Delimiter] key data2 [Key Data Delimeter] "text7" [Filename 2]"""
# split the text file into datasets
nkds = kds.replace('[Dataset','xxx[Dataset').split('xxx')
for k in nkds[1:]:
    entry = ''
    # split each dataset into components
    nk = k.replace('[','xxx[').split('xxx')[1:]
    # get the name of the dataset
    entry += nk[0].replace(']',']xxx').split('xxx')[0]
    for part in nk:
        # find the index position of the delimiter in the dataset list
        if '[Key Data Delimiter]' in part:
            # get the previous index position for the file name
            file_ind = nk.index(part)-1
            entry += nk[file_ind].replace(']',']xxx').split('xxx')[0]
            # the key data is whatever follows the delimiter in this component
            entry += part.split(']')[1].strip()
            break
    print(entry)
Output:
[Dataset 1][Filename 2]!key data1!
[Dataset 2][Filename 1]key data2
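The same idea can probably be written a bit more compactly with `str.partition` and `str.rfind`, still without regex. This is only a sketch: the function name `extract` and the constant `DELIM` are made up for illustration, and it assumes the key data itself never contains a `[` character.

```python
DELIM = "[Key Data Delimiter]"  # hypothetical constant, spelled as in the sample


def extract(text):
    """Return (dataset, filename, key_data) tuples, one per dataset with key data."""
    rows = []
    # every chunk after the first begins right after a "[Dataset" marker
    for chunk in text.split("[Dataset")[1:]:
        chunk = "[Dataset" + chunk  # restore the marker removed by split
        dataset = chunk[:chunk.index("]") + 1]
        before, found, after = chunk.partition(DELIM)
        if not found:
            continue  # no key data in this dataset
        # the last filename before the delimiter is the one we want
        start = before.rfind("[Filename")
        if start == -1:
            continue
        filename = before[start:before.index("]", start) + 1]
        # assumption: key data runs up to the next "[" token
        key_data = after.split("[", 1)[0].strip()
        rows.append((dataset, filename, key_data))
    return rows


sample = ('[Dataset 1] "text1" [Filename 1] "text2" [Filename 2] "text3" '
          '[Key Data Delimiter] !key data1![Key Data Delimiter] "text4" '
          '[Filename 3] "text5" [Dataset 2] "text6" [Filename 1] '
          '[Key Data Delimiter] key data2 [Key Data Delimeter] "text7" [Filename 2]')

for row in extract(sample):
    print("".join(row))
```

Because the key data is cut at the next `[`, the misspelled closing delimiter in the second dataset does not matter here.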
Answer:
With the re.findall function, would you please try:
import re
with open('file.txt') as f:
    for line in f:
        m = re.findall(r'(\[Dataset.*?\]).*?(\[Filename.[^]]?\])[^[]*\[Key Data Delimiter\](.*?)\[Key Data Delimiter\]', line)
        print([x for i in m for x in i])  # flatten list of tuples
Output:
['[Dataset 1]', '[Filename 2]', ' !key data! ', '[Dataset 2]', '[Filename 1]', ' key data ']
The regex matches the dataset, the filename that immediately precedes the key delimiter, and the key data surrounded by the key delimiters.
The result is purposely flattened to meet the desired output but it might be better to keep the original 2-d structure depending on the usage.
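For reference, keeping the 2-d structure could look roughly like the sketch below. It reuses the same pattern, but the sample line is typed in directly with the delimiter spelled consistently (an assumption, so that no file is needed), and the names `PATTERN`, `line` and `rows` are only illustrative.

```python
import re

# same pattern as above, split across two raw strings for readability
PATTERN = re.compile(
    r'(\[Dataset.*?\]).*?(\[Filename.[^]]?\])[^[]*'
    r'\[Key Data Delimiter\](.*?)\[Key Data Delimiter\]'
)

# assumption: sample text with the closing delimiter spelled correctly
line = ('[Dataset 1] "text" [Filename 1] "text" [Filename 2] "text" '
        '[Key Data Delimiter] !key data! [Key Data Delimiter] "text" '
        '[Filename 3] "text" [Dataset 2] "text" [Filename 1] '
        '[Key Data Delimiter] key data [Key Data Delimiter] "text" [Filename 2]')

# findall returns one tuple per match; strip the spaces around the key data
rows = [(ds, fn, data.strip()) for ds, fn, data in PATTERN.findall(line)]

for ds, fn, data in rows:
    print(f'{ds}, {fn}, {data}.')
```

Each tuple then stays tied to its dataset, which tends to be easier to work with than a flat list when the values are needed later.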
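BTW your file.txt has a typo in the 2nd dataset: Key Data Delimeter instead of Key Data Delimiter. The regex above only matches that dataset once the spelling is corrected.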
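If correcting the file is not an option, one possible workaround is to let the pattern accept both spellings. This is a sketch under that assumption only; `KEY`, `TOLERANT` and `line_with_typo` are hypothetical names.

```python
import re

# "Delim[ie]ter" accepts both "Delimiter" and the misspelled "Delimeter"
KEY = r'\[Key Data Delim[ie]ter\]'
TOLERANT = re.compile(
    r'(\[Dataset.*?\]).*?(\[Filename.[^]]?\])[^[]*' + KEY + r'(.*?)' + KEY
)

# the 2nd dataset from the question, typo included
line_with_typo = ('[Dataset 2] "text" [Filename 1] [Key Data Delimiter] '
                  'key data [Key Data Delimeter] "text" [Filename 2]')

matches = TOLERANT.findall(line_with_typo)
print(matches)
```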