Suppose you have some text that you want to split into chunks and write to separate files, using Son Huang's solution (which is based on l'mahdi's solution).
Now suppose the text is modified so that each line starting with note:: has some additional text before a comma, and each chunk has one more line, starting with highlight::, as shown here:
INPUT
company:: acme products
department:: sales
floor:: 1

name:: Joe Blogs
phone:: 123456789
email:: [email protected]
address:: 123 Main Street
note:: highlight text, blah blah blah
timestamp::
highlight::

name:: Josephine Blogs
phone:: 43217890
email:: [email protected]
address:: 123 Main Street
note:: Another highlight here, More blah blah
timestamp::
highlight::

name:: John Smith
phone:: 23498689
email:: [email protected]
address:: 1 North Street
note:: Amazing text, Some more blah
timestamp::
highlight::
What needs to be added to Son Huang's solution to get the following result? The text before the comma on the line starting with note:: now appears on the line starting with highlight:: (and the comma is gone).
DESIRED OUTPUT
# chunk_1.txt
name:: Joe Blogs
phone:: 123456789
email:: [email protected]
address:: 123 Main Street
note:: blah blah blah
timestamp:: 2022-08-07 (13h 10m 08s)
highlight:: highlight text
company:: acme products
department:: sales
floor:: 1
# chunk_2.txt
name:: Josephine Blogs
phone:: 43217890
email:: [email protected]
address:: 123 Main Street
note:: More blah blah
timestamp:: 2022-08-07 (13h 10m 09s)
highlight:: Another highlight here
company:: acme products
department:: sales
floor:: 1
# chunk_3.txt
name:: John Smith
phone:: 23498689
email:: [email protected]
address:: 1 North Street
note:: Some more blah
timestamp:: 2022-08-07 (13h 10m 10s)
highlight:: Amazing text
company:: acme products
department:: sales
floor:: 1
CodePudding user response:
Updating Son Hoang's solution:
from datetime import datetime
import time
import re

with open('test.txt') as f:
    # The header is everything before the first blank line
    header, content = f.read().split('\n\n', maxsplit=1)

for n, chunk in enumerate(content.split('\n\n'), start=1):
    timestamp = datetime.now().strftime('%Y-%m-%d (%Hh %Mm %Ss)')
    chunk = re.sub(r'(timestamp::)', fr'\1 {timestamp}', chunk)
    # Split the note:: line into 3 groups: 1. (note::\s+), 2. ([^,]+) and 3. (.*)
    # The [ \t]* after the comma keeps the leading blank out of group 3
    note_pattern = r'(note::\s+)([^,]+),[ \t]*(.*)'
    note = re.search(note_pattern, chunk)
    # Keep the note without its 2nd group (i.e. \1 and \3 only)
    chunk = re.sub(note_pattern, r'\1\3', chunk)
    # Append the 2nd group to highlight:: (no space is inserted, so the line
    # reads e.g. "highlight::highlight text"; use '\g<1> ' to add one)
    chunk = re.sub(r'(highlight::)', fr'\g<1>{note.group(2)}', chunk)
    chunk = chunk.strip() + '\n' + header
    print(chunk)
    print()
    with open(f'chunk_{n}.txt', 'w') as f_out:
        f_out.write(chunk)
    time.sleep(1)  # pause so that consecutive chunks get different timestamps
Input File: test.txt
company:: acme products
department:: sales
floor:: 1

name:: Joe Blogs
phone:: 123456789
email:: [email protected]
address:: 123 Main Street
note:: highlight text, blah blah blah
timestamp::
highlight::

name:: Josephine Blogs
phone:: 43217890
email:: [email protected]
address:: 123 Main Street
note:: Another highlight here, More blah blah
timestamp::
highlight::

name:: John Smith
phone:: 23498689
email:: [email protected]
address:: 1 North Street
note:: Amazing text, Some more blah
timestamp::
highlight::
Output File: chunk_1.txt
name:: Joe Blogs
phone:: 123456789
email:: [email protected]
address:: 123 Main Street
note:: blah blah blah
timestamp:: 2022-08-07 (04h 59m 56s)
highlight::highlight text
company:: acme products
department:: sales
floor:: 1
File: chunk_2.txt
name:: Josephine Blogs
phone:: 43217890
email:: [email protected]
address:: 123 Main Street
note:: More blah blah
timestamp:: 2022-08-07 (04h 59m 57s)
highlight::Another highlight here
company:: acme products
department:: sales
floor:: 1
File: chunk_3.txt
name:: John Smith
phone:: 23498689
email:: [email protected]
address:: 1 North Street
note:: Some more blah
timestamp:: 2022-08-07 (04h 59m 58s)
highlight::Amazing text
company:: acme products
department:: sales
floor:: 1
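The regex step above can also be tried on its own, without reading or writing files. The following is only a sketch under a few assumptions: the helper name move_highlight is made up for illustration, each chunk is assumed to hold at most one note:: line, and a chunk whose note has no comma is returned unchanged instead of raising an error. Unlike the code above, it writes a space after highlight::, as in the desired output.

```python
import re

# Hypothetical helper for illustration; assumes at most one note:: line per chunk.
NOTE_RE = re.compile(r'^note::[ \t]*([^,\n]+),[ \t]*(.*)$', re.MULTILINE)


def move_highlight(chunk: str) -> str:
    """Move the text before the first comma of note:: onto the highlight:: line."""
    match = NOTE_RE.search(chunk)
    if match is None:
        # No comma-separated note found: leave the chunk untouched.
        return chunk
    highlight = match.group(1).strip()
    remainder = match.group(2).strip()
    # Callable replacements avoid backslash or group-reference surprises
    # if the note text itself contains '\' or starts with a digit.
    chunk = NOTE_RE.sub(lambda m: f'note:: {remainder}', chunk, count=1)
    chunk = re.sub(r'^highlight::[ \t]*$',
                   lambda m: f'highlight:: {highlight}',
                   chunk, count=1, flags=re.MULTILINE)
    return chunk


sample = (
    'name:: Joe Blogs\n'
    'note:: highlight text, blah blah blah\n'
    'timestamp::\n'
    'highlight::'
)
print(move_highlight(sample))
```

In the loop above, this function could stand in for the three note/highlight lines, with the timestamp substitution left as it is.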
CodePudding user response:
Ideally, you would parse the data properly and edit it that way. If you just want a quick and dirty solution, though, this would work.
You could loop over each line, check whether it starts with 'note:: ', and then split it at the first comma.
I'm assuming that your data structures are unordered, so it's OK if I output the properties in a slightly different order.
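# file: an open input file; output: placeholder that writes one line
for line in file:
    line = line.rstrip('\n')  # drop the newline so it doesn't end up in the note text
    if line.startswith('note:: ') and ', ' in line:
        highlight, remainder = line.split(', ', 1)
        # str.removeprefix requires Python 3.9+
        highlight = highlight.removeprefix('note:: ')
        # Write note and highlight as separate lines
        output(f'note:: {remainder}')
        output(f'highlight:: {highlight}')
    elif line.startswith('highlight::'):
        # Skip the original highlights
        pass
    else:
        output(line)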
Here, output is a placeholder: replace it with whatever function you're using to write to your output file.
Keep in mind that this code isn't very robust, though. If you want this to be reliable, you should definitely create a system for parsing this data properly.
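As a rough idea of what "parsing properly" might look like, here is a minimal sketch, not a definitive implementation. It assumes every line has the form key:: value, that keys are unique within a chunk, and that lines without :: can be dropped. The names parse_record, split_note and format_record are invented for this example.

```python
from typing import List, Tuple

Record = List[Tuple[str, str]]  # ordered (key, value) pairs


def parse_record(chunk: str) -> Record:
    """Parse 'key:: value' lines into ordered pairs; other lines are skipped."""
    record = []
    for line in chunk.splitlines():
        key, sep, value = line.partition('::')
        if sep:  # only keep lines that actually contain '::'
            record.append((key.strip(), value.strip()))
    return record


def split_note(record: Record) -> Record:
    """Move the text before the first comma of 'note' into 'highlight'."""
    fields = dict(record)  # assumes keys are unique within a chunk
    before, comma, after = fields.get('note', '').partition(',')
    if comma:
        fields['note'] = after.strip()
        fields['highlight'] = before.strip()
    # Preserve the original key order; append highlight if it was missing.
    keys = [key for key, _ in record]
    if 'highlight' in fields and 'highlight' not in keys:
        keys.append('highlight')
    return [(key, fields[key]) for key in keys]


def format_record(record: Record) -> str:
    """Serialize pairs back to 'key:: value' lines (empty values give 'key::')."""
    return '\n'.join(f'{key}:: {value}'.rstrip() for key, value in record)


chunk = (
    'name:: Joe Blogs\n'
    'note:: highlight text, blah blah blah\n'
    'timestamp::\n'
    'highlight::'
)
print(format_record(split_note(parse_record(chunk))))
```

Working on key/value pairs rather than raw lines keeps the original property order and makes it easy to add further edits, such as filling in the timestamp.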
CodePudding user response:
This code snippet should work for you. Feel free to optimize it as needed:
from datetime import datetime
import time
import re

with open('input.txt') as f:
    header, content = f.read().split('\n\n', maxsplit=1)

for n, chunk in enumerate(content.split('\n\n'), start=1):
    timestamp = datetime.now().strftime('%Y-%m-%d (%Hh %Mm %Ss)')
    chunk = re.sub(r'(timestamp::)', fr'\1 {timestamp}', chunk)
    substitute1, substitute2, substitute3 = ("note:: ", "\n", "highlight::")
    # Locate the note text: from just after "note:: " to the end of that line
    idx1 = chunk.find(substitute1)
    idx2 = chunk.find(substitute2, idx1)
    text_chunk = chunk[idx1 + len(substitute1): idx2]
    # Split at the first comma only: [highlight text, remaining note]
    lst_chunk = text_chunk.split(',', 1)
    # Cut the old note text out by position (re.sub would treat it as a pattern)
    chunk = chunk[:idx1 + len(substitute1)] + chunk[idx2:]
    # substitute1 already ends with a space, so none is added after \g<1> here
    chunk = re.sub(r'(' + substitute1 + ')', fr'\g<1>{lst_chunk[1].strip()}', chunk)
    chunk = re.sub(r'(' + substitute3 + ')', fr'\1 {lst_chunk[0].strip()}', chunk)
    chunk = chunk.strip() + '\n' + header
    with open(f'chunk_{n}.txt', 'w') as f_out:
        f_out.write(chunk)
    time.sleep(1)
Output:
# chunk_1.txt
name:: Joe Blogs
phone:: 123456789
email:: [email protected]
address:: 123 Main Street
note:: blah blah blah
timestamp:: 2022-08-07 (13h 56m 52s)
highlight:: highlight text
company:: acme products
department:: sales
floor:: 1
# chunk_2.txt
name:: Josephine Blogs
phone:: 43217890
email:: [email protected]
address:: 123 Main Street
note:: More blah blah
timestamp:: 2022-08-07 (13h 56m 53s)
highlight:: Another highlight here
company:: acme products
department:: sales
floor:: 1
# chunk_3.txt
name:: John Smith
phone:: 23498689
email:: [email protected]
address:: 1 North Street
note:: Some more blah
timestamp:: 2022-08-07 (13h 56m 54s)
highlight:: Amazing text
company:: acme products
department:: sales
floor:: 1
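For experimenting without touching the disk, the whole transformation can be sketched as a function that returns the file contents instead of writing them. This is an illustration under stated assumptions, not the method used above: build_chunks is a made-up name, the note:: line is assumed to come before highlight:: in each chunk (as in the sample input), and every chunk gets the same timestamp because there is no one-second pause.

```python
from datetime import datetime
from typing import Dict, Optional


def build_chunks(text: str, now: Optional[datetime] = None) -> Dict[str, str]:
    """Return {filename: content} for each blank-line-separated record in text."""
    now = now or datetime.now()
    stamp = now.strftime('%Y-%m-%d (%Hh %Mm %Ss)')
    # Everything before the first blank line is the shared header.
    header, _, body = text.strip().partition('\n\n')
    files = {}
    for n, chunk in enumerate(body.split('\n\n'), start=1):
        lines = []
        highlight = ''
        for line in chunk.strip().splitlines():
            if line.startswith('note::'):
                before, comma, after = line[len('note::'):].partition(',')
                if comma:  # only rewrite notes that contain a comma
                    highlight = before.strip()
                    line = f'note:: {after.strip()}'
            elif line.startswith('timestamp::'):
                line = f'timestamp:: {stamp}'
            elif line.startswith('highlight::') and highlight:
                line = f'highlight:: {highlight}'
            lines.append(line)
        # The header is appended after each record, as in the desired output.
        files[f'chunk_{n}.txt'] = '\n'.join(lines + [header])
    return files


text = (
    'company:: acme products\n'
    'floor:: 1\n'
    '\n'
    'name:: Joe Blogs\n'
    'note:: highlight text, blah blah blah\n'
    'timestamp::\n'
    'highlight::\n'
    '\n'
    'name:: John Smith\n'
    'note:: Amazing text, Some more blah\n'
    'timestamp::\n'
    'highlight::\n'
)
for name, content in build_chunks(text).items():
    print(f'# {name}\n{content}\n')
```

Writing the files is then a separate loop over the returned dictionary, and passing a fixed datetime as now makes the result reproducible.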