How do I merge every single batch of consecutive lines in a .txt file?
Example:
Turn this:
User#0001
Hello
Whats Up
User#0002
Hi
...
into this:
User#0001 Hello Whats Up
User#0002 Hi
...
I want to merge all of the lines because when I've tried doing this:
import re

pattern = r'([a-zA-Z]+#[0-9]+.)(.+?(?:^$|\Z))'
data = {
    'name': [],
    'message': []
}
with open('chat.txt', 'rt') as file:
    for message in file.readlines():
        match = re.findall(pattern, message, flags=re.M|re.S)
        print(match)
        if match:
            name, message = match[0]
            data['name'].append(name)
            data['message'].append(message)
I got this when printing 'match':
[('User#0001', '\n')]
[]
[]
[]
[('User#0002', '\n')]
...
When I manually edit some of the lines to read "User#0001 message", it does return the correct output.
CodePudding user response:
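I would phrase your requirement using re.sub, assuming the blocks in the file are separated by a blank line (as the ^$ in your pattern suggests):
import re

inp = """User#0001
Hello
Whats Up

User#0002
Hi"""
# Replace a newline with a space unless it follows another newline
# or is followed by whitespace (or the end of the text)
output = re.sub(r'(?<!\n)\n(?=\S)', ' ', inp)
print(output)
This prints:
User#0001 Hello Whats Up

User#0002 Hi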
The regex used here says to match:
(?<!\n)   assert that a newline does not precede
\n        match a single newline
(?=\S)    assert that a non-whitespace character follows
The (?<!\n) ensures that we do not remove the newline on the line before a text block begins. The (?=\S) ensures that we do not remove the final newline in a text block.
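To run the same substitution on text read from a file rather than a literal string, a minimal sketch could look like this. The helper name merge_blocks is made up for illustration, and the sketch assumes the file is small enough to read in one go and that blocks are separated by a blank line:

```python
import re

# Hypothetical helper (not part of the answer above): joins the lines of each
# blank-line-separated block into a single line.
def merge_blocks(text: str) -> str:
    # A newline is replaced only when it is not preceded by another newline
    # and is followed by a non-whitespace character.
    return re.sub(r'(?<!\n)\n(?=\S)', ' ', text)

# Stand-in for the file contents; the blank line separates the two blocks.
sample = "User#0001\nHello\nWhats Up\n\nUser#0002\nHi\n"
merged = merge_blocks(sample)
print(merged)

# With a real file (the name is taken from the question):
# with open('chat.txt', 'rt') as file:
#     merged = merge_blocks(file.read())
```

Reading the whole file with file.read() matters here: the substitution needs to see the newlines between lines, which is not possible when each line is matched on its own.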
CodePudding user response:
Another solution:
import re
s = """\
User#0001
Hello
Whats Up
User#0002
Hi"""
pat = re.compile(r"^(\S+#\d+)\s*(.*?)\s*(?=^\S+#\d+|\Z)", flags=re.M | re.S)
out = [(user, messages.splitlines()) for user, messages in pat.findall(s)]
print(out)
Prints:
[('User#0001', ['Hello', 'Whats Up']), ('User#0002', ['Hi'])]
If you want to join the messages to one line:
for user, messages in out:
    print(user, " ".join(messages))
Prints:
User#0001 Hello Whats Up
User#0002 Hi
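If the goal is still the dictionary from the question, the same findall results can probably be fed into it directly. This is only a sketch: it reuses the pattern from this answer and the 'name' / 'message' keys from the question, and it assumes every block starts with a name of the form text#digits:

```python
import re

s = """\
User#0001
Hello
Whats Up
User#0002
Hi"""

pat = re.compile(r"^(\S+#\d+)\s*(.*?)\s*(?=^\S+#\d+|\Z)", flags=re.M | re.S)

# Same layout as the dictionary in the question.
data = {'name': [], 'message': []}
for user, messages in pat.findall(s):
    data['name'].append(user)
    # Collapse the message lines of one block into a single string.
    data['message'].append(" ".join(messages.splitlines()))

print(data)
```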
CodePudding user response:
First, I suspect that you need this to keep a history of the chat.
In that case I would say that you do not need a dictionary.
I propose a list where each element is a (user, message) pair.
Second, complexity brings difficulties and bugs. Do you really need a regex?
What's wrong with this simple solution:
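t = [
    "User#0001\n",
    "Hello\n",
    "Whats Up\n",
    "\n",
    "\n",
    "User#0002\n",
    "Hi\n",
    "...\n",
]
data = []
for line in t:
    line = line.strip()  # remove surrounding spaces and the trailing \n
    if not line:
        continue  # skip blank separator lines
    if line.startswith("User#"):
        data.append([line, ""])  # start a new [user, message] entry
    elif data:
        # add the line to the current user's message, separated by a space
        data[-1][1] = (data[-1][1] + " " + line).strip()
for user, message in data:
    print(user, message)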
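The same loop can be wrapped in a function that accepts any iterable of lines, so it works on an open file as well as on a list. This is a sketch only: the names collect_messages and prefix are made up for illustration, and io.StringIO stands in for the real file:

```python
from io import StringIO

def collect_messages(lines, prefix="User#"):
    """Group lines into (user, message) tuples (illustrative helper)."""
    entries = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # blank separator line
        if line.startswith(prefix):
            entries.append([line, []])   # new user, no message lines yet
        elif entries:
            entries[-1][1].append(line)  # message line for the current user
    # Join each user's message lines with a single space.
    return [(user, " ".join(parts)) for user, parts in entries]

# StringIO stands in for open('chat.txt'); any iterable of lines works.
fake_file = StringIO("User#0001\nHello\nWhats Up\n\nUser#0002\nHi\n")
pairs = collect_messages(fake_file)
print(pairs)
```

Collecting the message lines in a list and joining them at the end avoids leading or trailing spaces in the merged message.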
CodePudding user response:
For the format of the given example, if you want to keep the same amount of newlines, you can use a pattern with 3 capture groups.
^([a-zA-Z]+#[0-9]+)((?:\n(?![a-zA-Z]+#[0-9]).+)*)(\n*)
The pattern matches:
^ Start of the string (or of a line when re.M is used)
([a-zA-Z]+#[0-9]+) Capture group 1
( Capture group 2
    (?: Non capture group
        \n Match a newline
        (?![a-zA-Z]+#[0-9]) Negative lookahead, assert not 1+ chars a-zA-Z to the right followed by # and a digit
        .+ Match 1+ chars (In your pattern you used ^$ to stop when there is an empty string, but you can also make sure to match 1 or more characters)
    )* Close the non capture group and optionally repeat it to also allow 0 occurrences
) Close group 2
(\n*) Capture group 3, the newlines that follow the block, so that they can be written back unchanged
import re
s = """User#0001
Hello
Whats Up
User#0002
Hi
User#0003"""
pattern = r"^([a-zA-Z]+#[0-9]+)((?:\n(?![a-zA-Z]+#[0-9]).+)*)(\n*)"
result = []
for (u, m, n) in re.findall(pattern, s, re.M):
    result.append(f"{' '.join([u] + m.split())}{n}")
print("".join(result))
Output
User#0001 Hello Whats Up
User#0002 Hi
User#0003
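To see what group 3 does, here is a small sketch with blank lines between the blocks. The sample input is made up (one blank line, then two blank lines), and the number of newlines between blocks should come out unchanged:

```python
import re

pattern = r"^([a-zA-Z]+#[0-9]+)((?:\n(?![a-zA-Z]+#[0-9]).+)*)(\n*)"

# Made-up input: one blank line after the first block, two after the second.
s = "User#0001\nHello\nWhats Up\n\nUser#0002\nHi\n\n\nUser#0003"

merged = "".join(
    # u: user name, m: message lines, n: the newlines that followed the block
    f"{' '.join([u] + m.split())}{n}"
    for u, m, n in re.findall(pattern, s, re.M)
)
print(repr(merged))
```

Note that m.split() splits on any whitespace, so runs of several spaces inside a message are collapsed to one space in the merged line.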