How to merge/join consecutive lines

How do I merge every single batch of consecutive lines in a .txt file?

Example:

Turn this:

User#0001
Hello
Whats Up


User#0002
Hi
...

into this:

User#0001 Hello Whats Up


User#0002 Hi
...

I want to merge all of the lines because when I've tried doing this:

pattern = r'([a-zA-Z]+#[0-9]+.)(.+?(?:^$|\Z))'

data = {
    'name': [],
    'message': []
}

with open('chat.txt', 'rt') as file:
  for message in file.readlines():
    match = re.findall(pattern, message, flags=re.M|re.S)
    print(match)
    if match:
      name, message = match[0]
      data['name'].append(name)
      data['message'].append(message)

I got this when printing 'match':

[('User#0001', '\n')]
[]
[]
[]
[('User#0002', '\n')]
...

When I manually edit some of the lines so that they read User#0001 message, it does return the correct output.

CodePudding user response：

I would phrase your requirement using re.sub:

import re

inp = """User#0001
Hello
Whats Up


User#0002
Hi"""

output = re.sub(r'(?<!\n)\n(?=\S)', ' ', inp)
print(output)

This prints:

User#0001 Hello Whats Up


User#0002 Hi

The regex used here says to match:

(?<!\n) assert that newline does not precede
\n match a single newline
(?=\S) assert that non whitespace follows

The (?<!\n) ensures that we do not remove the newline on the line before a text block begins. The (?=\S) ensures that we do not remove the final newline in a text block.
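Applied to the file from the question, the same substitution could be run on the whole text at once rather than line by line, since the pattern has to see the newline between two lines. The following is only a minimal sketch, assuming the file fits in memory; merge_consecutive_lines is a hypothetical helper name, and chat.txt is the file name taken from the question.

```python
import re

def merge_consecutive_lines(text: str) -> str:
    """Join each run of consecutive non-blank lines with single spaces."""
    # Replace a newline only when it sits between two non-blank lines:
    # not preceded by another newline, and followed by a non-whitespace char.
    return re.sub(r'(?<!\n)\n(?=\S)', ' ', text)

sample = "User#0001\nHello\nWhats Up\n\n\nUser#0002\nHi\n"
merged = merge_consecutive_lines(sample)
print(merged)

# For a real file, read everything in one call (not line by line):
# with open('chat.txt', 'rt') as file:
#     merged = merge_consecutive_lines(file.read())
```

Reading with file.read() instead of iterating over file.readlines() is the important part: a single line never contains the following lines, so a multi-line pattern cannot match it.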

CodePudding user response：

Another solution:

import re

s = """\
User#0001
Hello
Whats Up


User#0002
Hi"""

pat = re.compile(r"^(\S+#\d+)\s*(.*?)\s*(?=^\S+#\d+|\Z)", flags=re.M | re.S)
out = [(user, messages.splitlines()) for user, messages in pat.findall(s)]
print(out)

Prints:

[('User#0001', ['Hello', 'Whats Up']), ('User#0002', ['Hi'])]

If you want to join the messages to one line:

for user, messages in out:
    print(user, " ".join(messages))

Prints:

User#0001 Hello Whats Up
User#0002 Hi
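To end up with the name/message dictionary from the question, the pairs can be collected from the same kind of match. This is a sketch under assumptions: user names look like User#0001 (non-space characters, then #, then digits), the whole chat is available as one string, and parse_chat is a hypothetical name.

```python
import re

# Same idea as the pattern above: a user line, then everything up to the
# next user line or the end of the string.
PAT = re.compile(r"^(\S+#\d+)\s*(.*?)\s*(?=^\S+#\d+|\Z)", flags=re.M | re.S)

def parse_chat(text: str) -> dict:
    """Return {'name': [...], 'message': [...]} with one entry per user block."""
    data = {'name': [], 'message': []}
    for user, messages in PAT.findall(text):
        data['name'].append(user)
        # split() breaks on any whitespace, so line breaks become single spaces
        data['message'].append(" ".join(messages.split()))
    return data

data = parse_chat("User#0001\nHello\nWhats Up\n\n\nUser#0002\nHi")
print(data)
```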

CodePudding user response：

First, I suspect you need this to keep a chat history, in which case you do not need a dictionary.
I propose a list where each element is a [user, message] pair.
Second, complexity brings difficulties and bugs. Do you really need a regex?
What's wrong with this simple solution:

t= [
"User#0001\n",
"Hello\n",
"Whats Up\n",
"\n",
"\n",
"User#0002\n",
"Hi\n",
"...\n",
]

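data = []

for line in t:
    line = line.strip()  # remove surrounding whitespace and the trailing \n
    if not line:
        continue  # skip blank separator lines
    if line.startswith("User#"):
        data.append([line, ""])  # start a new [user, message] entry
    elif data:
        # append to the current user's message, separated by one space
        data[-1][1] = f"{data[-1][1]} {line}".strip()

for user, message in data:
    print(user, message)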
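If blocks are always separated by at least one blank line, as in the sample, the grouping can also be sketched with itertools.groupby from the standard library. Note that this is a different technique from the loop above: it relies on the blank-line assumption rather than on the User# prefix, and the variable names are only illustrative.

```python
from itertools import groupby

lines = ["User#0001\n", "Hello\n", "Whats Up\n", "\n", "\n", "User#0002\n", "Hi\n"]

# Strip each line, then group consecutive lines by "is this line non-blank?".
stripped = (line.strip() for line in lines)
blocks = [
    " ".join(group)
    for has_text, group in groupby(stripped, key=bool)
    if has_text  # skip the runs of blank lines
]
print(blocks)
```

Each element of blocks is one merged line, so the user name can be split off afterwards with block.split(" ", 1) if needed.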

CodePudding user response：

For the format of the given example, if you want to keep the same number of newlines, you can use a pattern with 3 capture groups.

^([a-zA-Z]+#[0-9]+)((?:\n(?![a-zA-Z]+#[0-9]).+)*)(\n*)

The pattern matches:

^ Start of a line (with re.M)
([a-zA-Z]+#[0-9]+) Capture group 1, the user name
( Capture group 2
- (?: Non capture group
- \n Match a newline
- (?![a-zA-Z]+#[0-9]) Negative lookahead, assert that what follows is not 1+ chars a-zA-Z followed by # and a digit
- .+ Match 1+ chars (In your pattern you used ^$ to stop when there is an empty string, but you can also make sure to match 1 or more characters)
)* Close the non capture group and optionally repeat it to also allow 0 occurrences
) Close group 2
(\n*) Capture group 3, match optional trailing newlines


import re

s = """User#0001
Hello
Whats Up


User#0002
Hi

User#0003"""
pattern = r"^([a-zA-Z]+#[0-9]+)((?:\n(?![a-zA-Z]+#[0-9]).+)*)(\n*)"
result = []
for (u, m, n) in re.findall(pattern, s, re.M):
    result.append(f"{' '.join([u] + m.split())}{n}")

print("".join(result))

Output

User#0001 Hello Whats Up


User#0002 Hi

User#0003