I have a text file of messages in the form:
with open(code_path) as f:
    contents = f.readlines()
print(contents)
['22/05/2022, 21.58 - Name: message1 \n',
'22/05/2022, 22.07 – Name2: message2\n',
'message2 continues\n',
'22/05/2022, 22.09 – Name: message3\n']
Currently each line of the file is a separate string, so some long messages are split across two or more strings. I would like a list in which every message is a single string, with continuation lines joined onto the message they belong to (each message starts with a date).
This is what I want:
['22/05/2022, 21.58 - Name: message1 \n',
'22/05/2022, 22.07 – Name2: message2 message2 continues\n',
'22/05/2022, 22.09 – Name: message3\n']
Is there a way to do this?
I have found the strings starting with a date with:
import re
dates = [re.findall("^[0-3][0-9]/[0-3][0-9]/20[1-2][1-9]", i) for i in contents]
But I don't know how to continue.
CodePudding user response:
A basic approach would be to use a kind of cache: go through the lines,
- if the line starts with a date, append a new item to the cache
- if it doesn't, append to the most recent item.
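import re

messages = []
for line in contents:
    # A line that starts with a date such as "22/05/2022, " begins a new message
    if re.match(r'\d{2}/\d{2}/\d{4},\s+', line):
        messages.append([line])
    else:
        # Anything else continues the most recent message
        messages[-1].append(line)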
# messages
[['22/05/2022, 21.58 - Name: message1 \n'],
['22/05/2022, 22.07 – Name2: message2\n', 'message2 continues\n'],
['22/05/2022, 22.09 – Name: message3\n']]
You could then join them (e.g., [''.join(m) for m in messages]). Alternatively, it's also possible to build the strings directly, but if you want to distinguish between primary and following lines at some point, the list of lists is more useful.
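Building the strings directly could look roughly like the sketch below. The helper name join_messages and the choice to join continuation lines with a single space (to match the output asked for in the question) are assumptions for illustration, not part of the original answer. A first line without a date is kept as its own item so that there is always something to append to.

```python
import re

# Assumed date prefix, e.g. "22/05/2022, " (same idea as the pattern above).
DATE_PREFIX = re.compile(r"\d{2}/\d{2}/\d{4},\s+")


def join_messages(lines):
    """Merge continuation lines into the preceding dated message."""
    messages = []
    for line in lines:
        if DATE_PREFIX.match(line) or not messages:
            # A new message starts here (or there is nothing to append to yet).
            messages.append(line)
        else:
            # Continuation: swap the previous trailing newline for a space.
            messages[-1] = messages[-1].rstrip("\n") + " " + line
    return messages


contents = [
    "22/05/2022, 21.58 - Name: message1 \n",
    "22/05/2022, 22.07 – Name2: message2\n",
    "message2 continues\n",
    "22/05/2022, 22.09 – Name: message3\n",
]
print(join_messages(contents))
```

With the sample lines from the question, the second item becomes '22/05/2022, 22.07 – Name2: message2 message2 continues\n' and the other two items are unchanged.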
CodePudding user response:
You might also read the whole file at once, and then match each line starting with a date-like pattern followed by all lines that do not start with a date-like pattern.
With a more specific date-like pattern:
import re

with open("file") as f:
    pattern = r"^(?:0[1-9]|[12][0-9]|3[01])/(?:0[1-9]|1[012])/\d{4},.*(?:\n(?!(?:0[1-9]|[12][0-9]|3[01])/(?:0[1-9]|1[012])/\d{4},).*)*"
    print(re.findall(pattern, f.read(), re.M))
Output
[
'22/05/2022, 21.58 - Name: message1 \n',
'22/05/2022, 22.07 – Name2: message2\n\nmessage2 continues\n',
'22/05/2022, 22.09 – Name: message3\n'
]
With a less precise pattern, but a bit shorter:
^\d{2}/\d{2}/\d{4},.*(?:\n(?!\d{2}/\d{2}/\d{4},).*)*
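Explanation
- ^ Start of line anchor (with re.M, ^ matches at the start of every line)
- \d{2}/\d{2}/\d{4}, Match a date-like pattern followed by a comma
- .* Match the rest of the line
- (?: Non-capture group to repeat as a whole part
- \n Match a newline
- (?!\d{2}/\d{2}/\d{4},) Negative lookahead, assert that a date-like pattern is not directly to the right
- .* Match the rest of the line
- )* Close the non-capture group and optionally repeat it to match all following lines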
Example
import re

with open("file") as f:
    pattern = r"^\d{2}/\d{2}/\d{4},.*(?:\n(?!\d{2}/\d{2}/\d{4},).*)*"
    print(re.findall(pattern, f.read(), re.M))
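As a rough sketch, the shorter pattern can also be written with re.VERBOSE so that each part carries its own comment, and the matches can then be flattened into single-line strings like the ones asked for in the question. The sample text is held in a string here instead of being read from a file, and the names MESSAGE, text and messages are assumptions for illustration only.

```python
import re

# The shorter pattern, spelled out with re.VERBOSE so each part is commented.
MESSAGE = re.compile(
    r"""
    ^\d{2}/\d{2}/\d{4},          # date-like prefix followed by a comma
    .*                           # rest of the first line
    (?:                          # group repeated once per continuation line
        \n                       # newline before the continuation
        (?!\d{2}/\d{2}/\d{4},)   # next line must not start a new dated message
        .*                       # rest of the continuation line
    )*
    """,
    re.MULTILINE | re.VERBOSE,
)

# Sample data from the question, standing in for f.read().
text = (
    "22/05/2022, 21.58 - Name: message1 \n"
    "22/05/2022, 22.07 – Name2: message2\n"
    "message2 continues\n"
    "22/05/2022, 22.09 – Name: message3\n"
)

# Join the lines of each match with a space, skip empty lines,
# and end every message with exactly one newline.
messages = [
    " ".join(part for part in match.splitlines() if part) + "\n"
    for match in MESSAGE.findall(text)
]
print(messages)
```

Note that a match keeps the newlines between its lines, so the join step is what turns 'message2\nmessage2 continues' into 'message2 message2 continues\n'.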