Regex pattern in Python

Time:09-26

I'm looking for a pattern in Regex in Python to do the following:

For a text formatted like:

2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!

I would like to return:

[(2021-01-01,10:00:05,Surname1 Name1,Comment,Blablabla\nBlabla),
(2021-01-01,23:00:05,Surname2 SurnameBis Name2,WorkNotes,What?\nI don't know?),
(2021-01-02,03:00:05,Surname1 Name1,Comment,Blablabla!)]

I managed to get a quite close result with:

import re

text2 = """2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
Can you be clear?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!"""
LangTag = re.findall(r"(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})\s-\s(.*?)\((.*)\)\n(.*)(?:\n|$)", text2)
print(LangTag)

But I'm completely stuck on capturing the entire text I need: only the first line of each message is returned.

One solution would be to remove the \n from the initial text, but I would like to avoid that because I need them later on. Any ideas?

CodePudding user response:

You can parse your data like this.

import re

data = """2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!"""

def parse(data):
    text = ""
    match = None
    messages = []
    for line in data.split("\n"):
        m = re.match(r"^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) - (.*?) \((.*?)\)$", line)
        if m:
            # A new header line: store the previous message first.
            if match:
                msg = (match.group(1), match.group(2), match.group(3), match.group(4), text)
                messages.append(msg)
            # Start a fresh message body for the new header.
            text = ""
            match = m
        else:
            # A body line: append it to the current message text.
            text += line + "\n"
    # Store the last message, if any header line was seen at all.
    if match:
        msg = (match.group(1), match.group(2), match.group(3), match.group(4), text)
        messages.append(msg)
    return messages

for message in parse(data):
    print(message)

This outputs

('2021-01-01', '10:00:05', 'Surname1 Name1', 'Comment', 'Blablabla\nBlabla\n')
('2021-01-01', '23:00:05', 'Surname2 SurnameBis Name2', 'WorkNotes', "What?\nI don't know?\n")
('2021-01-02', '03:00:05', 'Surname1 Name1', 'Comment', 'Blablabla!\n')

CodePudding user response:

Try the below.
The key point here is the check if line[0].isdigit(), which identifies the start of a new section.

Important note: this solution removes the \n characters when it merges the lines of each section.

data = '''2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!'''


holder = []
lines = data.split('\n')
temp = []
for line in lines:
    if line[0].isdigit():
        if temp:
            holder.append(' '.join(temp))
            temp = []
    temp.append(line)
holder.append(' '.join(temp))

for line in holder:
    print(line)

Output:

2021-01-01 10:00:05 - Surname1 Name1 (Comment) Blablabla Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes) What? I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment) Blablabla!
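
Since the question asks to keep the \n characters, here is a possible variant of the same idea, as a rough sketch only. The function name group_sections is made up for this example. It keeps the header line separate and joins the body lines with '\n' instead of a space. It still assumes that only header lines start with a digit, so a message line that begins with a number would be mistaken for a new section.

```python
data = '''2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!'''


def group_sections(data):
    """Group lines into (header, body) pairs, keeping newlines inside the body."""
    sections = []
    for line in data.split('\n'):
        # line[:1] rather than line[0], so an empty line does not raise IndexError.
        # "not sections" makes the very first line open a section in any case.
        if line[:1].isdigit() or not sections:
            sections.append([line])
        else:
            sections[-1].append(line)
    # First element of each section is the header, the rest is the body.
    return [(lines[0], '\n'.join(lines[1:])) for lines in sections]


for header, body in group_sections(data):
    print(repr(header), repr(body))
```

The header strings can then be parsed separately with a regex if the individual fields are needed.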

CodePudding user response:

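My solution is almost the same as yours, but with group 5 changed from .* to \D*. Because \D matches any non-digit character, including newlines, the group captures everything up to the next digit, which here is the start of the next timestamp. Note that it will cut a message short if the message text itself contains a digit.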

import re
text = """2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!"""
result = re.findall(r"(\d{4}-\d{2}-\d{2})\s(\d{2}:\d{2}:\d{2})\s-\s(.*?)\((.*)\)\n(\D*)(?:\n|$)", text)
print(result)

Output:

[('2021-01-01', '10:00:05', 'Surname1 Name1 ', 'Comment', 'Blablabla\nBlabla'),
 ('2021-01-01', '23:00:05', 'Surname2 SurnameBis Name2 ', 'WorkNotes', "What?\nI don't know?"), 
 ('2021-01-02', '03:00:05', 'Surname1 Name1 ', 'Comment', 'Blablabla!')]
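
If the messages may contain digits, \D* is no longer enough. One possible alternative, offered only as a sketch: capture the body lazily and stop at a lookahead for the next timestamp header or the end of the text. The names HEADER and pattern are my own, and the second message in the sample has been altered to contain digits. The sketch assumes every header is followed by at least one body line.

```python
import re

text = """2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
Call me at 10:30?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!"""

# Header line: date, time, name and type. "." does not cross newlines here.
HEADER = r"(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) - (.*?) \((.*?)\)"

pattern = re.compile(
    r"^" + HEADER + r"\n"
    # Body: any characters, newlines included, as few as possible...
    r"([\s\S]*?)"
    # ...up to the next header line or the very end of the text.
    r"(?=\n\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2} - |\Z)",
    re.MULTILINE,
)

result = pattern.findall(text)
for row in result:
    print(row)
```

Because the lookahead does not consume the next header, findall can pick it up as the start of the following match. Requiring a space before the opening parenthesis also avoids the trailing space in the name field.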

CodePudding user response:

You could approach the problem by first solving it for a single block and then repeating that solution until the end of the data. With this divide-and-conquer strategy the code stays simple to understand, still solves the bigger problem, and is easy to extend.

import re

data = '''2021-01-01 10:00:05 - Surname1 Name1 (Comment)
Blablabla
Blabla
2021-01-01 23:00:05 - Surname2 SurnameBis Name2 (WorkNotes)
What?
I don't know?
2021-01-02 03:00:05 - Surname1 Name1 (Comment)
Blablabla!'''.splitlines()

first_line_patt = re.compile(r'^(\d{4}-\d{2}-\d{2}) (\d{2}:\d{2}:\d{2}) - (.*)(?= \() \((.*)\)$')


def parse_block(lines, idx):
    # parse the meta line
    res = first_line_patt.findall(lines[idx])

    # get the message
    message = []
    while idx < len(lines)-1:
        line = lines[idx + 1]
        idx += 1

        # check if next line is a meta line
        if first_line_patt.match(line):
            break

        # if not, it is a message line
        message.append(line)

    res.append('\n'.join(message))
    return res, idx


idx = 0
while True:
    res, idx = parse_block(data, idx)
    if not res[0]:
        break
    print(res)

This produces the following result:

[('2021-01-01', '10:00:05', 'Surname1 Name1', 'Comment'), 'Blablabla\nBlabla']
[('2021-01-01', '23:00:05', 'Surname2 SurnameBis Name2', 'WorkNotes'), "What?\nI don't know?"]
[('2021-01-02', '03:00:05', 'Surname1 Name1', 'Comment'), 'Blablabla!']
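
Each result here is a two-element list (a tuple of header fields plus the message), while the question asks for flat five-field tuples. A small sketch of how the results could be flattened, assuming they have been collected into a list; the names blocks and flat are made up for the example, and the values are copied from the output above:

```python
# Results in the shape shown above: [header_fields_tuple, message_text]
blocks = [
    [('2021-01-01', '10:00:05', 'Surname1 Name1', 'Comment'), 'Blablabla\nBlabla'],
    [('2021-01-01', '23:00:05', 'Surname2 SurnameBis Name2', 'WorkNotes'), "What?\nI don't know?"],
    [('2021-01-02', '03:00:05', 'Surname1 Name1', 'Comment'), 'Blablabla!'],
]

# Unpack the four header fields and append the message as a fifth field.
flat = [(*meta, message) for meta, message in blocks]

for row in flat:
    print(row)
```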