Regex combining 2 complete lines of call data

I am preparing a Python script that simply grabs call records from a .txt log file based on a user-supplied phone number.

I need help writing a regex that captures each call log so it can be "split" into a list. Each log starts with one of the letters (L,N,S,E). At the beginning of each log file there is a timestamp line that starts with a "T". Example: T 02/25 00:00

Records that start with (N,S,E) have a second line that needs to be included. (L) records do not, but they do have a blank line below them, which could be included.

Essentially, what I need is every record that starts with (L,N,S,E) and the line below it.

See the call log sample below:

T                        02/25 00:00 
L 065 00 24329   12313   244.0.15.55 252.9.11.90 02/25 08:05 00:00:44 0000 0000 
                                                                
N 066 00 23442   T000185 262.1.00.09 02/25 08:05 00:00:02 A 16630
&       0000    0000                                      
S 067 00 00984   T000134             02/25 08:06 00:00:02 A 61445
&       0000    0000                                      
S 068 00 T000002 29536               02/25 08:05 00:00:36 
&       0000    0000   1234567890XXXXXX                   
E 069 00 T000002 T000185             02/25 08:06 00:00:00 
&       0000    0000   1234567890XXXXXX

For example:

L 065 00 24329   12313   244.0.15.55 252.9.11.90 02/25 08:05 00:00:44 0000 0000

would be one match and

N 066 00 23442   T000185 262.1.00.09 02/25 08:05 00:00:02 A 16630
&       0000    0000

would be another,

S 068 00 T000002 29536               02/25 08:05 00:00:36 
&       0000    0000   1234567890XXXXXX

would be another, and so on...

My regex knowledge is limited, but this is what I have come up with so far.

N(.*)\n(.*)

This selects the first and second line of each record that starts with "N", but it puts them in separate groups. Any direction is appreciated.

CodePudding user response：

To match any of those 4 characters, you can use a character class [LNSE] followed by a space and the rest of the line.

Then in the same capture group use an optional part to match a newline and the rest of the line if it does not start with any of those characters.

This allows for also matching only a single line.

^[LNSE] (.*(?:\n(?![LNSE] ).*)?)

^ Start of the line (with the re.MULTILINE flag)
[LNSE] Match any of the listed characters, followed by a space
( Capture group 1
- .* Match the rest of the line
- (?: Non-capture group
  - \n(?![LNSE] ).* Match a newline and the rest of the next line, provided that line does not start with one of the listed characters and a space (negative lookahead)
- )? Close the non-capture group and make it optional
) Close capture group 1
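
As a quick check of the breakdown above, here is a minimal sketch that runs the pattern on a shortened sample. The sample string and the names sample, pattern and records are made up for illustration; only the regex itself comes from this answer.

```python
import re

# Shortened, made-up sample in the same layout as the log in the question.
sample = (
    "T                        02/25 00:00\n"
    "L 065 00 24329   12313\n"
    "\n"
    "N 066 00 23442   T000185\n"
    "&       0000    0000\n"
    "S 067 00 00984   T000134\n"
    "&       0000    0000\n"
)

# re.MULTILINE makes ^ match at the start of every line, not only at the
# start of the string.
pattern = re.compile(r"^[LNSE] (.*(?:\n(?![LNSE] ).*)?)", re.MULTILINE)

# findall returns the contents of capture group 1 for each match.
records = pattern.findall(sample)
for record in records:
    print(repr(record))
```

The "T" header is skipped because it does not start with one of the four letters, and the L record picks up the blank line below it as its optional second line.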


If you want to match zero or more following lines, change the ? to *.
If you want the whole record including the leading letter, you can omit the capture group and just use the full match.
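
A small sketch of the variant without a capture group, assuming a made-up two-record sample (sample and whole are illustrative names). When the pattern contains no capture group, re.findall returns the full text of each match.

```python
import re

# Made-up sample with two records of two lines each.
sample = (
    "N 066 00 23442   T000185\n"
    "&       0000    0000\n"
    "E 069 00 T000002 T000185\n"
    "&       0000    0000   1234567890XXXXXX\n"
)

# Only a non-capture group is left, so findall yields whole matches,
# leading letter included.
whole = re.findall(r"^[LNSE] .*(?:\n(?![LNSE] ).*)?", sample, re.MULTILINE)
for record in whole:
    print(repr(record))
```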

For example, getting the capture group 1 values with re.findall (using the * variant) while reading the whole file:

import re

with open('file.log', 'r') as file:
    regex = r"^[LNSE] (.*(?:\n(?![LNSE] ).*)*)"
    print(re.findall(regex, file.read(), re.MULTILINE))
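
Since the goal in the question is to pick out records for a given phone number, one possible follow-up is to filter the matches afterwards. This is only a sketch: find_records is a hypothetical helper, the sample is made up, and a plain substring test is an assumption that may also hit unrelated fields containing the same digits.

```python
import re

# Whole-record variant of the pattern (no capture group, * for any
# number of continuation lines).
RECORD = re.compile(r"^[LNSE] .*(?:\n(?![LNSE] ).*)*", re.MULTILINE)

def find_records(text, number):
    """Return every record whose text contains `number` as a substring."""
    return [m.group(0) for m in RECORD.finditer(text) if number in m.group(0)]

# Made-up sample in the same layout as the log in the question.
sample = (
    "T                        02/25 00:00\n"
    "N 066 00 23442   T000185 02/25 08:05\n"
    "&       0000    0000\n"
    "S 068 00 T000002 29536   02/25 08:05\n"
    "&       0000    0000   1234567890XXXXXX\n"
)

hits = find_records(sample, "1234567890")
print(hits)
```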

CodePudding user response：

Here is another way to do so without regex:

data = []
with open('file.log', 'r') as f:
    next(f)
    for line in f:
        if line[0] in ("L", "N", "S", "E"):
            data.append(line)
        else:
            data[-1] += line
print(data)

next(f) is used to skip the first line (T 02/25 00:00).
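
If the header cannot be relied on to be exactly one line, a slightly more defensive sketch of the same idea is shown below. group_records is a hypothetical helper name, and the sample list is made up; it ignores anything that appears before the first record instead of calling next(f).

```python
def group_records(lines):
    """Group each record line with the continuation lines that follow it."""
    records = []
    for line in lines:
        if line[:2] in ("L ", "N ", "S ", "E "):
            records.append(line)   # a new record starts here
        elif records:
            records[-1] += line    # continuation of the latest record
        # anything before the first record (e.g. the "T" header) is ignored
    return records

# Made-up sample, shaped like the lines read from the log file.
sample = [
    "T                        02/25 00:00\n",
    "L 065 00 24329   12313\n",
    "\n",
    "N 066 00 23442   T000185\n",
    "&       0000    0000\n",
]

grouped = group_records(sample)
print(grouped)
```

An open file object can be passed directly, for example group_records(f) inside the with block.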

Assuming that all logs are two lines long, one can also use a list comprehension:

with open('file.log', 'r') as f:
    next(f)
    lines = f.readlines()
    data = [''.join(lines[i:i + 2]) for i in range(0, len(lines), 2)]
print(data)

Each line is stored in a list lines (except the first line, which has been skipped). The list comprehension then joins elements of the list two by two.

The same approach works if the log contents are stored in a string variable text:

lines = text.splitlines(keepends=True)[1:]  # [1:] skips the first line; keepends=True keeps the newlines
data = [''.join(lines[i:i + 2]) for i in range(0, len(lines), 2)]
print(data)

Output:

[
    'L 065 00 24329   12313   244.0.15.55 252.9.11.90 02/25 08:05 00:00:44 0000 0000\n                                                                \n', 
    'N 066 00 23442   T000185 262.1.00.09 02/25 08:05 00:00:02 A 16630\n&       0000    0000                                      \n', 
    'S 067 00 00984   T000134             02/25 08:06 00:00:02 A 61445\n&       0000    0000                                      \n', 
    'S 068 00 T000002 29536               02/25 08:05 00:00:36 \n&       0000    0000   1234567890XXXXXX                   \n', 
    'E 069 00 T000002 T000185             02/25 08:06 00:00:00 \n&       0000    0000   1234567890XXXXXX'
]