Regex in Python - why do I get a NoneType not subscriptable?

I am really pulling my hair out on this one, trying to get regex to work in Python.

Essentially, there is a log file that I am trying to iterate over. Each line contains either an "INFO" or an "ERROR" message. I am trying to use capture groups to extract a few pieces of information: (1) whether it's an INFO or ERROR message, (2) the detailed message, (3) the log-number, and (4) the username of each log record.

Here's a snippet of my test data:

for i in temp1[:5]:
    print(i)

Output:

Jan 31 00:16:25 ubuntu.local ticky: INFO Closed ticket [#1754] (noel)
Jan 31 00:21:30 ubuntu.local ticky: ERROR The ticket was modified while updating (breee)
Jan 31 00:44:34 ubuntu.local ticky: ERROR Permission denied while closing ticket (ac)
Jan 31 01:00:50 ubuntu.local ticky: INFO Commented on ticket [#4709] (blossom)
Jan 31 01:29:16 ubuntu.local ticky: INFO Commented on ticket [#6518] (rr.robinson)

When I try to search, I get a traceback:

for i in temp1[2:3]:
  individualLines = re.search(r"ticky: (INFO|ERROR) ([\w ']*) ([\[[#0-9]*\]?]?) \(([\w .]*)\)\n$",i)

>> print(individualLines[4])

    ---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
<ipython-input-204-3afe859ffffb> in <module>
      1 for i in temp1[2:3]:
      2     individualLines = re.search(r"ticky: (INFO|ERROR) ([\w ']*) ([\[[#0-9]*\]?]?) \(([\w .]*)\)\n$",i)
      3 print(individualLines[4])

TypeError: 'NoneType' object is not subscriptable

In the first box above, I've printed an example of what each line in the log file looks like. In the second box, you can see the regex that I am trying to use. The main issue is that the 3rd capture group (i.e. the log-number) does not exist for some of the lines, and I can't get the pattern to handle that.

If I run it only on the first line, it will turn out just fine as per the subsequent excerpts. But once it iterates through to a line without the log-number, there appears to be an issue which I cannot figure out. Is this error related to the way I am declaring my regular expressions?

for i in temp1[:1]:
    individualLines = re.search(r"ticky: (INFO|ERROR) ([\w ']*) ([\[[#0-9]*\]?]?) \(([\w .]*)\)\n$",i)
    >> print(individualLines[1])
    >> print(individualLines[2])
    >> print(individualLines[3])
    >> print(individualLines[4])

    INFO
    Closed ticket
    [#1754]
    noel

Some added context: part of the code tries to keep a count of each unique type of error message, as seen in the code below. Another portion (not included here) keeps track of the unique users and the number of error or info messages that each has generated in the log file. However, the code fails to run, and I've figured out that it has to do with the regular expression, which is why I only included the regex portion when I first posted the question.

# Initialize dictionaries
errorONLY = {}

for line in temp1:
    # individualLines is None whenever the regex does not match the line
    individualLines = re.search(r"ticky: (INFO|ERROR) ([\w ']*) ([\[[#0-9]*\]?]?) \(([\w .]*)\)\n$", line)

    if individualLines[1] == "ERROR":
        # Count each distinct error message
        if individualLines[2] not in errorONLY:
            errorONLY[individualLines[2]] = 0
        errorONLY[individualLines[2]] += 1

Additional information (2): I made a lazy, temporary fix by omitting the log-number from the matched groups. I foresee another issue when trying to access a matched group for the log-number on a line that does not have a valid log-number, which I may have to handle with another for loop. Anyway, I will just describe my strategy; feel free to comment otherwise, as I am still very much learning.

I first re-wrote everything from scratch, trying to match only one group per iteration.

for i in temp1[:2]:
    individualLines= re.search(r"ticky: (INFO|ERROR) .*", i)
    print(individualLines[1])
#    print(individualLines[2])
#    print(individualLines[3])

>> INFO
>> ERROR

Which then evolved to:

for i in temp1[:2]:
    individualLines= re.search(r"ticky: (INFO|ERROR) ([\w ]*) .*", i)
    print(individualLines[1])
    print(individualLines[2])
#    print(individualLines[3])

>> INFO
>> Closed ticket
>> ERROR
>> The ticket was modified while updating

And finally:

for i in temp1[:2]:
    individualLines= re.search(r"ticky: (INFO|ERROR) ([\w ]*) .* \(([\w .]*)\)", i)
    print(individualLines[1])
    print(individualLines[2])
    print(individualLines[3])

>> INFO
>> Closed ticket
>> noel
>> ERROR
>> The ticket was modified while
>> breee

I find that this would be much more efficient when matching multiple groups, since I can immediately pinpoint which regular expression was causing the issue, as opposed to trying to figure out one by one. Just something maybe others can take note of? Also, I made some edits as per the suggestion in the comments section.
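One caveat with the last pattern shows up in the output above: the second record's message is cut off at "The ticket was modified while". The mandatory " .* " needs a space on each side, so on a line with no log-number the regex engine backtracks and hands the final word to ".*". A small sketch reproducing this; the second pattern, with a non-capturing optional group, is only one hypothetical way to avoid it:

```python
import re

line = "Jan 31 00:21:30 ubuntu.local ticky: ERROR The ticket was modified while updating (breee)"

# Pattern from the last step: " .* " forces two spaces with anything between them,
# so the last word of the message ends up inside ".*" instead of group 2.
loose = re.search(r"ticky: (INFO|ERROR) ([\w ]*) .* \(([\w .]*)\)", line)
print(loose[2])

# Hypothetical variant: the bracketed log-number (with its leading space) is optional
# and not captured, so group 2 keeps the whole message.
strict = re.search(r"ticky: (INFO|ERROR) ([\w ]*)(?: \[#\d*\])? \(([\w .]*)\)", line)
print(strict[2])
```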

Answer:

The error message means that the regular expression didn't find a match. In that case re.search returns None instead of a Match object.
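A minimal sketch of guarding against that None before indexing. The list name `lines`, the second sample line, and the simplified pattern are assumptions made for illustration only:

```python
import re

# One real line from the question plus one made-up line that cannot match.
lines = [
    "Jan 31 00:16:25 ubuntu.local ticky: INFO Closed ticket [#1754] (noel)",
    "this line does not look like a ticky record",
]

# Simplified pattern, used only to illustrate the None check.
pattern = re.compile(r"ticky: (INFO|ERROR) ")

results = []
for line in lines:
    match = pattern.search(line)
    if match is None:
        # No match: indexing None here would raise the TypeError from the question.
        results.append(None)
        continue
    results.append(match[1])

print(results)
```

Checking for None also makes it easy to log or collect the lines that failed, which helps when debugging the pattern itself.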

There are these issues in the regex:

\n$: this \n is probably not present in your data, which apparently is already split into lines without terminating \n. So this \n should just be removed from the regex.
([\[[#0-9]*\]?]?): the first square bracket seems intended to open a group or something, but that is actually the beginning of a character class. The corresponding closing ]? seems intended to make that group optional, but it makes a single character (allowed by the character class) optional. Remove that pair of brackets, and make the real capture group optional.
Related to the previous point: the literal closing bracket is made optional with \]?, but if there was an opening bracket, the closing one is not optional, so this question mark should be removed.
Related to the previous points: when this bracketed part does not occur in the line, then there will also be one less space, so you need to include a space in that optional group. Like this ( \[[#0-9]*\])?. As you probably expect the hash symbol as the first character in that expression, you could also use ( \[#\d*\])?
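The points above can be checked with a short experiment. This sketch compares only the relevant fragment of the original pattern with the corrected fragment; the test strings are taken from the question's sample lines, except `garbage`, which is made up to show that the original brackets form a character class:

```python
import re

with_number = "Closed ticket [#1754] (noel)"
without_number = "Permission denied while closing ticket (ac)"
garbage = "Closed ticket #[1]] (noel)"  # made-up input, not from the log

# Fragment of the original pattern: [\[[#0-9] is a character class matching
# any of "[", "#" or a digit; the trailing "]?" is just an optional literal "]".
original = re.compile(r"([\w ']*) ([\[[#0-9]*\]?]?) \(([\w .]*)\)")

# Corrected fragment: the whole " [#digits]" part, space included, is optional.
corrected = re.compile(r"([\w ']*)( \[#\d*\])? \(([\w .]*)\)")

print(original.search(with_number)[2])     # the bracketed number is captured
print(original.search(without_number))     # None: two spaces would be required
print(original.search(garbage)[2])         # the character class accepts this too

m = corrected.search(without_number)
print(m[1], m[2], m[3])                    # the optional group is None here
print(repr(corrected.search(with_number)[2]))  # note the leading space
```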

Here is the regex with those corrections:

 r"ticky: (INFO|ERROR) ([\w ']*)( \[#\d*\])? \(([\w .]*)\)$"
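As a rough sketch of how this pattern could be used on the sample lines from the question: the names `log_lines`, `records` and `error_counts` are assumptions, and `collections.Counter` is swapped in for the plain dictionary used in the question.

```python
import re
from collections import Counter

# Sample lines copied from the question.
log_lines = [
    "Jan 31 00:16:25 ubuntu.local ticky: INFO Closed ticket [#1754] (noel)",
    "Jan 31 00:21:30 ubuntu.local ticky: ERROR The ticket was modified while updating (breee)",
    "Jan 31 00:44:34 ubuntu.local ticky: ERROR Permission denied while closing ticket (ac)",
    "Jan 31 01:00:50 ubuntu.local ticky: INFO Commented on ticket [#4709] (blossom)",
    "Jan 31 01:29:16 ubuntu.local ticky: INFO Commented on ticket [#6518] (rr.robinson)",
]

pattern = re.compile(r"ticky: (INFO|ERROR) ([\w ']*)( \[#\d*\])? \(([\w .]*)\)$")

records = []
error_counts = Counter()
for line in log_lines:
    match = pattern.search(line)
    if match is None:
        continue  # skip lines the pattern does not recognise
    level, message, ticket, user = match.groups()
    # Group 3 carries a leading space and is None when there is no log-number.
    ticket = ticket.strip() if ticket else None
    records.append((level, message, ticket, user))
    if level == "ERROR":
        error_counts[message] += 1

for record in records:
    print(record)
print(dict(error_counts))
```

Because `$` also matches just before a trailing newline, the same pattern works whether or not the lines still end in "\n".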