Why is this not correct? Regex results stored in a dictionary, value not found even though it should be there

I have code for an exercise on regexes and dictionaries. I have to create a list of dictionaries. I did, and it looks correct; however, when testing one item that should be in the list, the assignment's check says that it is not, even though everything I see looks correct. The time value of that item is reported as missing from the dictionary when it is there. Below are my code, the code that assesses it, and the error message.

import re

def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
    
    # YOUR CODE HERE
    logdatalines = logdata.splitlines()
    logs = []
    for line in logdatalines:
        dictionary = to_dictionary(line)
        logs.append(dictionary)
    return logs


def to_dictionary(line):
    dictionary = {}
    dictionary["host"] = re.findall("[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+", line)
    dictionary["user_name"] = re.findall("-* -  [a-z]*[0-9]*", line)
    dictionary["time"] = re.findall("[0-9]+\/[A-Z][a-z]*\/[0-9]+\:[0-9]+\:[0-9]+\:[0-9]+ -[0-9]+", line)
    dictionary["request"] = re.findall("[A-Z]+ .*", line)
    return dictionary

And then we have this assert code to test the code:

assert len(logs()) == 979

one_item={'host': '146.204.224.152',
  'user_name': 'feest6811',
  'time': '21/Jun/2019:15:45:24 -0700',
  'request': 'POST /incentivize HTTP/1.1'}
assert one_item in logs(), "Sorry, this item should be in the log results, check your formating"

The text file looks something like this, but with more data (each line ends after the "GET (or other method) /..." part):

146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622
197.109.77.178 - kertzmann3129 [21/Jun/2019:15:45:25 -0700] "DELETE /virtual/solutions/target/web services HTTP/2.0" 203 26554
156.127.178.177 - okuneva5222 [21/Jun/2019:15:45:27 -0700] "DELETE /interactive/transparent/niches/revolutionize HTTP/1.1" 416 14701
100.32.205.59 - ortiz8891 [21/Jun/2019:15:45:28 -0700] "PATCH /architectures HTTP/1.0" 204 6048
168.95.156.240 - stark2413 [21/Jun/2019:15:45:31 -0700] "GET /engage HTTP/2.0" 201 9645
71.172.239.195 - dooley1853 [21/Jun/2019:15:45:32 -0700] "PUT /cutting-edge HTTP/2.0" 406 24498
180.95.121.94 - mohr6893 [21/Jun/2019:15:45:34 -0700] "PATCH /extensible/reinvent HTTP/1.1" 201 27330
144.23.247.108 - auer7552 [21/Jun/2019:15:45:35 -0700] "POST /extensible/infrastructures/one-to-one/enterprise HTTP/1.1" 100 22921
2.179.103.97 - lind8584 [21/Jun/2019:15:45:36 -0700] "POST /grow/front-end/e-commerce/robust HTTP/2.0" 304 14641
241.114.184.133 - tromp8355 [21/Jun/2019:15:45:37 -0700] "GET /redefine/orchestrate HTTP/1.0" 204 29059
224.188.38.4 - keebler1423 [21/Jun/2019:15:45:40 -0700] "PUT /orchestrate/out-of-the-box/unleash/syndicate HTTP/1.1" 404 28211
94.11.36.112 - klein8508 [21/Jun/2019:15:45:41 -0700] "POST /enhance/solutions/bricks-and-clicks HTTP/1.1" 404 24768
126.196.238.197 - gusikowski9864 [21/Jun/2019:15:45:45 -0700] "DELETE /rich/reinvent HTTP/2.0" 405 7894
103.247.168.212 - medhurst2732 [21/Jun/2019:15:45:49 -0700] "HEAD /scale/global/leverage HTTP/1.0" 203 15844
57.86.153.68 - dubuque8645 [21/Jun/2019:15:45:50 -0700] "POST /innovative/roi/robust/systems HTTP/1.1" 406 29046
231.220.8.214 - luettgen1860 [21/Jun/2019:15:45:52 -0700] "HEAD /systems/sexy HTTP/1.1" 201 2578

CodePudding user response：

You are using re.findall() which returns a list of matched items. Assuming there is only one match per line it would return:

['21/Jun/2019:15:45:24 -0700']

This clearly is not the same as:

'21/Jun/2019:15:45:24 -0700'

... which is what you are trying to match.
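
To see the mismatch concretely, here is a minimal sketch using the first sample line from the question. The variable names are only for illustration, and the pattern is the question's time regex without the unnecessary backslashes.

```python
import re

line = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'
time_pattern = r"[0-9]+/[A-Z][a-z]*/[0-9]+:[0-9]+:[0-9]+:[0-9]+ -[0-9]+"
expected = '21/Jun/2019:15:45:24 -0700'

# findall always returns a list, even when there is exactly one match.
found = re.findall(time_pattern, line)

# A dictionary holding the list is not equal to one holding the string,
# which is why the "one_item in logs()" check fails.
print({"time": found} == {"time": expected})     # False
print({"time": found[0]} == {"time": expected})  # True
```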

If you are going to use method findall, then extract the first element from the returned list:

    dictionary["time"] = re.findall("[0-9]+\/[A-Z][a-z]*\/[0-9]+\:[0-9]+\:[0-9]+\:[0-9]+ -[0-9]+", line)[0]

And do that for every dictionary item. By the way, the above regex has unnecessary escape characters, i.e. \. Just use:

    dictionary["time"] = re.findall("[0-9]+/[A-Z][a-z]*/[0-9]+:[0-9]+:[0-9]+:[0-9]+ -[0-9]+", line)[0]

But it would be better to just use the search method:

    dictionary["time"] = re.search("[0-9]+/[A-Z][a-z]*/[0-9]+:[0-9]+:[0-9]+:[0-9]+ -[0-9]+", line).group(0)

A simpler regex would be:


    dictionary["time"] = re.search(r'\[([^\]]+)\]', line).group(1)

This just matches whatever is between open and closed brackets ([]) as capture group 1.
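
One thing to keep in mind is that re.search returns None when nothing matches, so calling .group() directly would raise an AttributeError on a malformed line. A small sketch with a guard follows; extract_time is a hypothetical helper name, not part of the assignment.

```python
import re

def extract_time(line):
    """Return the text between the first pair of square brackets, or None."""
    m = re.search(r"\[([^\]]+)\]", line)
    return m.group(1) if m else None

line = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'
print(extract_time(line))
print(extract_time("a line without brackets"))
```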

Update

I would use a single regular expression with the match method to capture all 4 values into capture groups 1 through 4:

((?:[0-9]+\.){3}[0-9]+) - -*([a-z]+[0-9]+)* \[([^\]]+)\] "([^"]+)"

1. ( - start of capture group 1.
2. (?:[0-9]+\.){3} - matches one or more digits followed by '.' three times (no capture group).
3. [0-9]+ - matches one or more digits.
4. ) - end of capture group 1.
5. - - matches ' - '.
6. -* - matches 0 or more '-'.
7. ([a-z]+[0-9]+)* - matches name in capture group 2, which may not be present.
8. \[ - matches ' ['.
9. ([^\]]+) - matches one or more non-']' characters.
10. \] " - matches '] "'.
11. ([^"]+) - matches one or more non-'"' characters.
12. " - matches '"'.

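The same pieces can also be written with re.VERBOSE so that each part carries its own comment. This is only a sketch of the pattern described above: the named groups and the LOG_PATTERN name are my additions, chosen so that groupdict() yields the keys the assignment expects.

```python
import re

LOG_PATTERN = re.compile(r"""
    (?P<host>(?:[0-9]+\.){3}[0-9]+)    # four runs of digits separated by '.'
    \ -\                               # literal ' - ' (spaces must be escaped in verbose mode)
    -*                                 # zero or more '-'
    (?P<user_name>[a-z]+[0-9]+)*       # user name, possibly absent
    \ \[                               # literal ' ['
    (?P<time>[^\]]+)                   # everything up to the closing ']'
    \]\ "                              # literal '] "'
    (?P<request>[^"]+)                 # everything up to the closing quote
    "                                  # closing quote
""", re.VERBOSE)

line = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'
m = LOG_PATTERN.match(line)
record = m.groupdict() if m else {}
print(record)
```

If the user name is absent, the user_name entry in groupdict() would be None rather than a string, so it may need a default depending on what the grader expects.
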
The code:

import re

line = '46.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'

def to_dictionary(line):
    dictionary = {}
    m = re.match(r'((?:[0-9]+\.){3}[0-9]+) - [- ]*([a-z]+[0-9]+)* \[([^\]]+)\] "([^"]+)"', line)
    if m:
        dictionary["host"] = m.group(1)
        dictionary["user_name"] = m.group(2)
        dictionary["time"] = m.group(3)
        dictionary["request"] = m.group(4)
    else:
        print('Could not match:', line)
    return dictionary

print(to_dictionary(line))

Prints:

{'host': '46.204.224.152', 'user_name': 'feest6811', 'time': '21/Jun/2019:15:45:24 -0700', 'request': 'POST /incentivize HTTP/1.1'}

CodePudding user response：

Final code

import re

def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()

    # Parse every line of the log into its own dictionary.
    logs = []
    for line in logdata.splitlines():
        logs.append(to_dictionary(line))
    return logs


def to_dictionary(line):
    dictionary = {}
    m = re.match(r'((?:[0-9]+\.){3}[0-9]+) - ([- ]*[a-z]*[0-9]*) \[([^\]]+)\] "(.*)"', line)
    assert m
    dictionary["host"] = m.group(1)
    dictionary["user_name"] = m.group(2)
    dictionary["time"] = m.group(3)
    dictionary["request"] = m.group(4)
    return dictionary
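
For a quick check without the assets/logdata.txt file, the same regex can be run against a couple of the sample lines held in a string. This is a rough sketch: SAMPLE, KEYS, and parse_lines are hypothetical names, and unlike the code above it skips lines that do not match instead of asserting.

```python
import re

# Two sample lines copied from the question, standing in for the log file.
SAMPLE = '''146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622
197.109.77.178 - kertzmann3129 [21/Jun/2019:15:45:25 -0700] "DELETE /virtual/solutions/target/web services HTTP/2.0" 203 26554'''

PATTERN = re.compile(r'((?:[0-9]+\.){3}[0-9]+) - ([- ]*[a-z]*[0-9]*) \[([^\]]+)\] "(.*)"')
KEYS = ("host", "user_name", "time", "request")

def parse_lines(text):
    """Turn each matching log line into a dictionary of plain strings."""
    records = []
    for line in text.splitlines():
        m = PATTERN.match(line)
        if m:
            # m.groups() returns the four captured strings in order.
            records.append(dict(zip(KEYS, m.groups())))
    return records

one_item = {'host': '146.204.224.152',
            'user_name': 'feest6811',
            'time': '21/Jun/2019:15:45:24 -0700',
            'request': 'POST /incentivize HTTP/1.1'}

print(one_item in parse_lines(SAMPLE))  # True
```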