I have code for an exercise on regexes and dictionaries. I have to create a list of dictionaries. I did, and it looks correct; however, when the assignment tests for one item that should be in the list, it says the item is not there, even though everything I see looks right. The time value of that item seems to be reported as missing from the dictionary when it is there. I will show my code, then the code that assesses it, and then the error message.
import re
def logs():
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
    # YOUR CODE HERE
    logdatalines = logdata.splitlines()
    logs = []
    for line in logdatalines:
        dictionary = to_dictionary(line)
        logs.append(dictionary)
    return logs
def to_dictionary(line):
    dictionary = {}
    dictionary["host"] = re.findall("[0-9]+\.[0-9]+\.[0-9]+\.[0-9]+", line)
    dictionary["user_name"] = re.findall("-* - [a-z]*[0-9]*", line)
    dictionary["time"] = re.findall("[0-9]+\/[A-Z][a-z]*\/[0-9]+\:[0-9]+\:[0-9]+\:[0-9]+ -[0-9]+", line)
    dictionary["request"] = re.findall("[A-Z]+ .*", line)
    return dictionary
And then we have this assert code to test the code:
assert len(logs()) == 979
one_item={'host': '146.204.224.152',
'user_name': 'feest6811',
'time': '21/Jun/2019:15:45:24 -0700',
'request': 'POST /incentivize HTTP/1.1'}
assert one_item in logs(), "Sorry, this item should be in the log results, check your formating"
The text file looks something like this, but with more data (each line ends after the "GET (or other) /..." part):
146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622
197.109.77.178 - kertzmann3129 [21/Jun/2019:15:45:25 -0700] "DELETE /virtual/solutions/target/web services HTTP/2.0" 203 26554
156.127.178.177 - okuneva5222 [21/Jun/2019:15:45:27 -0700] "DELETE /interactive/transparent/niches/revolutionize HTTP/1.1" 416 14701
100.32.205.59 - ortiz8891 [21/Jun/2019:15:45:28 -0700] "PATCH /architectures HTTP/1.0" 204 6048
168.95.156.240 - stark2413 [21/Jun/2019:15:45:31 -0700] "GET /engage HTTP/2.0" 201 9645
71.172.239.195 - dooley1853 [21/Jun/2019:15:45:32 -0700] "PUT /cutting-edge HTTP/2.0" 406 24498
180.95.121.94 - mohr6893 [21/Jun/2019:15:45:34 -0700] "PATCH /extensible/reinvent HTTP/1.1" 201 27330
144.23.247.108 - auer7552 [21/Jun/2019:15:45:35 -0700] "POST /extensible/infrastructures/one-to-one/enterprise HTTP/1.1" 100 22921
2.179.103.97 - lind8584 [21/Jun/2019:15:45:36 -0700] "POST /grow/front-end/e-commerce/robust HTTP/2.0" 304 14641
241.114.184.133 - tromp8355 [21/Jun/2019:15:45:37 -0700] "GET /redefine/orchestrate HTTP/1.0" 204 29059
224.188.38.4 - keebler1423 [21/Jun/2019:15:45:40 -0700] "PUT /orchestrate/out-of-the-box/unleash/syndicate HTTP/1.1" 404 28211
94.11.36.112 - klein8508 [21/Jun/2019:15:45:41 -0700] "POST /enhance/solutions/bricks-and-clicks HTTP/1.1" 404 24768
126.196.238.197 - gusikowski9864 [21/Jun/2019:15:45:45 -0700] "DELETE /rich/reinvent HTTP/2.0" 405 7894
103.247.168.212 - medhurst2732 [21/Jun/2019:15:45:49 -0700] "HEAD /scale/global/leverage HTTP/1.0" 203 15844
57.86.153.68 - dubuque8645 [21/Jun/2019:15:45:50 -0700] "POST /innovative/roi/robust/systems HTTP/1.1" 406 29046
231.220.8.214 - luettgen1860 [21/Jun/2019:15:45:52 -0700] "HEAD /systems/sexy HTTP/1.1" 201 2578
CodePudding user response:
You are using re.findall(), which returns a list of matched items. Assuming there is only one match per line, it would return:
['21/Jun/2019:15:45:24 -0700']
This clearly is not the same as:
'21/Jun/2019:15:45:24 -0700'
... which is what you are trying to match.
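The mismatch is easy to reproduce on a single line from the question's sample data. This is only a minimal sketch; the variable names found and expected are made up for the illustration:

```python
import re

line = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'

# findall always returns a list, even when there is exactly one match
found = re.findall(r"[0-9]+/[A-Z][a-z]*/[0-9]+:[0-9]+:[0-9]+:[0-9]+ -[0-9]+", line)
expected = "21/Jun/2019:15:45:24 -0700"

print(found)                 # ['21/Jun/2019:15:45:24 -0700']
print(found == expected)     # False: a list never equals a string
print(found[0] == expected)  # True
```

Because the test compares whole dictionaries, a single list-valued field is enough to make the `one_item in logs()` check fail.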
If you are going to use the findall method, then extract the first element from the returned list:
dictionary["time"] = re.findall("[0-9]+\/[A-Z][a-z]*\/[0-9]+\:[0-9]+\:[0-9]+\:[0-9]+ -[0-9]+", line)[0]
And do that for every dictionary item. By the way, the above regex has unnecessary escape characters, i.e. the \ before / and :. Just use:
dictionary["time"] = re.findall("[0-9]+/[A-Z][a-z]*/[0-9]+:[0-9]+:[0-9]+:[0-9]+ -[0-9]+", line)[0]
But it would be better to just use the search method:
dictionary["time"] = re.search("[0-9]+/[A-Z][a-z]*/[0-9]+:[0-9]+:[0-9]+:[0-9]+ -[0-9]+", line).group(0)
A simpler regex would be:
dictionary["time"] = re.search(r'\[([^\]]+)\]', line).group(1)
This just matches whatever is between the opening and closing brackets ([]) as capture group 1.
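One caveat with search: it returns None when nothing matches, so calling .group() directly on the result raises AttributeError for a malformed line. A hedged sketch of a guard follows; extract_time is a hypothetical helper name, not something the assignment provides:

```python
import re

def extract_time(line):
    """Return the text between the first pair of square brackets, or None."""
    m = re.search(r"\[([^\]]+)\]", line)
    # m is None when the line has no bracketed section
    return m.group(1) if m else None

good = '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'
bad = "a line without a timestamp"

print(extract_time(good))  # 21/Jun/2019:15:45:24 -0700
print(extract_time(bad))   # None
```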
Update
I would use a single regular expression with the match method to capture all 4 values into capture groups 1 through 4:
((?:[0-9]+\.){3}[0-9]+) - -*([a-z]+[0-9]+)* \[([^\]]+)\] "([^"]+)"
1. ( - start of capture group 1.
2. (?:[0-9]+\.){3} - matches one or more digits followed by '.' three times (no capture group).
3. [0-9]+ - matches one or more digits.
4. ) - end of capture group 1.
5. ' - ' (space, hyphen, space) - matches ' - '.
6. -* - matches 0 or more '-'.
7. ([a-z]+[0-9]+)* - matches name in capture group 2, which may not be present.
8. ' \[' (space, then \[) - matches ' ['.
9. ([^\]]+) - matches one or more non-']' characters (capture group 3).
10. \] " - matches '] "'.
11. ([^"]+) - matches one or more non-'"' characters (capture group 4).
12. " - matches '"'.
The code:
import re
line = '46.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'
def to_dictionary(line):
    dictionary = {}
    m = re.match(r'((?:[0-9]+\.){3}[0-9]+) - [- ]*([a-z]+[0-9]+)* \[([^\]]+)\] "([^"]+)"', line)
    if m:
        dictionary["host"] = m.group(1)
        dictionary["user_name"] = m.group(2)
        dictionary["time"] = m.group(3)
        dictionary["request"] = m.group(4)
    else:
        print('Could not match:', line)
    return dictionary
print(to_dictionary(line))
Prints:
{'host': '46.204.224.152', 'user_name': 'feest6811', 'time': '21/Jun/2019:15:45:24 -0700', 'request': 'POST /incentivize HTTP/1.1'}
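As a possible variation, the same pattern can use named groups so that the match object builds the dictionary itself via groupdict(). This is a sketch rather than what the assignment requires; LOG_PATTERN is a name chosen for the illustration, and returning an empty dictionary for a non-matching line is an assumption:

```python
import re

# Same pattern as above, with each capture group given a name.
LOG_PATTERN = re.compile(
    r'(?P<host>(?:[0-9]+\.){3}[0-9]+) - [- ]*(?P<user_name>[a-z]+[0-9]+)* '
    r'\[(?P<time>[^\]]+)\] "(?P<request>[^"]+)"'
)

def to_dictionary(line):
    m = LOG_PATTERN.match(line)
    # groupdict() maps each group name to its matched text
    # (None for a group that did not take part in the match)
    return m.groupdict() if m else {}

line = '46.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622'
print(to_dictionary(line))
```

The group names double as the dictionary keys, so the key names and the pattern cannot drift apart.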
CodePudding user response:
Final code
import re
def logs():
    # file handling as in the question's code
    with open("assets/logdata.txt", "r") as file:
        logdata = file.read()
    logdatalines = logdata.splitlines()
    logs = []
    for line in logdatalines:
        dictionary = to_dictionary(line)
        logs.append(dictionary)
    return logs
    #return dictionary
def to_dictionary(line):
    dictionary = {}
    m = re.match(r'((?:[0-9]+\.){3}[0-9]+) - ([- ]*[a-z]*[0-9]*) \[([^\]]+)\] "(.*)"', line)
    # fail loudly on any line the pattern does not cover
    assert m
    dictionary["host"] = m.group(1)
    dictionary["user_name"] = m.group(2)
    dictionary["time"] = m.group(3)
    dictionary["request"] = m.group(4)
    return dictionary
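The final pattern can be checked without the assignment file by running it over a couple of the sample lines from the question. This is a rough sketch: the in-memory sample list stands in for assets/logdata.txt, and the dictionary is built in a single literal purely for brevity.

```python
import re

def to_dictionary(line):
    m = re.match(r'((?:[0-9]+\.){3}[0-9]+) - ([- ]*[a-z]*[0-9]*) \[([^\]]+)\] "(.*)"', line)
    assert m, line  # show the offending line if the pattern fails
    return {
        "host": m.group(1),
        "user_name": m.group(2),
        "time": m.group(3),
        "request": m.group(4),
    }

# Two lines copied from the sample data in the question
sample = [
    '146.204.224.152 - feest6811 [21/Jun/2019:15:45:24 -0700] "POST /incentivize HTTP/1.1" 302 4622',
    '168.95.156.240 - stark2413 [21/Jun/2019:15:45:31 -0700] "GET /engage HTTP/2.0" 201 9645',
]
results = [to_dictionary(entry) for entry in sample]

one_item = {'host': '146.204.224.152',
            'user_name': 'feest6811',
            'time': '21/Jun/2019:15:45:24 -0700',
            'request': 'POST /incentivize HTTP/1.1'}
print(one_item in results)  # True
```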