I am trying to perform BFS on Wikipedia pages. I believe I am implementing this correctly and about as efficiently as a single-threaded approach allows, but it is taking quite a long time to find a connection between two articles. Here is my implementation:
import time
from collections import deque

import requests
from bs4 import BeautifulSoup

marked = set()
queue = deque()
count = 0

def get_wikipedia_page(wiki_link):
    url = BASE_URL + wiki_link
    time.sleep(0.1)  # small delay between requests
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    return soup
def backtrace(parent, start, end):
    # Follow the parent pointers back from end until we reach start.
    ret = []
    while True:
        if end in parent:
            ret.append(parent[end])
            end = parent[end]
        if end == start:
            break
    ret.reverse()
    return ret
def bfs(first_wiki_link, second_wiki_link):
    print('Searching starting from ' + first_wiki_link + ' looking for ' + second_wiki_link + '...')
    parent = {}
    queue.append(first_wiki_link)
    start_time = time.time()
    #current_parent = first_wiki_link
    while queue:
        link = queue.pop()
        current_parent = link
        link_list = list(filter(lambda c: c.get('href') is not None and c.get('href').startswith('/wiki/') and c.get('href') != '/wiki/Main_Page' and ':' not in c.get('href'), get_wikipedia_page(link).findAll('a')))
        for link in link_list:
            href = link.get('href')
            if href not in marked:
                parent[href] = current_parent
                marked.add(href)
                queue.append(href)
                if href == second_wiki_link:
                    ret = backtrace(parent, first_wiki_link, second_wiki_link)
                    print("connection found")
                    end_time = time.time()
                    print("found a connection in: " + str(end_time - start_time) + " seconds.")
                    print("the path is " + str(len(ret)) + " pages long")
                    print(ret)
                    return True
It sometimes takes a few minutes to find a match. Is this expected given how big Wikipedia is, or am I doing something suboptimal here? Or could it be that Beautiful Soup is running too slowly?
CodePudding user response:
You are effectively doing a DFS, not a BFS. A DFS in a huge graph like Wikipedia can take a very long time to get around to the close neighbors of a node, and will in all likelihood take much longer to find a connection, because chances are that the two pages are somehow related and therefore close to each other.
This part of the code:
while queue:
    link = queue.pop() # <---- look at this
    current_parent = link
    link_list = list(filter...
    for link in link_list:
This makes it a DFS, because you are using the deque like a stack. A BFS needs a FIFO (first in, first out) data structure, while you are doing LIFO (last in, first out), which is what DFS does.
This means that the first neighbor of the first page only gets looked at after you have looked at just about every other page you have discovered along the way.
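You can see the difference on the deque itself; this is just a toy illustration with placeholder values:

from collections import deque

q = deque(['a', 'b', 'c'])   # 'a' was enqueued first, 'c' last

q.pop()       # -> 'c'  (LIFO: newest item first, i.e. a stack / DFS order)
q.popleft()   # -> 'a'  (FIFO: oldest item first, i.e. a queue / BFS order)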
To solve it:
while queue:
    link = queue.popleft() # <---- look at this
    current_parent = link
    link_list = list(filter...
    for link in link_list:
This will look at the closest neighbors first and expand outward breadth-first, like you intend it to do.
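Putting it together, here is a minimal sketch of what the corrected loop could look like; it reuses the names from your question (get_wikipedia_page, backtrace), keeps the bookkeeping local instead of in globals, and is meant as an illustration rather than a drop-in replacement:

from collections import deque

def bfs(first_wiki_link, second_wiki_link):
    parent = {}
    marked = {first_wiki_link}
    queue = deque([first_wiki_link])
    while queue:
        current_parent = queue.popleft()           # FIFO: closest pages first
        soup = get_wikipedia_page(current_parent)  # helper from the question
        for a in soup.find_all('a', href=True):
            href = a['href']
            # same link filter as in the question
            if not href.startswith('/wiki/') or href == '/wiki/Main_Page' or ':' in href:
                continue
            if href not in marked:
                marked.add(href)
                parent[href] = current_parent
                if href == second_wiki_link:
                    return backtrace(parent, first_wiki_link, second_wiki_link)
                queue.append(href)
    return None  # no connection found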