I am trying to use a regex to exclude disambiguation pages when scraping Wikipedia. I looked around for tips on using a negative lookahead, but I cannot make it work. I think I am missing something fundamental about how it is used, and right now I am clueless. Could someone please point me in the right direction? (I don't want to use if 'disambiguation' in y; I am trying to understand how the negative lookahead works.) Thank you. Here is the code:
list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
'/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
'/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
'/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
'/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']
import re
def findString(string):
    regex1 = r'(/wiki/)(_\($)(!?disambiguation)'
    for x in list_links:
        y = re.findall(regex1, x)
        print(y)
findString(list_links)
CodePudding user response:
You can use either of the regexes below, depending on your needs. I have also changed the function definition to follow PEP 8.
import re
def remove_disambiguation_link(list_of_links):
    # Raw strings keep \( and \) intact for the regex engine.
    # The pattern matches disambiguation links; `not re.search(...)` then drops them.
    regex = r"(.*)\((disambiguation)\)"
    # regex = r"(/wiki/)(.*)\((disambiguation)\)"
    # return [links for links in list_of_links if not re.search(regex, links)]
    return list(filter(lambda link: not re.search(regex, link), list_of_links))
list_links = remove_disambiguation_link(list_links)
print(list_links)
[
"/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg",
"/wiki/Taiwanese_tea",
"/wiki/Tung-ting_tea",
"/wiki/Nantou_County",
"/wiki/Taiwan",
"/wiki/Dongfang_Meiren",
"/wiki/Alishan_National_Scenic_Area",
"/wiki/Chiayi_County",
"/wiki/Dayuling",
"/wiki/Baozhong_tea",
"/wiki/Pinglin_Township",
]
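Since the question is specifically about negative lookahead, here is a minimal sketch of that approach. The syntax is (?!...), not (!?...): the latter is an ordinary capturing group that matches an optional literal "!" followed by the text. The names pattern and kept, and the shortened list, are only for illustration.

```python
import re

list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/Taiwanese_tea', '/wiki/Taiwan']

# (?!...) is a negative lookahead: it succeeds only if the enclosed pattern
# does NOT match at the current position, and it consumes no characters.
# Placed right after ^, it rejects any link that contains "(disambiguation)"
# anywhere in the string.
pattern = re.compile(r'^(?!.*\(disambiguation\))/wiki/.+$')

# Keep only the links that the pattern matches.
kept = [link for link in list_links if pattern.match(link)]
print(kept)
```

Here the lookahead does the excluding, so the filter keeps the links that match instead of negating the result with not.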
CodePudding user response:
In your case, the simplest solution is not to use a regex for the filtering at all. Just do something like:
list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
'/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
'/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
'/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
'/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']
import re
def findString(string):
    regex1 = r'(/wiki/)(_\($)'
    for x in string:
        if 'disambiguation' in x:
            continue  # skip disambiguation links
        y = re.findall(regex1, x)
        print(y)
findString(list_links)
CodePudding user response:
You do not need to use a regex. You can iterate through list_links and check whether the string you are looking for, 'disambiguation', is in each item.
list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
'/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
'/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
'/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
'/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']
to_find = 'disambiguation'
def findString(list_links):
    # iterate over a copy: removing from a list while looping over it skips items
    for link in list(list_links):
        if to_find in link:
            # remove match from list
            list_links.remove(link)
    # print new list without 'disambiguation' items
    print(list_links)
findString(list_links)
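One caveat with this kind of loop: popping items from a list while iterating over that same list can skip elements, which only shows up when two matches sit next to each other. A small sketch with made-up links (the names links, mutated and filtered are illustrative):

```python
links = ['/wiki/A_(disambiguation)', '/wiki/B_(disambiguation)', '/wiki/Taiwan']

# Popping while iterating over the same list: after the first pop, the
# remaining items shift left, so the loop never visits the second match.
mutated = list(links)
for link in mutated:
    if 'disambiguation' in link:
        mutated.pop(mutated.index(link))

# Building a new list avoids the problem.
filtered = [link for link in links if 'disambiguation' not in link]

print(mutated)   # the second disambiguation link is still there
print(filtered)  # only the non-disambiguation link remains
```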