Python regex for excluding strings including a specific word

Time:03-17

I am trying to use a regex to exclude disambiguation pages when scraping Wikipedia. I looked around for tips on negative lookahead, but I cannot make it work. I think I am missing something fundamental about how it is used, and right now I am clueless. Could someone please point me in the right direction? (I don't want to use if 'disambiguation' in y; I am trying to understand how negative lookahead works.) Thank you. Here is the code:

list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
  '/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
 '/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
  '/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
 '/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']

def findString(string):
  regex1 = r'(/wiki/)(_\($)(!?disambiguation)'
  for x in list_links:
      y =  re.findall(regex1, x)
      print(y)

findString(list_links)

CodePudding user response:

You can use either of the regexes below, depending on your needs. I have also changed the function definition to follow PEP 8.

import re

def remove_disambiguation_link(list_of_links):
    # Raw string so that \( and \) reach the regex engine as escaped parentheses
    regex = r"\(disambiguation\)"
    # Stricter alternative that also requires the /wiki/ prefix:
    # regex = r"/wiki/.*\(disambiguation\)"
    # Equivalent list comprehension:
    # return [link for link in list_of_links if not re.search(regex, link)]
    return list(filter(lambda link: not re.search(regex, link), list_of_links))

list_links = remove_disambiguation_link(list_links)
print(list_links)
[
    "/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg",
    "/wiki/Taiwanese_tea",
    "/wiki/Tung-ting_tea",
    "/wiki/Nantou_County",
    "/wiki/Taiwan",
    "/wiki/Dongfang_Meiren",
    "/wiki/Alishan_National_Scenic_Area",
    "/wiki/Chiayi_County",
    "/wiki/Dayuling",
    "/wiki/Baozhong_tea",
    "/wiki/Pinglin_Township",
]
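
On the lookahead itself: in the question's pattern, `(!?disambiguation)` is an ordinary group that matches an optional `!` followed by the literal word, whereas a negative lookahead is written `(?!...)`. Here is a minimal sketch, assuming the goal is to keep only links that do not contain `(disambiguation)`. The names `NOT_DISAMBIGUATION` and `keep_regular_links` are made up for illustration.

```python
import re

# (?!...) is a zero-width assertion: it succeeds only if the enclosed
# pattern does NOT match at the current position.
# Placed right after ^, the lookahead's .* lets it scan the whole string.
NOT_DISAMBIGUATION = re.compile(r"^(?!.*\(disambiguation\)).*$")

def keep_regular_links(links):
    """Return the links that do not contain '(disambiguation)'."""
    return [link for link in links if NOT_DISAMBIGUATION.match(link)]

sample = ["/wiki/Oolong_(disambiguation)", "/wiki/Taiwanese_tea", "/wiki/Taiwan"]
print(keep_regular_links(sample))  # ['/wiki/Taiwanese_tea', '/wiki/Taiwan']
```

Here the regex itself rejects the unwanted strings, so the filter keeps every link that matches rather than every link that does not.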

CodePudding user response:

In your case, the simplest solution is not to use a regex for this at all. Just do something like:

import re

list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
  '/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
  '/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
  '/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
  '/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']

def findString(links):
  # Pattern carried over from the question, minus the lookahead attempt
  regex1 = r'(/wiki/)(_\($)'
  for x in links:
      if 'disambiguation' in x:
          continue  # skip disambiguation pages
      y = re.findall(regex1, x)
      print(y)

findString(list_links)

CodePudding user response:

You do not need a regex. You can iterate through list_links and check whether the string you are looking for, 'disambiguation', is in each item.

list_links = ['/wiki/Oolong_(disambiguation)', '/wiki/File:Mi_Lan_Xiang_Oolong_Tea_cropped.jpg',
  '/wiki/Taiwanese_tea', '/wiki/Tung-ting_tea',
 '/wiki/Nantou_County', '/wiki/Taiwan', '/wiki/Dongfang_Meiren',
  '/wiki/Alishan_National_Scenic_Area', '/wiki/Chiayi_County',
 '/wiki/Dayuling', '/wiki/Baozhong_tea', '/wiki/Pinglin_Township']

to_find = 'disambiguation'

def findString(list_links):
    # Rebuild the list in place instead of popping inside the loop;
    # popping while iterating can skip the item after each removal
    list_links[:] = [link for link in list_links if to_find not in link]
    # print new list without 'disambiguation' items
    print(list_links)

findString(list_links)
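
One caveat when filtering in place: removing items from a list inside a `for` loop over that same list can skip the element right after each removal. Here is a small sketch with two adjacent disambiguation links. The names `links_a` and `links_b` and the `Tea_(disambiguation)` entry are made up for illustration.

```python
links_a = ["/wiki/Oolong_(disambiguation)", "/wiki/Tea_(disambiguation)", "/wiki/Taiwan"]
links_b = list(links_a)

# Removing during iteration: after the first removal the remaining items
# shift left, so the loop never visits the second disambiguation link.
for link in links_a:
    if "disambiguation" in link:
        links_a.remove(link)

# Rebuilding the list visits every element exactly once.
links_b[:] = [link for link in links_b if "disambiguation" not in link]

print(links_a)  # ['/wiki/Tea_(disambiguation)', '/wiki/Taiwan']
print(links_b)  # ['/wiki/Taiwan']
```

With a single match, as in the list above, both approaches give the same result. The difference only shows up when matches sit next to each other.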
