Python - How do I Remove String from List if Substring is a match (regex)?

Time:09-07

Apologies if this has been asked before, but I'm wrecking my head over it and I've googled for hours trying to find a similar solution.

I have a list of URLs in which the last segment, between the final two '/' characters, is six digits, e.g. www.test.com/nothere/432432/

I'm trying to write the code so that if a URL matches that substring, in that position in the string, it doesn't get added to the list. The URLs I'm looking at are all of the same format, hence the use of the regex in the example.

I've tried various combinations of `if re.match`, `if re.search`, etc., and nothing I can put together seems to work.

This is my latest attempt:

import re

urls = ['www.test.com/nothere/432432/', 'www.test.com/nothere/685985/', 'www.test.com/nothere/655985/', 'www.test.com/nothere/112113/']

regex = re.compile(r'(/\d{6}/)')
filtered = [i for i in urls if not regex.match(i)]
print(filtered)

My understanding is that if regex.match(i) does not match, the item gets added; otherwise it doesn't. But that is clearly not the case, and it adds them all regardless :/

Any and all help is appreciated.

Thanks!

EDIT

Another version I've tried, which does nothing:

            regex = re.match(r'(/\d{6}/)', Adlink) in allAdLinks
            if regex:
                allAdLinks.remove(Adlink)
                print(allAdLinks)
            else:
                print("try again")
                continue

CodePudding user response:

IIUC, you want to remove all entries from your list whose final 6 digits have already been seen in another URL in the list. You can do that by iterating over the list and keeping a set of the digit groups seen so far: keep a URL only if its last 6 digits are not yet in the set, and add them to the set when you keep it. Note that URLs which do not end in 6 digits are skipped entirely:

import re

urls = [
    'www.test.com/nothere/432432/',
    'www.test.com/nothere/685985/',
    'test.com/1604350/169408',
    'www.test.com/nothere/655985/',
    'www.test.com/nothere/112113/',
    'test.com/1602436/169408',
    'www.test.com/another/685985/'
]
pages = set()  # six-digit groups already seen
result = []
for url in urls:
    # Capture the six digits at the end of the URL; the trailing slash is optional
    # and is kept out of the group so '685985/' and '685985' compare equal.
    num = re.search(r'(\d{6})/?$', url)
    if num is not None and num.group(1) not in pages:
        result.append(url)
        pages.add(num.group(1))

print(result)

Output:

[
 'www.test.com/nothere/432432/',
 'www.test.com/nothere/685985/',
 'test.com/1604350/169408',
 'www.test.com/nothere/655985/',
 'www.test.com/nothere/112113/'
]
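
As for why the attempt in the question kept every URL: `re.match` only tries to match at the very start of the string, and these URLs start with `www`, not with `/` followed by six digits. `re.search` scans the whole string instead. Here is a minimal sketch of the difference, using a shortened copy of the question's list (the variable names are just for illustration):

```python
import re

urls = [
    'www.test.com/nothere/432432/',
    'www.test.com/nothere/685985/',
]

pattern = re.compile(r'/\d{6}/')

# match() is anchored at position 0, so it never matches these URLs
# and the "not" filter keeps all of them.
kept_with_match = [u for u in urls if not pattern.match(u)]

# search() looks anywhere in the string, so it finds '/432432/' and
# '/685985/' and the "not" filter drops both URLs.
kept_with_search = [u for u in urls if not pattern.search(u)]

print(kept_with_match)
print(kept_with_search)
```

So if the goal is simply to drop every URL containing a six-digit segment, swapping `match` for `search` in the list comprehension should be enough; the set-based loop above is only needed if the goal is to drop duplicates.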