I want to build a new list by filtering an existing list of URLs. For example:
| index | URL |
| -------- | -------------------------------------------------- |
| 1 | 'http://www.exmaples.com/some.html/' |
| 2 | 'https://www.exmaples.com/some.jpg/ ' |
| 3 | 'mailto://[email protected]' |
| 4 | 'mailto://[email protected]' |
| 5 | 'http://www.exmaples.com/menu1/' |
| 6 | 'http://www.exmaples.com/menu2/' |
| 7 | 'http://www.exmaples.com/menu3/' |
| 8 | 'http://www.exmaples.com/menu4/' |
| 9 | 'http://www.exmaples.com/menu5/submenu1.html' |
| 10 | 'http://www.exmaples.com/menu6/submenu3.pdf' |
| 11 | 'http://www.exmaples.com/menu6/submenu4/list.png' |
I want to remove the URLs that contain any of the following:
avoid_list = ['mailto', '@', '.jpg', '.png', '.pdf']
For example, I've used a list comprehension like the one below, but it sometimes returns elements that contain items from the avoid list.
[url for url in urls for avoid in avoid_list if avoid not in url]
My question is whether there is a Python library for handling URLs and filtering them based on conditions like these.
Thanks in advance for your help :)
CodePudding user response:
You could join the avoid list into a single string with | as the delimiter (which acts as "OR" in a regular expression) and use str.contains to check whether each row contains any element of the list.
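import re  # re.escape keeps the '.' in '.jpg' etc. literal instead of a regex wildcard
out = df[~df['URL'].str.contains('|'.join(map(re.escape, avoid_list)))]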
print(out)
index URL
0 1 'http://www.exmaples.com/some.html/'
4 5 'http://www.exmaples.com/menu1/'
5 6 'http://www.exmaples.com/menu2/'
6 7 'http://www.exmaples.com/menu3/'
7 8 'http://www.exmaples.com/menu4/'
8 9 'http://www.exmaples.com/menu5/submenu1.html'
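If the URLs live in a plain Python list rather than a DataFrame, something along these lines might work. The comprehension in the question loops over every (url, avoid) pair, so a URL is emitted once for each avoid item it does not contain, which is why unwanted URLs slip through and others appear several times. Wrapping the check in any() tests all avoid items for each URL instead. This is only a minimal sketch: the `urls` sample is a subset of the question's data and the `filter_urls` helper name is made up for illustration.

```python
avoid_list = ['mailto', '@', '.jpg', '.png', '.pdf']

# Sample data for illustration only (a subset of the URLs in the question).
urls = [
    'http://www.exmaples.com/some.html/',
    'https://www.exmaples.com/some.jpg/ ',
    'http://www.exmaples.com/menu1/',
    'http://www.exmaples.com/menu6/submenu3.pdf',
    'http://www.exmaples.com/menu6/submenu4/list.png',
]


def filter_urls(urls, avoid_list):
    """Keep only the URLs that contain none of the avoid substrings.

    Hypothetical helper, not part of any library.
    """
    return [url for url in urls
            if not any(avoid in url for avoid in avoid_list)]


filtered = filter_urls(urls, avoid_list)
print(filtered)
# ['http://www.exmaples.com/some.html/', 'http://www.exmaples.com/menu1/']
```

As for a library, the standard library's urllib.parse.urlparse splits a URL into scheme, host, and path, which may help if a condition should only apply to one part of the URL (for example, checking the scheme for mailto or the path for a file extension).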