Filtering a list of URLs based on a condition

Time:10-07

I want to extract a list from another list which is a list of URLs. For example,

| index    | URL                                                |
| -------- | -------------------------------------------------- |
| 1        | 'http://www.exmaples.com/some.html/'               |
| 2        | 'https://www.exmaples.com/some.jpg/'               |
| 3        | 'mailto://[email protected]'                       |
| 4        | 'mailto://[email protected]'                      |
| 5        | 'http://www.exmaples.com/menu1/'                   |
| 6        | 'http://www.exmaples.com/menu2/'                   |
| 7        | 'http://www.exmaples.com/menu3/'                   |
| 8        | 'http://www.exmaples.com/menu4/'                   |
| 9        | 'http://www.exmaples.com/menu5/submenu1.html'      |
| 10       | 'http://www.exmaples.com/menu6/submenu3.pdf'       |
| 11       | 'http://www.exmaples.com/menu6/submenu4/list.png'  |

I want to remove the URLs that contain any of the following substrings:

avoid_list = ['mailto', '@', '.jpg', '.png', '.pdf']

I've tried a list comprehension like the one below, but it still returns some elements that contain the avoided substrings:

[url for url in urls for avoid in avoid_list if avoid not in url]

My question is whether there is a Python library for handling URLs and filtering them based on conditions like these.

I appreciate your consideration in advance:)

CodePudding user response:

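You could join the avoid list into a single string with | as the delimiter (which works as "OR" in a regular expression) and use str.contains to check whether each row contains any element of the list. Because str.contains treats the pattern as a regular expression by default, escape each entry with re.escape so that the . in '.jpg' matches a literal dot rather than any character. This assumes the URLs are in a pandas DataFrame df with a URL column.

import re
out = df[~df['URL'].str.contains('|'.join(map(re.escape, avoid_list)))]
print(out)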
   index                                            URL
0      1           'http://www.exmaples.com/some.html/'
4      5               'http://www.exmaples.com/menu1/'
5      6               'http://www.exmaples.com/menu2/'
6      7               'http://www.exmaples.com/menu3/'
7      8               'http://www.exmaples.com/menu4/'
8      9  'http://www.exmaples.com/menu5/submenu1.html'
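
If the URLs are in a plain Python list rather than a DataFrame, the comprehension from the question can be fixed without pandas. The original version loops over every (url, avoid) pair, so a URL is emitted once for each avoid string it does not contain, which is why unwanted entries still appear. Below is a minimal sketch that uses any() instead, plus a variant built on the standard library's urllib.parse. The shortened urls list, the mailto address, and the helper name is_web_page are illustrative assumptions, not part of the original post.

```python
from urllib.parse import urlparse

# Shortened sample of the table above; the mailto address is a made-up placeholder.
urls = [
    "http://www.exmaples.com/some.html/",
    "https://www.exmaples.com/some.jpg/",
    "mailto://user@exmaples.com",
    "http://www.exmaples.com/menu1/",
    "http://www.exmaples.com/menu5/submenu1.html",
    "http://www.exmaples.com/menu6/submenu3.pdf",
    "http://www.exmaples.com/menu6/submenu4/list.png",
]
avoid_list = ["mailto", "@", ".jpg", ".png", ".pdf"]

# Keep a URL only if none of the avoid strings occurs anywhere in it.
kept = [url for url in urls if not any(avoid in url for avoid in avoid_list)]


def is_web_page(url):
    """Hypothetical helper: accept http(s) URLs whose path does not end in a file type we skip."""
    parts = urlparse(url)
    path = parts.path.rstrip("/").lower()
    return parts.scheme in ("http", "https") and not path.endswith((".jpg", ".png", ".pdf"))


# Same filter expressed on parsed URL components instead of raw substrings.
kept_by_parts = [url for url in urls if is_web_page(url)]

print(kept)
```

The substring version is closest to the original intent. The urlparse version is stricter about where a match may occur (scheme and end of path), which may or may not be what you want for your data.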