Filtering through a list of results in Python

Brand new to learning Python, but very familiar with Google Sheets -- I'm essentially trying to mimic the "filter" function but cannot find anything on it.

The goal of my script is to pull the social media tags of NBA players (from the URLs).

I have it working to pull all links, but I want to clean up my code so that there's basically an if statement saying:

If a result contains "https://www.facebook.com", "https://www.twitter.com" or "https://www.instagram.com", that would be the only info pulled.

Right now, it looks more like this:

[image: code results]

It isn't the end of the world, because I can paste into a Google Sheet and clean, but it would be really nice to learn something like this.

from bs4 import BeautifulSoup
import requests


def get_profile(url):

    profiles = []

    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    container = soup.find('div', attrs={'class': 'main-container'})

    for profile in container.find_all('a'):

        profiles.append(profile.get('href'))

    for profile in profiles:
        print(profile)


get_profile('https://basketball.realgm.com/player/Carmelo-Anthony/Summary/452')
get_profile('https://basketball.realgm.com/player/LeBron-James/Summary/250')

CodePudding user response：

You can use the in keyword to search for substrings. In your case, you could check each profile like so:

if "https://www.facebook.com" in profile:
    print(profile)

The in operator returns True if the substring occurs anywhere in the string, and False otherwise.
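
To mimic a spreadsheet-style filter over the whole list, the same in check can go inside a list comprehension. This is a minimal sketch, assuming a hand-written sample list: the hrefs and the filter_links helper below are made up for illustration and are not taken from the page.

```python
# Hypothetical sample data standing in for the scraped hrefs.
hrefs = [
    "https://www.facebook.com/ExamplePlayer",
    "/player/Example-Player/Summary/1",
    "https://www.instagram.com/exampleplayer",
    None,  # an <a> tag without an href yields None from .get('href')
]


def filter_links(links, substring):
    """Keep only the links that contain the given substring."""
    # "link and ..." skips None values, which would raise TypeError with "in".
    return [link for link in links if link and substring in link]


facebook_links = filter_links(hrefs, "https://www.facebook.com")
print(facebook_links)
```

The comprehension builds a new list and leaves the original untouched, which is roughly how a filter formula behaves in a sheet.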

CodePudding user response：

You can keep a list of the URLs you care about and check whether any of them appears in the href you're looking at, like so:

from bs4 import BeautifulSoup
import requests


def get_profile(url):

    profiles = []
    urls_to_keep = ['https://www.facebook.com', 'https://www.twitter.com', 'https://www.instagram.com']

    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    container = soup.find('div', attrs={'class': 'main-container'})

    for profile in container.find_all('a'):
        href = profile.get('href')

        if any(word in str(href) for word in urls_to_keep):
            profiles.append(href)

    for profile in profiles:
        print(profile)


get_profile('https://basketball.realgm.com/player/Carmelo-Anthony/Summary/452')
get_profile('https://basketball.realgm.com/player/LeBron-James/Summary/250')
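
If the links are expected to begin with one of these URLs rather than merely contain them, str.startswith accepts a tuple of prefixes, so the explicit any loop isn't needed. This is a sketch under that assumption; the keep_social helper and the sample hrefs are invented for illustration.

```python
# Prefixes to keep. A tuple is used because str.startswith accepts a tuple of strings.
URLS_TO_KEEP = (
    "https://www.facebook.com",
    "https://www.twitter.com",
    "https://www.instagram.com",
)


def keep_social(hrefs):
    """Return the hrefs that start with one of URLS_TO_KEEP."""
    # str(href) turns a missing href (None) into "None", which matches no prefix.
    return [href for href in hrefs if str(href).startswith(URLS_TO_KEEP)]


# Hypothetical sample data; the second entry contains a Facebook URL
# but does not start with one, so a prefix check drops it.
sample = [
    "https://www.twitter.com/example",
    "https://example.com/share?u=https://www.facebook.com/example",
    None,
    "https://www.instagram.com/example",
]

print(keep_social(sample))
```

A prefix check is stricter than a substring check: a share link that only embeds a social media URL in its query string is filtered out.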

CodePudding user response：

You can check for several values at once. The built-in any function is used for this: it returns True as soon as one of the checks succeeds.

from bs4 import BeautifulSoup
import requests


def get_profile(url):
    profiles = []
    social_networks = ["https://www.facebook.com", "https://www.twitter.com", "https://www.instagram.com"]
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    container = soup.find('div', attrs={'class': 'main-container'})
    for profile in container.find_all('a'):
        href = profile.get('href')
        # Skip anchors without an href, then keep only social media links
        if href and any(link in href for link in social_networks):
            profiles.append(href)
    return profiles


print(get_profile('https://basketball.realgm.com/player/Carmelo-Anthony/Summary/452'))
print(get_profile('https://basketball.realgm.com/player/LeBron-James/Summary/250'))

OUTPUT:

['https://www.facebook.com/CarmeloAnthony', 'https://www.instagram.com/carmeloanthony']
['https://www.facebook.com/LeBron', 'https://www.instagram.com/kingjames']
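
Neither list above contains a Twitter link. One possible explanation, which is an assumption and not something the output confirms, is that the page links to twitter.com without the www. prefix, in which case the substring "https://www.twitter.com" would never match. Here is a sketch that compares hostnames instead, using urllib.parse.urlparse from the standard library; the is_social helper, the SOCIAL_DOMAINS set, and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

# Bare domains to keep; a leading "www." is stripped before comparing.
SOCIAL_DOMAINS = {"facebook.com", "twitter.com", "instagram.com"}


def is_social(href):
    """Return True if href points at one of SOCIAL_DOMAINS, with or without 'www.'."""
    if not href:
        return False
    # hostname is None for relative links such as "/player/..."
    host = urlparse(href).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host in SOCIAL_DOMAINS


# Hypothetical sample data.
sample = [
    "https://twitter.com/example",
    "https://www.facebook.com/example",
    "/player/Example-Player/Summary/1",
]

print([href for href in sample if is_social(href)])
```

Comparing the parsed hostname also ignores social media URLs that only appear inside another site's query string.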