I'm brand new to Python but very familiar with Google Sheets. I'm essentially trying to mimic the FILTER function, but I cannot find anything on it.
The goal of my script is to pull the social media tags of NBA players (from the URLs).
I have it pulling all links, but I want to clean up my code so that an if statement keeps a link only if it contains "https://www.facebook.com", "https://www.twitter.com", or "https://www.instagram.com".
Right now, it prints every link it finds.
It isn't the end of the world, because I can paste the results into a Google Sheet and clean them up there, but it would be really nice to learn how to do this in Python.
from bs4 import BeautifulSoup
import requests

def get_profile(url):
    profiles = []
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    # attrs is a dict mapping an attribute name to the value to match
    container = soup.find('div', attrs={'class': 'main-container'})
    # Collect the href of every link inside the container
    for profile in container.find_all('a'):
        profiles.append(profile.get('href'))
    for profile in profiles:
        print(profile)

get_profile('https://basketball.realgm.com/player/Carmelo-Anthony/Summary/452')
get_profile('https://basketball.realgm.com/player/LeBron-James/Summary/250')
CodePudding user response:
You can use Python's in operator to search for substrings. In your case, you could check each profile like so:
if "https://www.facebook.com" in profile:
    print(profile)
The in operator returns True if it finds the substring.
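To tie this back to the FILTER function in Google Sheets, here is a minimal sketch of the same substring check applied to a whole list with a list comprehension. The links list is made-up sample data standing in for the scraped hrefs, and the None entry is an assumption about what a link without an href would look like.

```python
# Hypothetical sample data standing in for the scraped hrefs.
links = [
    "https://www.facebook.com/CarmeloAnthony",
    "/player/Carmelo-Anthony/Summary/452",
    None,  # assumed: a link with no href attribute
    "https://www.instagram.com/carmeloanthony",
]

# Keep only the non-empty links that contain the substring.
# "link and ..." skips None, because "in" raises TypeError on None.
facebook_links = [
    link for link in links
    if link and "https://www.facebook.com" in link
]

print(facebook_links)  # ['https://www.facebook.com/CarmeloAnthony']
```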
CodePudding user response:
You can keep a list of the URLs you want and use any() to check whether one of them appears in the href you're checking, like so:
from bs4 import BeautifulSoup
import requests

def get_profile(url):
    profiles = []
    urls_to_keep = ['https://www.facebook.com', 'https://www.twitter.com', 'https://www.instagram.com']
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    # attrs is a dict mapping an attribute name to the value to match
    container = soup.find('div', attrs={'class': 'main-container'})
    for profile in container.find_all('a'):
        href = profile.get('href')
        # str() guards against links that have no href (None)
        if any(word in str(href) for word in urls_to_keep):
            profiles.append(href)
    for profile in profiles:
        print(profile)

get_profile('https://basketball.realgm.com/player/Carmelo-Anthony/Summary/452')
get_profile('https://basketball.realgm.com/player/LeBron-James/Summary/250')
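Since the links you want all begin with one of those URLs, str.startswith is a possible alternative to any(): it accepts a tuple of prefixes. The sketch below is not part of the code above; keep_social, SOCIAL_PREFIXES and the sample list are hypothetical names and data chosen for illustration.

```python
# Prefixes to keep; str.startswith needs a tuple, not a list.
SOCIAL_PREFIXES = (
    "https://www.facebook.com",
    "https://www.twitter.com",
    "https://www.instagram.com",
)

def keep_social(hrefs):
    """Return only the hrefs that start with one of SOCIAL_PREFIXES."""
    # isinstance(...) skips None values from links without an href.
    return [
        h for h in hrefs
        if isinstance(h, str) and h.startswith(SOCIAL_PREFIXES)
    ]

# Made-up sample data, including a URL that merely contains a prefix.
sample = [
    None,
    "/info/contact",
    "https://www.instagram.com/kingjames",
    "https://example.com/?u=https://www.facebook.com/x",
]

print(keep_social(sample))  # ['https://www.instagram.com/kingjames']
```

Unlike the in check, this skips a link that only mentions one of the URLs somewhere in the middle, such as the last sample entry.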
CodePudding user response:
You can check for several values at once. The built-in any() function is used for this.
from bs4 import BeautifulSoup
import requests

def get_profile(url):
    profiles = []
    social_networks = ["https://www.facebook.com", "https://www.twitter.com", "https://www.instagram.com"]
    req = requests.get(url)
    soup = BeautifulSoup(req.text, 'html.parser')
    # attrs is a dict mapping an attribute name to the value to match
    container = soup.find('div', attrs={'class': 'main-container'})
    for profile in container.find_all('a'):
        href = profile.get('href')
        # Skip links without an href, then keep only social-network links
        if href and any(link in href for link in social_networks):
            profiles.append(href)
    return profiles

print(get_profile('https://basketball.realgm.com/player/Carmelo-Anthony/Summary/452'))
print(get_profile('https://basketball.realgm.com/player/LeBron-James/Summary/250'))
OUTPUT:
['https://www.facebook.com/CarmeloAnthony', 'https://www.instagram.com/carmeloanthony']
['https://www.facebook.com/LeBron', 'https://www.instagram.com/kingjames']
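The output above contains no Twitter link. One possible reason, which is an assumption and has not been checked against the page, is that such links are written as twitter.com without the www prefix, so the "https://www.twitter.com" check never matches. A rough sketch of a more forgiving check compares hostnames using urllib.parse from the standard library. The names is_social_link and SOCIAL_DOMAINS and the sample URLs are hypothetical.

```python
from urllib.parse import urlparse

# Bare domains to keep; a leading "www." is stripped before comparing.
SOCIAL_DOMAINS = {"facebook.com", "twitter.com", "instagram.com"}

def is_social_link(href):
    """Return True if href points at one of SOCIAL_DOMAINS."""
    if not href:
        return False
    # hostname is None for relative links such as "/player/...".
    host = urlparse(href).hostname or ""
    if host.startswith("www."):
        host = host[len("www."):]
    return host in SOCIAL_DOMAINS

# Made-up sample data for illustration.
sample = [
    "https://twitter.com/example_player",
    "https://www.facebook.com/CarmeloAnthony",
    "/player/Carmelo-Anthony/Summary/452",
    None,
]

print([h for h in sample if is_social_link(h)])
```

Inside get_profile, the condition could then become if is_social_link(href) in place of the any() check.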