I am doing web scraping. Given a list of movie names, my task is to find some data about each movie (IMDb ID, cast, etc.) on the IMDb website.
So first I run a Google search for "IMDB Movie_Name" and try to scrape the search results to get the URL of the IMDb title page.
import re
from bs4 import BeautifulSoup
from requests import get

url = 'https://www.google.com/search?q=IMDB title taare zameen par'
headers = {'Accept-Language': 'en-US, en;q=0.5'}
page = get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
# Keep only the links whose href contains an IMDb title URL
my = soup.find_all('a', attrs={'href': re.compile("https://www.imdb.com/title/")})
for i in my:
    print(i.get('href'))
The result I get looks like this:
/url?q=https://www.imdb.com/title/tt0986264/&sa=U&ved=2ahUKEwitxIKpj4f4AhUoppUCHfNjB3MQtwJ6BAgEEAI&usg=AOvVaw3zzfaZDFa8tmhGcIRS7_sV
My question is: how do I extract just the part "https://www.imdb.com/title/tt0986264/"?
The logic should be: return the substring that
- starts with "https://www.imdb.com/title/tt"
- and ends with "/".
CodePudding user response:
There are two ways:
Regex (https://regex101.com/): most languages support it (e.g. PHP preg_match, .NET new Regex("...") with regex.Match, JavaScript regex match/search, etc.).
Splitting the string (in your case on "/" or on "https://..../.../" etc.) and taking the part you need.
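As a rough sketch of the regex route in Python (the language used in the question), something like the following could work. The names IMDB_TITLE_RE and extract_imdb_url are made up for illustration, and the pattern assumes the title ID is "tt" followed only by digits, which matches the sample href but is not guaranteed by anything in the question.

```python
import re

# Assumption: an IMDb title ID is "tt" followed by digits, then a closing "/".
IMDB_TITLE_RE = re.compile(r"https://www\.imdb\.com/title/tt\d+/")

def extract_imdb_url(href):
    """Return the IMDb title URL embedded in href, or None if there is none."""
    match = IMDB_TITLE_RE.search(href)
    return match.group(0) if match else None

# Sample href copied from the question
href = ("/url?q=https://www.imdb.com/title/tt0986264/&sa=U"
        "&ved=2ahUKEwitxIKpj4f4AhUoppUCHfNjB3MQtwJ6BAgEEAI"
        "&usg=AOvVaw3zzfaZDFa8tmhGcIRS7_sV")
print(extract_imdb_url(href))
```

Because the pattern stops at the first "/" after the ID, the trailing "&sa=..." parameters are never part of the match.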
CodePudding user response:
You can apply the .split() method to get your desired substrings:
import re
from bs4 import BeautifulSoup
import requests
url = 'https://www.google.com/search?q=IMDB title taare zameen par'
headers = {'Accept-Language': 'en-US, en;q=0.5'}
page = requests.get(url, headers=headers)
soup = BeautifulSoup(page.text, 'html.parser')
my = soup.find_all('a', attrs={'href': re.compile("https://www.imdb.com/title/")})
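for i in my:
    # Drop everything after the first '&', then cut at the last '/' and re-append it
    href = i.get('href').split('&')[0].rsplit('/', 1)[0] + '/'
    # Strip Google's redirect prefix to leave the bare IMDb URL
    print(href.replace('/url?q=', ''))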
Output:
https://www.imdb.com/title/tt0986264/
https://www.imdb.com/title/tt0986264/
https://www.imdb.com/title/tt0986264/
https://www.imdb.com/title/tt0986264/
https://www.imdb.com/title/tt13300004/
https://www.imdb.com/title/tt0986264/
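
A different route from the split-based answer above is to treat the href as a URL and read its q parameter with the standard library's urllib.parse. This is only a sketch: the helper name target_from_google_href is hypothetical, and it assumes Google keeps putting the destination in the q parameter, as in the sample href from the question.

```python
from urllib.parse import urlparse, parse_qs

def target_from_google_href(href):
    """Return the value of the 'q' query parameter of a redirect link, or None."""
    query = urlparse(href).query      # everything after the '?'
    values = parse_qs(query).get("q")  # parse_qs maps each key to a list of values
    return values[0] if values else None

# Sample href copied from the question
href = ("/url?q=https://www.imdb.com/title/tt0986264/&sa=U"
        "&ved=2ahUKEwitxIKpj4f4AhUoppUCHfNjB3MQtwJ6BAgEEAI"
        "&usg=AOvVaw3zzfaZDFa8tmhGcIRS7_sV")
print(target_from_google_href(href))
```

This avoids depending on where the "/" and "&" characters fall, at the cost of returning whatever q holds, so a check that the result starts with "https://www.imdb.com/title/tt" may still be worth adding.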