How to extract a string from url starting and ending with specific letters

Time:05-31

I am doing web scraping. Given a list of movie names, my task is to find some data about each movie (IMDb ID, cast, etc.) on the IMDb website.

So first I ran a Google search for "IMDB Movie_Name" and tried to scrape the search results to get the URL of the IMDb movie title page.

import re
from bs4 import BeautifulSoup
from requests import get

url = 'https://www.google.com/search?q=IMDB title taare zameen par'
headers = {'Accept-Language': 'en-US, en;q=0.5'}
page = get(url, headers=headers)

soup = BeautifulSoup(page.text, 'html.parser')
# Keep only the links whose href contains an IMDb title URL
my = soup.find_all('a', attrs={'href': re.compile("https://www.imdb.com/title/")})
for i in my:
    print(i.get('href'))

The results I get look like this:

/url?q=https://www.imdb.com/title/tt0986264/&sa=U&ved=2ahUKEwitxIKpj4f4AhUoppUCHfNjB3MQtwJ6BAgEEAI&usg=AOvVaw3zzfaZDFa8tmhGcIRS7_sV

My question is: how do I get just the part "https://www.imdb.com/title/tt0986264/"?

The logic used should be: if the substring

  1. starts with "https://www.imdb.com/title/tt"
  2. and ends with "/",

then return that substring.

CodePudding user response:

There are two ways:

  • Regex (https://regex101.com/). All major languages support regular expressions (e.g. PHP preg_match, .NET new Regex("...") with regex.Match, JavaScript match/search, etc.)

  • Splitting the string (in your case on "/" or on "https://..../.../" etc.) and taking the part you need
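
For the regex route, here is a minimal Python sketch applied to the href from the question. The names IMDB_TITLE_RE and extract_title_url are made up for illustration, and the pattern assumes the title ID contains no "/" character, so the match stops at the first "/" after "tt...".

```python
import re

# Rule from the question: start at "https://www.imdb.com/title/tt"
# and stop at the next "/" (inclusive).
IMDB_TITLE_RE = re.compile(r"https://www\.imdb\.com/title/tt[^/]*/")


def extract_title_url(href):
    """Return the IMDb title URL embedded in href, or None if there is none."""
    match = IMDB_TITLE_RE.search(href)
    return match.group(0) if match else None


sample = ("/url?q=https://www.imdb.com/title/tt0986264/&sa=U"
          "&ved=2ahUKEwitxIKpj4f4AhUoppUCHfNjB3MQtwJ6BAgEEAI"
          "&usg=AOvVaw3zzfaZDFa8tmhGcIRS7_sV")
print(extract_title_url(sample))  # https://www.imdb.com/title/tt0986264/
```

Returning None when nothing matches lets the caller skip links that do not follow the expected shape instead of raising an error.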

CodePudding user response:

You can apply the .split() method to get the desired substring:

import re
from bs4 import BeautifulSoup
import requests

url = 'https://www.google.com/search?q=IMDB title taare zameen par'
headers = {'Accept-Language': 'en-US, en;q=0.5'}
page = requests.get(url, headers=headers)

soup = BeautifulSoup(page.text, 'html.parser')
my = soup.find_all('a', attrs={'href': re.compile("https://www.imdb.com/title/")})
for i in my:
    # Drop everything from the first '&', cut at the last '/', then restore the trailing '/'
    href = i.get('href').split('&')[0].rsplit('/', 1)[0] + '/'
    # Strip Google's redirect prefix
    print(href.replace('/url?q=', ''))

Output:

https://www.imdb.com/title/tt0986264/
https://www.imdb.com/title/tt0986264/
https://www.imdb.com/title/tt0986264/
https://www.imdb.com/title/tt0986264/
https://www.imdb.com/title/tt13300004/
https://www.imdb.com/title/tt0986264/
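
As an alternative to chained split() calls (not what the answer above uses), the redirect href can be treated as a URL and its q parameter read with the standard library's urllib.parse. This is only a sketch: the function name target_from_redirect is hypothetical, and it assumes the link keeps the "/url?q=...&sa=..." shape shown in the question.

```python
from urllib.parse import parse_qs, urlsplit


def target_from_redirect(href):
    """Return the value of the 'q' query parameter of href, or None if missing."""
    query = urlsplit(href).query      # everything after the '?'
    values = parse_qs(query).get("q")  # parse_qs maps each key to a list of values
    return values[0] if values else None


sample = ("/url?q=https://www.imdb.com/title/tt0986264/&sa=U"
          "&ved=2ahUKEwitxIKpj4f4AhUoppUCHfNjB3MQtwJ6BAgEEAI"
          "&usg=AOvVaw3zzfaZDFa8tmhGcIRS7_sV")
print(target_from_redirect(sample))  # https://www.imdb.com/title/tt0986264/
```

Note that this returns the whole target URL, so a link pointing deeper than the title page would still need trimming to the "tt.../" part.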