Home > Mobile >  How to scrape individual text from one tag?
How to scrape individual text from one tag?

Time:06-10

I am trying to scrape data of Top 250 movies from IMDB.

from bs4 import BeautifulSoup
import requests
import pandas as pd

url="https://www.imdb.com/chart/top/?ref_=nv_mv_250"
page=requests.get(url).content
soup=BeautifulSoup(page,"html.parser")

data=[]
titles=soup.find_all("td",class_="titleColumn")
ratings=soup.find_all("td",class_="ratingColumn imdbRating")
for title,rating,year in zip(titles,ratings,years):
    data.append({"Title":title.text.replace("\n",""),
                "Rating":rating.text.replace("\n","")})
pd.DataFrame(data)

And I get this result:

enter image description here

As you can see Title column includes index, title and release year of the movie. I want to get these texts separately in different columns. enter image description here

CodePudding user response:

You can use str.extract. You don't need requests and bs4 here, you can directly get the data with pd.read_html:

# Get the table
df = pd.read_html(url)[0]

# Extract Rank, Title, Year from 'Rank & Title' column
pat = r'(?P<Rank>\d )\.\s (?P<Title>[^\(] )\s \((?P<Year>\d{4})\)'
df1 = df['Rank & Title'].str.extract(pat).astype({'Rank': int, 'Year': int})

# Merge previous dataframe with 'IMDb Rating' column
out = pd.concat([df1, df['IMDb Rating'].rename('Rating')], axis=1)

Output:

>>> out
     Rank                                 Title  Year  Rating
0       1                           Les Évadés   1994     9.2
1       2                           Le Parrain   1972     9.2
2       3  The Dark Knight : Le Chevalier noir   2008     9.0
3       4                Le Parrain, 2ᵉ partie   1974     9.0
4       5                  12 Hommes en colère   1957     8.9
..    ...                                   ...   ...     ...
245   246                              Aladdin   1992     8.0
246   247                               Gandhi   1982     8.0
247   248            La couleur des sentiments   2011     8.0
248   249                  La Belle et la Bête   1991     8.0
249   250                 Danse avec les loups   1990     8.0

[250 rows x 4 columns]

>>> out.dtypes
Rank        int64
Title      object
Year        int64
Rating    float64
dtype: object

CodePudding user response:

You can try .str.extract

df['rank'] = df['Title'].str.extract('^(\d )\.')
df['year'] = df['Title'].str.extract('\((\d )\)$')
df['Title'] = df['Title'].str.extract('^\d \. (.*) \(\d \)$')
print(df)

                      Title rank  year
0  The Shawshank Redemption    1  1994
1             The Godfather    2  1972
2           The Dark Knight    3  2008
3     The Godfather Part II    4  1974
4              12 Angry Men    5  1957

CodePudding user response:

Select all <tr> of <table> with an <td> and use .stripped_strings to extract and separate the text values of each element in the row:

for e in soup.select('table[data-caller-name="chart-top250movie"] tr:has(td)'):
    data.append(dict(zip(['rank','name','year','rating'],e.stripped_strings)))

To get rid of the "()" in year column you could replace() it like this:

df.year = df.year.str.replace(r'[()]','', regex=True)
Example
from bs4 import BeautifulSoup
import requests
import pandas as pd

url="https://www.imdb.com/chart/top/?ref_=nv_mv_250"
page=requests.get(url).content
soup=BeautifulSoup(page,"html.parser")

data=[]
for e in soup.select('table[data-caller-name="chart-top250movie"] tr:has(td)'):
    data.append(dict(zip(['rank','name','year','rating'],e.stripped_strings)))

df = pd.DataFrame(data)
df.year = df.year.str.replace(r'[()]','', regex=True)
df
Output
rank name year rating
0 1 Die Verurteilten 1994 9.2
1 2 Der Pate 1972 9.2
2 3 The Dark Knight 2008 9.0
3 4 Der Pate 2 1974 9.0
4 5 Die zwölf Geschworenen 1957 8.9

...

  • Related