Not Able To Scrape Website Title - Python Bs4

Time:11-30

I am trying to get the titles of the games, but along with each title I am also getting the text of the nested span.

Here is my code:

import time
import requests,pandas
from bs4 import BeautifulSoup


r = requests.get("https://www.pocketgamer.com/android/best-horror-games/?page=1", headers=
{'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content
bs4 = BeautifulSoup(c,"html.parser")

all = bs4.find_all("h3",{"class":"indent"}) 
print(all)

Output

[<h3 >
<div><span>1</span></div>
Fran Bow </h3>, <h3 >
<div><span>2</span></div>
Bendy and the Ink Machine </h3>, <h3 >
<div><span>3</span></div>
Five Nights at Freddy's </h3>, <h3 >
<div><span>4</span></div>
Sanitarium </h3>, <h3 >
<div><span>5</span></div>
OXENFREE </h3>, <h3 >
<div><span>6</span></div>
Thimbleweed Park </h3>, <h3 >
<div><span>7</span></div>
Samsara Room </h3>, <h3 >

I also tried this, but it does not work:

#all = all.find_all("h3")[0].text

CodePudding user response:

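Here is a minimal working solution. title.text also contains the rank number from the nested <span>, so the code splits the text on whitespace and drops the first token, leaving only the title.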

Code:

import time
import requests,pandas
from bs4 import BeautifulSoup


r = requests.get("https://www.pocketgamer.com/android/best-horror-games/?page=1", headers=        
{'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content
bs4 = BeautifulSoup(c,"html.parser")

all = bs4.find_all("h3",{"class":"indent"}) 
for title in all:
    print(' '.join(title.text.split()[1:]))

Output:

Fran Bow
Bendy and the Ink Machine
Five Nights at Freddy's
Sanitarium
OXENFREE
Thimbleweed Park
Samsara Room
Into the Dead 2
Slayaway Camp
Eyes - the horror game
Slendrina:The Cellar
Hello Neighbor
Alien: Blackout
Rest in Pieces
Friday the 13th: Killer Puzzle
I Am Innocent
Detention
Limbo
Knock-Knock
Sara Is Missing
Death Park: Scary Horror Clown
Horror Hospital 2
Horrorfield - Multiplayer Survival Horror Game
Erich Sann: Horror in the scary Academy
The Innsmouth Case
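
The split-and-drop step can be tried offline on a static snippet. This is only a sketch: the HTML below is hand-written to resemble the output shown in the question, the class="indent" attribute is assumed from the find_all call, and title_without_rank is a hypothetical helper name.

```python
from bs4 import BeautifulSoup

# Hand-written sample that mimics the structure printed in the question.
# The class attribute is an assumption based on the find_all() arguments.
html = """
<h3 class="indent">
<div><span>1</span></div>
Fran Bow </h3>
<h3 class="indent">
<div><span>2</span></div>
Bendy and the Ink Machine </h3>
"""

soup = BeautifulSoup(html, "html.parser")


def title_without_rank(h3):
    """Return the heading text with its first whitespace-separated token removed."""
    tokens = h3.get_text().split()  # e.g. ['1', 'Fran', 'Bow']
    return " ".join(tokens[1:])     # drop the rank, rejoin the rest


titles = [title_without_rank(h3) for h3 in soup.find_all("h3", {"class": "indent"})]
print(titles)
```

This approach relies on the rank always being the first token of the heading text. If a heading had no rank, the first word of its title would be dropped instead.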

CodePudding user response:

How to fix?

Because the text you want is always the last element inside the <h3>, you can extract it from the contents of the <h3>:

element.contents[-1]

To get the text, iterate over the result set:

for x in bs4.find_all("h3",{"class":"indent"}):
    print(x.contents[-1].get_text(strip=True))
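
To see what contents actually holds, the same idea can be run on a static snippet without a network request. This is a sketch under assumptions: the HTML is hand-written to resemble the output in the question, and the class="indent" attribute is assumed from the find_all call. It uses plain str.strip() on the last node rather than get_text(strip=True), since the last child here is a text node and behaves like an ordinary string.

```python
from bs4 import BeautifulSoup

# Hand-written sample resembling the question's output (class is assumed).
html = """
<h3 class="indent">
<div><span>1</span></div>
Fran Bow </h3>
<h3 class="indent">
<div><span>3</span></div>
Five Nights at Freddy's </h3>
"""

soup = BeautifulSoup(html, "html.parser")

first = soup.find("h3", {"class": "indent"})
# With html.parser, the children are: a newline text node, the <div>,
# and a trailing text node that holds the title.
child_count = len(first.contents)
last_node = first.contents[-1]

# The trailing text node still carries surrounding whitespace, so strip it.
titles = [str(h3.contents[-1]).strip()
          for h3 in soup.find_all("h3", {"class": "indent"})]
print(child_count, repr(str(last_node)))
print(titles)
```

Unlike splitting the whole heading text, this does not depend on the rank being a single token. It does depend on the title being the last child of the <h3>.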

Example

import requests,pandas
from bs4 import BeautifulSoup


r = requests.get("https://www.pocketgamer.com/android/best-horror-games/?page=1", 
headers={'User-agent': 'Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:61.0) Gecko/20100101 Firefox/61.0'})
c = r.content
bs4 = BeautifulSoup(c,"html.parser")

all = [x.contents[-1].get_text(strip=True) for x in bs4.find_all("h3",{"class":"indent"})]
print(all)

Output

['Fran Bow', 'Bendy and the Ink Machine', "Five Nights at Freddy's", 'Sanitarium', 'OXENFREE', 'Thimbleweed Park', 'Samsara Room', 'Into the Dead 2', 'Slayaway Camp', 'Eyes - the horror game', 'Slendrina:The Cellar', 'Hello Neighbor', 'Alien: Blackout', 'Rest in Pieces', 'Friday the 13th: Killer Puzzle', 'I Am Innocent', 'Detention', 'Limbo', 'Knock-Knock', 'Sara Is Missing', 'Death Park: Scary Horror Clown', 'Horror Hospital 2', 'Horrorfield - Multiplayer Survival Horror Game', 'Erich Sann: Horror in the scary Academy', 'The Innsmouth Case']