Relevant part of the DOM: Screenshot of the DOM
This is the code I wrote:
from bs4 import BeautifulSoup
import requests
URL = 'https://www.cheapflights.com.sg/flight-search/SIN-KUL/2022-06-04?sort=bestflight_a&attempt=3&lastms=1653844067064'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
flight = soup.find('div', class_= 'resultWrapper')
print(flight)
print(flight) always prints None. I have tried other div tags with different class names, but the result is still None. The soup itself seems fine, because print(soup) prints a text version of the DOM, so the problem seems to be in the soup.find line.
Any suggestions on how I can get something other than None? Thank you!
CodePudding user response:
That's because of the User-Agent. If I curl the page without changing the default User-Agent, the site returns a different page instead of the flight results.
Change your code like this so that the request is not flagged as coming from a script:
headers = {
    "User-Agent": "Mozilla/5.0 (X11; Linux x86_64) ..."
}
page = requests.get(URL, headers=headers)
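To tell whether the fetch or the parsing is at fault, it can help to check if the target class appears at all in the HTML that requests received. The following is a minimal sketch using only the standard library, so the check does not depend on BeautifulSoup itself. The names ClassCounter and count_class and the two sample HTML strings are hypothetical; in practice you would pass page.text instead.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # A class attribute may hold several space-separated names.
            if name == "class" and value and self.class_name in value.split():
                self.count += 1


def count_class(html, class_name):
    """Return how many tags in html carry class_name."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    parser.close()
    return parser.count


# Hypothetical stand-ins for page.text: a bot-check page and a results page.
blocked_html = "<html><body><div class='captcha'>Are you a robot?</div></body></html>"
results_html = "<html><body><div class='resultWrapper big'>S$ 120</div></body></html>"

print(count_class(blocked_html, "resultWrapper"))  # 0
print(count_class(results_html, "resultWrapper"))  # 1
```

If the count is 0 for the real response, soup.find has nothing to match, and the fix has to happen in the request rather than in the parsing.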
CodePudding user response:
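BeautifulSoup's .find() always returns None because the flight data is not in the initial HTML. The page loads it afterwards through an Ajax request to a separate API URL, which returns the results as HTML inside a JSON response. To get the data, you have to request that API URL directly.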
Example with full working code:
import requests
import json
from bs4 import BeautifulSoup
payload = 'searchId=LbEiRwhKF_&poll=true&pollNumber=0&applyFilters=true&filterState=&useViewStateFilterState=true&pageNumber=1&watchedResultId=&append=false&sortMode=bestflight&ascending=true&priceType=daybase&requestReason=POLL&phoenixRising=true&isSecondPhase=false&displayAdPageLocations=left,bottom-left,bottom,upper-right,right&existingAds=false&activeLeg=-1&hasFilterPreferences=false&view=list&renderPlusMinusThreeFlex=false&renderAirlineStopsMatrix=false&requestAlternateFlexDates=false&ajaxts=1653846256069&scriptsMetadata=16wB&B20Q1CQUBI3QEH&g9$g6B21CBiIiwYar1#CgI9CEB5g5UBD1L5I1D32Gg14gCF&PgE2B1iw1osDQBiz!1QF1EI1B3Eg30Cg=&stylesMetadata=22Dg1U74giE9E4Q18g1C1EB4Q16C21G3I1zQc1HhhYMIw6JQ2Q1gE7Q1gQI2IQ1g3G43HQ24C12ju1e4ICQGrI1h1wCQ%&CCK1ED7I289C50Q79BlRfBHg6BQ2Q36E4C1CRIC45B367IkQJ1gSZIkSZ10CI1k1BI10J121k87gQR1g53I1kSR30Q4J1kSZIgSZIkCQ9k1Q3JIkCQI1SQIkSZIkSQ3JIgCQ1gC11QJIkSZIkSRIgSQ85gSZI1SJIkSZIkSQ1kCR1gSJIk1Q10QZ3Q2SBIE1BI97Q10k1Q5g5Z1k1Z1E1I1kSRIECZIgQRIkSZIkCQ1ESIIEQQI1SZI460gSZIkC182kCQ==&r9version=R618c'
api_url = 'https://www.cheapflights.com.sg/s/horizon/flights/results/FlightSearchPoll?p=0'
headers = {
    "content-type": "application/x-www-form-urlencoded",
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.5005.62 Safari/537.36",
    "x-csrf": "$_DAcoZoQE$1Qxq2wktuQeX1wM66H6xJBg6LXd0u0KM-biFU0Ll7e68O886T7kg2pLDoSs$ycoT1x9xj50oeIEA",
    "x-r9-blue-green-version": "R618c",
    "x-requested-with": "XMLHttpRequest",
    "x-requestid": "flights#results#dfYzFA"
}
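# Send the same form data the browser sends. The searchId and x-csrf values were
# copied from a browser session and will likely need refreshing for a new search.
req = requests.post(api_url, headers=headers, data=payload)
# The response is JSON; its 'content' field holds the results as HTML.
# Equivalent: req.json()['content']
data = json.loads(req.text)['content']
# Optionally save the HTML fragment for inspection:
# with open('ajax.html', 'w', encoding="utf-8") as f:
#     f.write(data)
soup = BeautifulSoup(data, 'lxml')  # the 'lxml' parser requires the lxml package
for price in soup.select('.price-text'):
    print(price.get_text(strip=True))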
Output:
S$ 135
S$ 120
S$ 120
S$ 121
S$ 126
S$ 127
S$ 127
S$ 133
S$ 133
S$ 134
S$ 135
S$ 137
S$ 146
S$ 164
S$ 120
S$ 148
S$ 157
S$ 160
S$ 165
S$ 167
S$ 171
S$ 174
S$ 177
S$ 178
S$ 180
S$ 184
S$ 189
S$ 192
S$ 286
S$ 146
S$ 154
S$ 154
S$ 157
S$ 163
S$ 167
S$ 168
S$ 168
S$ 177
S$ 184
S$ 149
S$ 157
S$ 174
S$ 176
S$ 176
S$ 187
S$ 191
S$ 191
S$ 200
S$ 211
S$ 149
S$ 154
S$ 154
S$ 157
S$ 163
S$ 167
S$ 168
S$ 168
S$ 177
S$ 184
S$ 149
S$ 152
S$ 153
S$ 162
S$ 164
S$ 165
S$ 172
S$ 174
S$ 182
S$ 183
S$ 187
S$ 191
S$ 200
S$ 200
S$ 223
S$ 151
S$ 182
S$ 165
S$ 167
S$ 169
S$ 169
S$ 171
S$ 171
S$ 174
S$ 177
S$ 178
S$ 180
S$ 184
S$ 189
S$ 190
S$ 193
S$ 160
S$ 176
S$ 176
S$ 180
S$ 187
S$ 191
S$ 191
S$ 198
S$ 201
S$ 211
S$ 171
S$ 175
S$ 188
S$ 189
S$ 190
S$ 196
S$ 199
S$ 202
S$ 207
S$ 209
S$ 209
S$ 210
S$ 213
S$ 213
S$ 246
S$ 174
S$ 188
S$ 189
S$ 190
S$ 190
S$ 197
S$ 199
S$ 199
S$ 202
S$ 209
S$ 209
S$ 210
S$ 213
S$ 213
S$ 246
S$ 175
S$ 182
S$ 198
S$ 198
S$ 203
S$ 211
S$ 213
S$ 215
S$ 220
S$ 226
S$ 239
S$ 193
S$ 247
S$ 251
S$ 255
S$ 256
S$ 256
S$ 259
S$ 259
S$ 260
S$ 266
S$ 267
S$ 269
S$ 286
S$ 323
S$ 236
S$ 132
S$ 132
S$ 133
S$ 139
S$ 143
S$ 144
S$ 146
S$ 155
S$ 158
S$ 127
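The prices above are plain strings, so comparing or sorting them needs a conversion step first. Here is a minimal sketch, assuming every price looks like "S$ 135" with an optional thousands comma; parse_price is a hypothetical helper, and the sample list reuses three values from the output above.

```python
import re


def parse_price(text):
    """Return the integer amount from a string such as 'S$ 135', or None."""
    match = re.search(r"\d[\d,]*", text)
    if match is None:
        return None
    # Drop thousands separators before converting.
    return int(match.group().replace(",", ""))


sample = ["S$ 135", "S$ 120", "S$ 286"]
# Skip anything that does not contain a number.
amounts = [p for p in map(parse_price, sample) if p is not None]
print(min(amounts))  # 120
```

In the scraping loop, the same helper could be applied to price.get_text(strip=True) instead of printing the raw text.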