I'm getting a response from a Beautiful Soup scraper as a string like the one below:
1:25Allahabad01 h 50 m Non stop13:15Pune
12:40Allahabad07 h 25 m 1 stop via New Delhi20:05Pune
I'm looking for a way to split it into the expected output:
['1:25', 'Allahabad', '01 h 50 m', 'Non stop', '13:15', 'Pune']
['12:40', 'Allahabad', '07 h 25 m', '1 stop via New Delhi', '20:05', 'Pune']
City names can vary. I'm thinking of using a regex, but I'm not good at writing them, so I'm looking for a better approach.
The code I use to get these values is:
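from bs4 import BeautifulSoup
from selenium import webdriver
from selenium.webdriver.chrome.service import Service
from webdriver_manager.chrome import ChromeDriverManager

options = webdriver.ChromeOptions()  # assumed; not shown in the original snippet
driver = webdriver.Chrome(service=Service(ChromeDriverManager().install()), options=options)
url = "https://www.makemytrip.com/flight/search?itinerary=IXD-PNQ-14/07/2022&tripType=O&paxType=A-2_C-0_I-0&intl=false&cabinClass=E&ccde=IN&lang=eng"
driver.get(url)
body = driver.page_source
driver.quit()  # Browser closed.
soupBody = BeautifulSoup(body, "html.parser")  # Parse the page source using BeautifulSoup
for el in soupBody.find_all('div', attrs={'class': 'timingOptionOuter'}):
    print(el.get_text())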
CodePudding user response:
I agree with @HedgeHog that the underlying issue is how the text is extracted with BS4.
Here's a regex solution anyway:
>>> import re
>>> regex = re.compile(r"(\d{1,2}:\d{1,2})([A-Za-z ]+)((?:\d{1,2} h )?\d{1,2} m) (Non stop|\d stop[A-Za-z ]+)(\d{1,2}:\d{1,2})([A-Za-z ]+)")
>>> regex.findall("1:25Allahabad01 h 50 m Non stop13:15Pune\n12:40Allahabad07 h 25 m 1 stop via New Delhi20:05Pune")
[('1:25', 'Allahabad', '01 h 50 m', 'Non stop', '13:15', 'Pune'), ('12:40', 'Allahabad', '07 h 25 m', '1 stop via New Delhi', '20:05', 'Pune')]
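If you want this as a script instead of a REPL session, here is a minimal sketch of the same pattern split into commented pieces. The names FLIGHT_RE and split_flights are just illustrative, and the pattern assumes city names contain only ASCII letters and spaces and that a "N stop" entry is always followed by more text such as "via ...".

```python
import re

# The six capture groups, one per expected output field.
FLIGHT_RE = re.compile(
    r"(\d{1,2}:\d{1,2})"              # departure time, e.g. 1:25
    r"([A-Za-z ]+)"                   # departure city (ASCII letters and spaces only)
    r"((?:\d{1,2} h )?\d{1,2} m) "    # duration, followed by a single space
    r"(Non stop|\d stop[A-Za-z ]+)"   # stop information
    r"(\d{1,2}:\d{1,2})"              # arrival time
    r"([A-Za-z ]+)"                   # arrival city
)


def split_flights(text):
    """Return one list of six fields for every flight found in text."""
    return [list(match) for match in FLIGHT_RE.findall(text)]


raw = (
    "1:25Allahabad01 h 50 m Non stop13:15Pune\n"
    "12:40Allahabad07 h 25 m 1 stop via New Delhi20:05Pune"
)
for row in split_flights(raw):
    print(row)
```

Lines that do not fit the pattern are silently skipped by findall(), so it is worth comparing the number of rows returned with the number of elements scraped.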
CodePudding user response:
As mentioned, there is no need for regex or any post-processing. When scraping the text via get_text(),
you can set an additional separator parameter to keep the individual strings apart:
el.get_text(',', strip=True)
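Since I cannot open the site, here is a small self-contained sketch on made-up markup. The HTML below is purely an assumption about how the fields might be nested inside the timingOptionOuter element; the real page may differ, in which case you could get more or fewer pieces per flight.

```python
from bs4 import BeautifulSoup

# Hypothetical markup: the real structure of the page is not known here.
html = """
<div class="timingOptionOuter">
  <p>1:25</p><p>Allahabad</p>
  <p>01 h 50 m</p><p>Non stop</p>
  <p>13:15</p><p>Pune</p>
</div>
"""

soup = BeautifulSoup(html, "html.parser")
rows = []
for el in soup.find_all("div", attrs={"class": "timingOptionOuter"}):
    # The separator is placed between text nodes; strip=True trims each
    # node and drops the ones that contain only whitespace.
    rows.append(el.get_text(",", strip=True).split(","))

print(rows)
```

If a field can itself contain a comma, list(el.stripped_strings) gives the same pieces without joining and splitting again.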
Note: In newer code, avoid the old syntax findAll()
and use find_all() instead.
For more, take a minute to check the docs. The site is not available from Europe, so this is just a general approach to fixing your issue.