How to handle mismatch of data while combining wanted_list items in AutoScraper library in python?

Time:10-19

I am using the AutoScraper library to scrape Q&A data from a travel website. I am currently scraping https://www.holidify.com/places/manali

I want to scrape the frequently asked questions and their answers. This is the code I am using:

from autoscraper import AutoScraper
import pandas as pd
question_df=pd.DataFrame()
answer_df=pd.DataFrame()
url = "https://www.holidify.com/places/manali"
wanted = ["What is famous about Manali?" ,"Very beautiful. Snows through most of January and February. Can be visited throughout the year. Snow sports, paragliding and other adventure sports available. Starting point for drives to Spiti valley and Ladakh."]
scraper = AutoScraper()
scraper.build(url, [wanted[0]])
question_df["question"] = scraper.get_result_similar(url, unique=False,keep_blank=True,keep_order=True)
scraper.build(url, [wanted[1]])
answer_df["answer"] = scraper.get_result_similar(url, unique=False,keep_blank=True,keep_order=True)
print(len(question_df), len(answer_df))

The two lengths do not match, so I cannot combine the questions and answers into a single dataframe. Because of the mismatch, questions and answers may also be mapped to each other incorrectly. Is there any way to handle missing data and map each question to its answer correctly while using AutoScraper or any other web-scraping library?

CodePudding user response:

I am not very familiar with AutoScraper, but I would still recommend changing your scraping strategy. The main issue seems to be that an element that cannot be found does not appear in the result list at all, so you have to handle that case yourself.

Instead of creating detached lists, try to capture and persist the information in a more structured way and in relation to each other.
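
To illustrate why detached lists drift apart, here is a minimal sketch in plain Python. The `cards` list is made-up stand-in data, not the real page content, and the sketch assumes the scraper simply skips elements it cannot find:

```python
# Hypothetical stand-in for parsed FAQ cards; the second card has no answer.
cards = [
    {"question": "Q1", "answer": "A1"},
    {"question": "Q2"},
    {"question": "Q3", "answer": "A3"},
]

# Detached lists: the missing answer leaves no gap, so the list is shorter.
questions = [c["question"] for c in cards]
answers = [c["answer"] for c in cards if "answer" in c]
misaligned = list(zip(questions, answers))

# Paired extraction: each card yields one record, with None for a missing answer.
paired = [(c["question"], c.get("answer")) for c in cards]

print(len(questions), len(answers))
print(misaligned)
print(paired)
```

With the detached lists, Q2 ends up next to A3 and Q3 is dropped. With the paired records, the missing answer shows up as None in the correct row.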

Here are two examples of what I would do with requests and beautifulsoup.

Example

Select your elements more specifically and scrape both pieces of information in one go:

import requests
from bs4 import BeautifulSoup
import pandas as pd

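r = requests.get('https://www.holidify.com/places/manali')
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(r.text, 'html.parser')

data = []

# Each .card holds one question together with its answer, so the pair stays aligned
for e in soup.select('#accordionFlexibleParagraphs .card'):
    body = e.select_one('.card-body')
    data.append({
        'question': e.h2.get_text(strip=True),
        # Keep a None placeholder when a card has no answer body
        'answer': body.get_text(strip=True) if body else None
    })

pd.DataFrame(data)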

or by extracting the JSON data from a script tag:

import requests, json
from bs4 import BeautifulSoup
import pandas as pd

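r = requests.get('https://www.holidify.com/places/manali')
# Name the parser explicitly so bs4 does not have to guess one
soup = BeautifulSoup(r.text, 'html.parser')

# Find the JSON-LD script tag that carries the FAQPage data
script = soup.select_one('[type="application/ld+json"]:-soup-contains("FAQPage")')

# One row per entry in the "mainEntity" list
pd.json_normalize(json.loads(script.text), 'mainEntity')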
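
For reference, a FAQPage block in JSON-LD nests each answer inside its question, which is why this route keeps the pairs together. The sketch below uses only the standard library. The sample JSON is invented for illustration and follows the schema.org FAQPage layout (`mainEntity`, `name`, `acceptedAnswer.text`); whether the real page matches it exactly is an assumption, and `faq_rows` is a hypothetical helper name:

```python
import json

# Invented sample in the schema.org FAQPage shape (not the real page data)
raw = '''
{
  "@context": "https://schema.org",
  "@type": "FAQPage",
  "mainEntity": [
    {"@type": "Question", "name": "Q1",
     "acceptedAnswer": {"@type": "Answer", "text": "A1"}},
    {"@type": "Question", "name": "Q2"}
  ]
}
'''

def faq_rows(doc):
    """Return one {'question', 'answer'} dict per mainEntity entry."""
    rows = []
    for item in doc.get("mainEntity", []):
        # A question without an acceptedAnswer gets None instead of being skipped
        answer = item.get("acceptedAnswer") or {}
        rows.append({"question": item.get("name"), "answer": answer.get("text")})
    return rows

rows = faq_rows(json.loads(raw))
print(rows)
```

The resulting list of dicts can be passed straight to pd.DataFrame(rows) if you prefer to control the column names yourself.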