I am trying to find examples of scrapers that successfully detect the APIs used on websites. One example I know of is the browser extension BuiltWith, but it fails to detect APIs from time to time, especially browser APIs.
Any help would be appreciated. Thank you in advance.
CodePudding user response:
You can use Puppeteer to intercept all requests made by the page and thereby detect the APIs it uses. I've used the function below to scrape JSON data from the API requests made by a site instead of scraping the page content itself, but it can be used to collect the list of APIs called as well. Hope it helps.
import puppeteer from 'puppeteer';
// request-promise-native is deprecated, but it is used here to replay each
// intercepted request so the response body can be captured.
import requestClient from 'request-promise-native';

async function scrapePageNetwork(url: string, ...spiedRequests: string[]) {
  const result: any[] = [];
  try {
    const browser = await puppeteer.launch();
    const page = await browser.newPage();

    // Interception must be enabled, otherwise continue()/abort() throw.
    await page.setRequestInterception(true);

    page.on('request', (request) => {
      // Replay the request to capture its response alongside the
      // original request metadata, then let the page's request proceed.
      requestClient({
        uri: request.url(),
        resolveWithFullResponse: true,
      }).then((response: any) => {
        result.push({
          url: request.url(),
          headers: request.headers(),
          postData: request.postData(),
          response: {
            headers: response.headers,
            size: response.headers['content-length'],
            body: response.body,
          },
        });
        request.continue();
      }).catch((error: any) => {
        console.error(error);
        request.abort();
      });
    });

    // networkidle0 waits until no requests are in flight, so the
    // intercepted traffic has been collected before the browser closes.
    await page.goto(url, { waitUntil: 'networkidle0' });
    await browser.close();
  } catch (err) {
    console.log('Failed to parse requests', err);
  }
  // Keep only the requests whose full URLs were explicitly asked for.
  return result.filter((req) => spiedRequests.includes(req.url));
}
Please note that this only captures requests made directly by the page; if an API the page calls in turn connects to another API on the backend, you won't be able to see that second call.
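As a rough usage sketch: since the final filter matches full URLs exactly, you pass the complete endpoint URLs you want to spy on. Both URLs below are hypothetical placeholders, not endpoints from the original answer.

(async () => {
  // Hypothetical target page and API endpoint; replace with real URLs.
  const matches = await scrapePageNetwork(
    'https://example.com',
    'https://example.com/api/data'
  );
  // Each entry carries the request URL, headers, post data, and response.
  console.log(matches.map((req) => req.url));
})();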