I would like to get a table from ebi.ac.uk/interpro listing the thousands of protein names, accession numbers, species, and lengths for the entry I searched on the website. I tried writing a Python script using requests, BeautifulSoup, and so on, but I keep getting the error
AttributeError: 'NoneType' object has no attribute 'find_all'.
The code:
import requests
from bs4 import BeautifulSoup

# Set the URL of the website you want to scrape
url = xxxx

# Send a request to the website and get the response
response = requests.get(url)

# Parse the response using BeautifulSoup
soup = BeautifulSoup(response.text, "html.parser")

# Find the table on the page
table = soup.find("table", class_='xxx')

# Extract the data from the table
# This will return a list of rows, where each row is a list of cells
table_data = []
for row in table.find_all('tr'):
    cells = row.find_all("td")
    row_data = []
    for cell in cells:
        row_data.append(cell.text)
    table_data.append(row_data)

# Print the extracted table data
print(table_data)
For table = soup.find("table", class_='xxx'), I fill in the class according to the name I see when I inspect the page.
Thank you.
In short, I would like to get back a table listing all the thousands of proteins that the website returns for my request.
CodePudding user response:
Sure, it is possible. Take a look at this example:
import requests

url = "https://www.ebi.ac.uk/interpro/wwwapi/entry/hamap/"
querystring = {"search": "", "page_size": "9999"}

response = requests.get(url, params=querystring)
print(response.text)
Please do not use Selenium unless absolutely necessary. The table on this site is rendered by JavaScript, which is why BeautifulSoup cannot find it in the raw HTML (that is what your AttributeError means: soup.find returned None). The data actually comes from a JSON API. In the example above we request all the entries from /hamap/; I have no idea what that means, but this is the API used to fetch the data. You can find the API endpoint for the dataset you want as follows:
1. Open Chrome dev tools -> Network -> click Fetch/XHR.
2. Click on the specific source you want and wait until the page loads.
3. Click the red icon to stop recording.
4. Look through the recorded requests for the one you want.
It is important to stop recording once you have the initial response: this website sends a tracking request every second or so, and the list becomes cluttered very quickly. Once you have the request you want, just loop over the results array in the JSON and pull out the fields you need. I hope this answer was useful to you.
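Looping over that JSON can be sketched like this. The field names under "metadata" are assumptions based on typical InterPro API payloads, not something verified here; check them against the actual response you see in the dev-tools Network tab:

```python
def extract_rows(payload):
    """Pull (accession, name) pairs out of one page of API JSON.

    The "results" and "metadata" field names are assumptions; verify
    them against the real response before relying on this.
    """
    rows = []
    for entry in payload.get("results", []):
        metadata = entry.get("metadata", {})
        rows.append((metadata.get("accession"), metadata.get("name")))
    return rows

# Usage against the live API (endpoint copied from the Network tab):
# import requests
# payload = requests.get(
#     "https://www.ebi.ac.uk/interpro/wwwapi/entry/hamap/",
#     params={"page_size": "100"},
# ).json()
# for accession, name in extract_rows(payload):
#     print(accession, name)
```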
CodePudding user response:
Hey, I checked it out some more. This site uses something similar to Elasticsearch's scroll API; here is a full implementation of what you are looking for:
import requests
import json

results_array = []

def main():
    ## This is the URL you want to scrape, on page 0
    start_url = "https://www.ebi.ac.uk/interpro/wwwapi//protein/UniProt/entry/InterPro/IPR002300/?page_size=100&has_model=true"
    start_page = requests.get(start_url).json()
    count = int(start_page['count'])  ## total number of entries
    next_url = start_page['next']     ## URL of the next page
    for result in start_page['results']:
        results_array.append(result)
    while next_url:                   ## loop until the API reports no next page
        next_page = requests.get(next_url).json()
        next_url = next_page['next']
        for result in next_page['results']:
            results_array.append(result)
        count -= 100
        print(count)                  ## rough progress indicator

if __name__ == '__main__':
    main()
    with open("output.json", "w") as f:
        f.write(json.dumps(results_array))
To use this for any other dataset, replace the start_url string with the corresponding API URL. Make sure it is the URL that controls pagination: click on the data you want, go to the next page, and use that URL.
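Since the goal is a table of protein name, accession, species, and length, the collected results_array can be flattened to CSV. A minimal sketch, assuming the metadata field names (including "source_organism" and "fullName") seen in typical InterPro protein payloads; verify them against your own output.json:

```python
import csv

def results_to_rows(results):
    """Flatten API results into (accession, name, species, length) rows.

    Field names under "metadata" are assumptions; check the actual JSON.
    """
    rows = []
    for entry in results:
        md = entry.get("metadata", {})
        organism = md.get("source_organism") or {}
        rows.append((
            md.get("accession"),
            md.get("name"),
            organism.get("fullName"),
            md.get("length"),
        ))
    return rows

def write_csv(rows, path="proteins.csv"):
    """Write the flattened rows to a CSV file with a header line."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["accession", "name", "species", "length"])
        writer.writerows(rows)
```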
I hope this answer is what you were looking for.