parent-child relation data scraping with selenium, beautifulsoup

Time:03-13

I hope you're all doing well! I'm trying to scrape this list of lineages (https://cov-lineages.org/lineage_list.html). The lineages are parent-child related. What I have to do:

  1. Loop through the list (https://cov-lineages.org/lineage_list.html), click each element, and scrape its data.
  2. Then go to a link (on the same page) that has the mutation table of each lineage and scrape it as well.
  3. Scroll down to the table that lists the children of that lineage, loop through them, click each one, and scrape its data. If a child has children of its own, repeat the same process for them. I've included an explanation with screenshots in a PDF file; please take a look and see if you can come up with an idea for how I can implement this with trees or nested dictionaries.

CodePudding user response:

You do not need Selenium for this task; requests will do the job.

This code will get all the rows in the list:

import requests
from bs4 import BeautifulSoup

res = requests.get('https://cov-lineages.org/lineage_list.html')
soup = BeautifulSoup(res.text, 'html.parser')

rows = soup.find_all('tr')

for row in rows:
    print(row)

From here you can get the individual cells of each row with row.find_all('td'). Use the browser inspector (Ctrl+Shift+I) to identify the HTML elements you need.
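As a rough sketch of the row.find_all('td') step, the snippet below collects the text of each cell into a list per row. The inline HTML is a made-up stand-in so the example runs offline; the real page's table may be structured differently, so check it in the inspector first.

```python
from bs4 import BeautifulSoup

# Hypothetical stand-in markup (an assumption, not copied from the real page).
html = """
<table>
  <tr><th>Lineage</th><th>Description</th></tr>
  <tr><td>A</td><td>Root lineage</td></tr>
  <tr><td>A.1</td><td>USA lineage</td></tr>
</table>
"""
soup = BeautifulSoup(html, 'html.parser')

records = []
for row in soup.find_all('tr'):
    # Text of every <td> in this row, with surrounding whitespace removed
    cells = [td.get_text(strip=True) for td in row.find_all('td')]
    if cells:  # the header row has <th> cells only, so it yields an empty list
        records.append(cells)

print(records)
```

To run it against the live page, replace html with res.text from the request above.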

CodePudding user response:

The data is all within the JSON source that the site uses to render the page. Just get the data directly; it's more efficient. This gets all the data you'd scrape with Selenium in a fraction of the time: seconds, as opposed to hours spent having Selenium click on each of the 1907 parent links, followed by the sublinks under them (I don't know exactly how many, but it appears Selenium would be clicking on 2181 or so links in total).

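import pandas as pd
import requests

url = 'https://raw.githubusercontent.com/cov-lineages/lineages-website/master/_data/lineage_data.json'
# The JSON is a dict; keep only its values (one record per lineage)
jsonData = requests.get(url).json()
jsonData = list(jsonData.values())

# Flatten the list of records into a DataFrame, one row per lineage
df = pd.json_normalize(jsonData)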

Output:

print(df)
     Lineage  ...                                        Description
0          A  ...  Root of the pandemic lies within lineage A. Ma...
1        A.1  ...                                        USA lineage
2        A.2  ...  Mostly Spanish lineage now includes South and ...
3      A.2.2  ...                                 Australian lineage
4      A.2.3  ...                                   Scottish lineage
     ...  ...                                                ...
2176    *H.1  ...  Withdrawn: Alias of B.1.1.67.1, South African ...
2177     I.1  ...  Lineage reassigned. Withdrawn: Alias of B.1.1....
2178    *I.1  ...  Withdrawn: Alias of B.1.1.217.1, Latvian linea...
2179     J.1  ...  Lineage reassigned. Withdrawn: Alias of B.1.1....
2180    *J.1  ...  Withdrawn: Alias of B.1.1.250.1, Australian li...

[2181 rows x 9 columns]
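For the nested-dictionary part of the question, one possible sketch is to nest the names from the Lineage column by their dotted prefixes, so that A.2.2 lands under A.2, which lands under A. This is an assumption about how parents and children relate: it does not resolve aliases (the descriptions above mention names such as I.1 being aliases of longer B.1.1... names), and it skips names starting with '*' purely as an illustrative choice. build_tree is a hypothetical helper, not part of any library.

```python
def build_tree(lineages):
    """Nest lineage names by dotted prefix, e.g. 'A.2.2' under 'A.2' under 'A'."""
    tree = {}
    for name in lineages:
        if name.startswith('*'):
            # Assumption: skip the '*'-prefixed entries seen in the output above
            continue
        node = tree
        parts = name.split('.')
        # Walk down 'A' -> 'A.2' -> 'A.2.2', creating missing levels on the way
        for depth in range(1, len(parts) + 1):
            prefix = '.'.join(parts[:depth])
            node = node.setdefault(prefix, {})
    return tree

# Small hard-coded sample taken from the first rows of the output above
sample = ['A', 'A.1', 'A.2', 'A.2.2', 'A.2.3', '*H.1']
tree = build_tree(sample)
print(tree)
```

With the DataFrame from above, this could be called as build_tree(df['Lineage']). Each key maps to a dictionary of its children, and a lineage with no children maps to an empty dictionary.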