"Missing"/Hidden HTML Code Stalling Webscraper Development

Time:12-24

I am a novice programmer attempting to write a web scraper, with the end goal of speeding up the conversion of .ict files to .csv files for NASA EarthData programs. I plan to use the BeautifulSoup Python library to gather the data from the webpage, build a table from it, and then export that table to a .csv file. The first file I plan to convert is: https://asdc.larc.nasa.gov/data/AJAX/O3_1/2018/02/28/AJAX-O3_ALPHA_20180228_R1_F220.ict

When I opened Chrome DevTools to find the HTML behind the columns, I was surprised to see hardly any markup (screenshot: "Lack of HTML Data").

Could someone help me understand how to parse the .ict file and extract this data into a table?

Ideally, the table would have 7 columns ('Int_Start', 'Int_End', 'TIME', 'G_Lat', 'G_Lon', 'G_Alt', 'O3'). Each column would hold the values from the matching column seen in the image, and I would then export the table to a .csv file.

The website is behind a NASA EarthData authentication wall, which I have logged into using the following code:

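import requests
from bs4 import BeautifulSoup

link = 'https://urs.earthdata.nasa.gov/login'
# Assumption: url is the protected .ict file linked above (the original snippet never defines it).
url = 'https://asdc.larc.nasa.gov/data/AJAX/O3_1/2018/02/28/AJAX-O3_ALPHA_20180228_R1_F220.ict'

with requests.Session() as s:
    s.headers['User-Agent'] = 'Mozilla/5.0 (Windows NT 6.1) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/88.0.4324.104 Safari/537.36'
    # The first request is expected to land on the login form.
    r = s.get(url)
    soup = BeautifulSoup(r.text, "lxml")
    # Copy every named form field (including hidden ones) into the payload.
    payload = {i['name']: i.get('value', '') for i in soup.select('input[name]')}
    # For the program to work, each user will need to enter their own username and password below.
    payload['username'] = 'username'
    payload['password'] = 'password'
    res = s.post(link, data=payload)
    # Request the file again now that the session is logged in.
    res = s.get(url)
    print(res.text)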

where I insert my personal information for the payload username and password. Any advice for other libraries to utilize or how to access the HTML for the data would be much appreciated. Thank you.

CodePudding user response:

I was able to solve the problem by adding the code:

html_data = res.text
soup = BeautifulSoup(html_data, 'lxml')
print(soup.prettify())

in the next cell. The output contains three tags: <HTML>, <body>, and <p>.
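
As a possible next step toward the .csv file, here is a minimal sketch that works on the plain text of the response rather than on HTML tags. It assumes the file follows the usual ICARTT layout: the first line reads "N, 1001", where N is the number of header lines, header line N holds the comma-separated column names, and every line after that is a comma-separated data row. The function names `ict_to_rows` and `rows_to_csv` and the sample text are hypothetical, made up for illustration; they are not taken from the real file.

```python
import csv
import io


def ict_to_rows(text):
    """Split ICARTT-style text into (column_names, data_rows).

    Assumption: line 1 is "N, 1001" with N = number of header lines,
    and header line N lists the column names.
    """
    lines = text.strip().splitlines()
    n_header = int(lines[0].split(",")[0])
    columns = [name.strip() for name in lines[n_header - 1].split(",")]
    rows = [
        [value.strip() for value in line.split(",")]
        for line in lines[n_header:]
        if line.strip()  # skip blank lines
    ]
    return columns, rows


def rows_to_csv(columns, rows):
    """Render the column names and data rows as CSV text."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(columns)
    writer.writerows(rows)
    return buffer.getvalue()


# Made-up sample in the assumed layout (not real measurements).
sample = """4, 1001
Hypothetical header line
Another hypothetical header line
Int_Start, Int_End, TIME, G_Lat, G_Lon, G_Alt, O3
100, 101, 100.5, 37.1, -122.0, 500, 41.2
101, 102, 101.5, 37.2, -122.1, 510, 42.0
"""

columns, rows = ict_to_rows(sample)
csv_text = rows_to_csv(columns, rows)
print(csv_text)
```

With the real download, `res.text` from the login code would take the place of `sample`, and `csv_text` could be written to disk with an ordinary `open(..., 'w')` call. If the real header differs from the assumed layout, the index used to find the column names would need adjusting.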
