df is pulling a none value from bs4 element

Time:07-21

I'm still new to Python, and thanks to everyone for the earlier help. I am trying to parse a web-scraped bs4 element that contains no tables into a DataFrame. The data I need sits inside a 'pre' element. I thought read_html with the right attributes would work, but I'm getting a None value from the bs4 element.

Code:

headers= {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
url = 'https://www.usbr.gov/pn-bin/instant.pl?parameter=CHRO q&syer=2022&smnth=7&sdy=12&eyer=2022&emnth=7&edy=19&format=2'

response = requests.get(url)  # reply from the website
soup = BeautifulSoup(response.text, 'html5lib')  # parse the HTML with the html5lib parser
data = soup.select('pre')[1]  # second 'pre' block, which contains the needed data
# print(data.text.strip())  # prints the data

input= pd.read_html(data, attrs = {'pre':'table'})#reads html data
df1=pd.DataFrame(input, index=None,)

CodePudding user response:

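pd.read_html only parses <table> elements, so it has nothing to read in a <pre> block. Since the data is plain comma-separated text, try wrapping it in io.StringIO and passing it to pandas read_csv():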

import io
from bs4 import BeautifulSoup 
import requests
import pandas as pd
headers= {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
url = 'https://www.usbr.gov/pn-bin/instant.pl?parameter=CHRO q&syer=2022&smnth=7&sdy=12&eyer=2022&emnth=7&edy=19&format=2'

response = requests.get(url, headers=headers)  # reply from the website
soup = BeautifulSoup(response.text, 'html5lib')  # parse the HTML with the html5lib parser
data = soup.select('pre')[1]  # second 'pre' block, which contains the needed data
# print(data.text.strip())  # prints the data

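# The first line ("BEGIN DATA") is read as the only column name,
# and the date/time field of each row becomes the index.
df = pd.read_csv(io.StringIO(data.text))
# Select that single column as a Series.
df = df.xs('BEGIN DATA', axis=1, drop_level=True)
# Drop the last row, which holds the trailing marker line rather than data.
print(df.iloc[:-1])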

Output:

   DATE       TIME         CHRO    Q
07/12/2022 00:00                10.60       
07/12/2022 00:15                10.60       
07/12/2022 00:30                10.60       
07/12/2022 00:45                10.60       
                           ...
07/19/2022 22:45                 9.36       
07/19/2022 23:00                 9.36
07/19/2022 23:15                 9.36
07/19/2022 23:30                 9.36
07/19/2022 23:45                 9.36
 Length: 769
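
As an alternative to the xs step, read_csv can be asked to skip the marker lines itself and return a regular two-column DataFrame. This is only a minimal sketch: the sample string below is made up from the output above (it is not fetched from the site), its exact layout is an assumption, and the column name DATETIME is an illustrative choice.

```python
import io
import pandas as pd

# Hypothetical sample standing in for data.text; the real layout may differ.
sample_text = """BEGIN DATA
DATE       TIME     ,   CHRO     Q
07/12/2022 00:00,     10.60
07/12/2022 00:15,     10.60
07/19/2022 23:45,      9.36
END DATA
"""

df = pd.read_csv(
    io.StringIO(sample_text.strip()),
    skiprows=1,                    # skip the "BEGIN DATA" marker line
    skipfooter=1,                  # skip the "END DATA" marker line
    engine="python",               # skipfooter requires the python engine
    header=0,                      # the next line is the header row...
    names=["DATETIME", "CHRO Q"],  # ...which is replaced by these names
    skipinitialspace=True,         # ignore the padding after each comma
)
print(df)
```

With the real page, io.StringIO(data.text.strip()) would take the place of the sample string, provided the text really has one marker line at the top and one at the bottom.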

CodePudding user response:

That <pre> element contains the data as plain text. To turn it into a dataframe you need to look at the structure of that text, and you are lucky here: every line has the same structure. So what you can do is:

from bs4 import BeautifulSoup
import requests
import pandas as pd

headers= {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}

url = 'https://www.usbr.gov/pn-bin/instant.pl?parameter=CHRO q&syer=2022&smnth=7&sdy=12&eyer=2022&emnth=7&edy=19&format=2'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
data = soup.select('pre')[1]
df_list = []

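# Skip the two leading lines (marker and header) and the trailing marker line.
for line in data.text.strip().splitlines()[2:-1]:
    # Each line looks like "07/12/2022 00:00,     10.60".
    stamp, value = line.split(',', 1)
    date, time = stamp.split()
    df_list.append((date, time, value.strip()))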
df = pd.DataFrame(df_list, columns = ['DATE', 'TIME', 'CHRO Q'])
print(df)

This will return a (real) dataframe, 768 rows × 3 columns:

     DATE        TIME   CHRO Q
0    07/12/2022  00:00  10.60
1    07/12/2022  00:15  10.60
2    07/12/2022  00:30  10.60
3    07/12/2022  00:45  10.60
4    07/12/2022  01:00  10.60
...  ...         ...    ...
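
Note that all three columns built this way hold strings. If numeric or time-based operations are needed afterwards, a follow-up conversion might look like the sketch below. The three rows are copied from the output above, and the TIMESTAMP column name is a hypothetical choice for illustration.

```python
import pandas as pd

# A few rows copied from the output above, all stored as strings.
df = pd.DataFrame(
    [("07/12/2022", "00:00", "10.60"),
     ("07/12/2022", "00:15", "10.60"),
     ("07/12/2022", "00:30", "10.60")],
    columns=["DATE", "TIME", "CHRO Q"],
)

# Convert the readings from text to floats.
df["CHRO Q"] = pd.to_numeric(df["CHRO Q"])

# Combine DATE and TIME into a single datetime column (name is hypothetical).
df["TIMESTAMP"] = pd.to_datetime(df["DATE"] + " " + df["TIME"], format="%m/%d/%Y %H:%M")

print(df.dtypes)
```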