I'm still new to Python, and thanks to everyone for the earlier help. I am trying to parse a web-scraped bs4 element that contains no tables into a DataFrame. The data I need sits inside a 'pre' element. I thought using read_html with the right attributes would work, but I'm getting a None value from the bs4 element.
Code:
headers= {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
url = 'https://www.usbr.gov/pn-bin/instant.pl?parameter=CHRO q&syer=2022&smnth=7&sdy=12&eyer=2022&emnth=7&edy=19&format=2'
response = requests.get(url) #reply from website
soup = BeautifulSoup(response.text, 'html5lib')  # HTML from the website, parsed with html5lib by BeautifulSoup
data = soup.select('pre')[1]#selects second block of 'pre' - containing the needed data
#print(data.text.strip())#prints the data
input= pd.read_html(data, attrs = {'pre':'table'})#reads html data
df1=pd.DataFrame(input, index=None,)
CodePudding user response:
Try using io.StringIO with pandas read_csv():
import io
from bs4 import BeautifulSoup
import requests
import pandas as pd
headers= {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/103.0.0.0 Safari/537.36'}
url = 'https://www.usbr.gov/pn-bin/instant.pl?parameter=CHRO q&syer=2022&smnth=7&sdy=12&eyer=2022&emnth=7&edy=19&format=2'
response = requests.get(url, headers=headers)  # reply from website, sent with the User-Agent defined above
soup = BeautifulSoup(response.text, 'html5lib')  # HTML from the website, parsed with html5lib by BeautifulSoup
data = soup.select('pre')[1]#selects second block of 'pre' - containing the needed data
#print(data.text.strip())#prints the data
df = pd.read_csv(io.StringIO(data.text))
df = df.xs('BEGIN DATA', axis=1, drop_level=True)
print(df.iloc[:-1])
Output (a Series, with the original column header line as its first entry):
DATE TIME CHRO Q
07/12/2022 00:00 10.60
07/12/2022 00:15 10.60
07/12/2022 00:30 10.60
07/12/2022 00:45 10.60
...
07/19/2022 22:45 9.36
07/19/2022 23:00 9.36
07/19/2022 23:15 9.36
07/19/2022 23:30 9.36
07/19/2022 23:45 9.36
Length: 769
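If a DataFrame with typed columns is preferred over a Series of strings, the marker and header lines can be dropped before calling read_csv. The following is only a sketch: the sample text is a hypothetical stand-in for the second 'pre' block, with its layout assumed from the output above, and the column name DATETIME is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical sample mimicking the layout of the second <pre> block
# (assumed from the output shown above; the live page may differ).
sample = """BEGIN DATA
DATE       TIME, CHRO     Q
07/12/2022 00:00,    10.60
07/12/2022 00:15,    10.60
07/19/2022 23:45,     9.36
END DATA
"""

# Drop the 'BEGIN DATA' line, the column header line and the 'END DATA' line.
lines = sample.strip().splitlines()[2:-1]

# Each remaining line is 'MM/DD/YYYY HH:MM, value', so a plain comma split works.
df = pd.read_csv(
    io.StringIO("\n".join(lines)),
    header=None,
    names=["DATETIME", "CHRO Q"],  # illustrative column names
    skipinitialspace=True,         # ignore the padding after the comma
)

# Convert the timestamp strings into real datetimes.
df["DATETIME"] = pd.to_datetime(df["DATETIME"], format="%m/%d/%Y %H:%M")
print(df.dtypes)
```

With the real page, sample would be replaced by data.text from the code above.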
CodePudding user response:
That <pre> element contains data in plain text format. If you want to transform it into a dataframe, you need to look at the structure of the text, and you are lucky here, as all lines have the same structure. So what you can do is:
from bs4 import BeautifulSoup
import requests
import pandas as pd
headers= {'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/102.0.0.0 Safari/537.36'}
url = 'https://www.usbr.gov/pn-bin/instant.pl?parameter=CHRO q&syer=2022&smnth=7&sdy=12&eyer=2022&emnth=7&edy=19&format=2'
r = requests.get(url, headers=headers)
soup = BeautifulSoup(r.text, 'lxml')
data = soup.select('pre')[1]
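df_list = []
# skip the two leading header lines and the trailing footer line
for line in data.text.strip().splitlines()[2:-1]:
    date_time, value = line.split(',')[:2]  # 'MM/DD/YYYY HH:MM' and the reading
    date, time = date_time.split()[:2]
    df_list.append((date, time, value.strip()))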
df = pd.DataFrame(df_list, columns = ['DATE', 'TIME', 'CHRO Q'])
print(df)
This will return a (real) dataframe, 768 rows × 3 columns:
DATE TIME CHRO Q
0 07/12/2022 00:00 10.60
1 07/12/2022 00:15 10.60
2 07/12/2022 00:30 10.60
3 07/12/2022 00:45 10.60
4 07/12/2022 01:00 10.60
... ... ... ...
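Note that every column in this dataframe holds strings. If real timestamps and numbers are needed, the line splitting can do the conversion as well. This is a minimal sketch using only the standard library; the sample text and the helper name parse_line are hypothetical, and the line layout is assumed from the output above.

```python
from datetime import datetime

# Hypothetical sample with the same assumed layout as the <pre> block.
sample = """BEGIN DATA
DATE       TIME, CHRO     Q
07/12/2022 00:00,    10.60
07/12/2022 00:15,    10.60
07/19/2022 23:45,     9.36
END DATA
"""

def parse_line(line):
    """Split one 'MM/DD/YYYY HH:MM, value' line into (datetime, float)."""
    stamp, value = line.split(",", 1)
    # float() tolerates the padding spaces around the number
    return datetime.strptime(stamp.strip(), "%m/%d/%Y %H:%M"), float(value)

# Skip the two leading header lines and the trailing footer line, as above.
rows = [parse_line(line) for line in sample.strip().splitlines()[2:-1]]
print(rows[0])
```

The resulting rows list can be handed to pd.DataFrame(rows, columns=[...]) in the same way as df_list, with whatever two column names suit you.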