I need to write a script which sums values from each column (each column is a separate day). In addition, I want to separate the values into planned (blue color) and unplanned (red color). In the HTML code, I found that the unplanned values have a class name as "colBox cal-unplanned" and the planned values have a class name as "colBox cal-planned".
My code:
import pandas as pd
import requests
from bs4 import BeautifulSoup
URL = 'http://gpi.tge.pl/zestawienie-ubytkow'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
# Here I tried to convert the data into a dataframe, but then you don't know which values are planned and which are unplanned
table = soup.find_all('table')
df = pd.read_html(str(table),header=2)[0]
# Here the values are correct, but they are collected from the whole table
sum = 0
for tr in soup.find_all('td', class_='colBox cal-unplanned'):
val = int(tr.text)
sum = val
print(sum)
for tr in soup.find_all('td', class_='colBox cal-planned'):
print(tr.text)
And here's my question. How can I select values from each column separately
CodePudding user response:
So if I understood right, you want to work on single columns of your dataframe?
You could try to use this df['column_name']
to access a certain column of the df and then filter this column for the value you want to use like
df['column_name'] == filter_value
But then again I'm not sure I get your problem. This helped me heaps with dataframe value selection.
CodePudding user response:
Not sure there's a better way, but you can iterate through the table and store the planned and unplanned into separate values under the key of the column name. Then sum up those values and Then use that dictionary to convert to a dataframe.
But you're right, you lose that attribute in parsing it with .read_html()
.
This works, but not sure how robust it is for your situation.
import pandas as pd
import requests
from bs4 import BeautifulSoup
URL = 'http://gpi.tge.pl/zestawienie-ubytkow'
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')
table = soup.find('table')
data = {}
headers = [x.text.strip() for x in table.find_all('tr')[2].find_all('th')]
for header in headers:
data[header] = {'planned':[],
'unplanned':[]}
rows = table.find_all('tr')[3:]
for row in rows:
tds = row.find_all('td')[3:len(headers) 3]
for idx, value in enumerate(tds):
if value.has_attr("class"):
if 'cal-planned' in value['class']:
data[headers[idx]]['planned'].append(int(value.text.strip()))
elif 'cal-unplanned' in value['class']:
data[headers[idx]]['unplanned'].append(int(value.text.strip()))
sum_of_columns = {}
for col, values in data.items():
planned_sum = sum(values['planned'])
unplanned_sum = sum(values['unplanned'])
sum_of_columns[col] = {'planned':planned_sum,
'unplanned':unplanned_sum}
df = pd.DataFrame.from_dict(sum_of_columns,orient="columns" )
Output:
print(df.to_string())
Cz 14 Pt 15 So 16 N 17 Pn 18 Wt 19 Śr 20 Cz 21 Pt 22 So 23 N 24 Pn 25 Wt 26 Śr 27
planned 8808 8301 7750 6863 6069 6199 6069 5627 5627 5695 5695 5235 5235 5376
unplanned 2320 2020 2313 2783 950 950 950 950 950 950 950 910 910 910
CodePudding user response:
Not sure if this is necessarily an issue for bs4, because I think the information is already in the DataFrame as a sum.
How to access?
Take a look at the tail()
of your dataframe:
df.tail(3)
Example
import pandas as pd
URL = 'http://gpi.tge.pl/zestawienie-ubytkow'
df = pd.read_html(URL,header=2)[0]
df.tail(3).iloc[:,2:]
Output
Moc Osiągalna (MW) Cz 14 Pt 15 So 16 N 17 Pn 18 Wt 19 Śr 20 Cz 21 Pt 22 So 23 N 24 Pn 25 Wt 26 Śr 27
219 Planowane 11279 10604 8391 6863 6069 6432 6069 5627 5627 5695 5695 5235 5235 5376
220 Nieplanowane 5520 5620 2313 2783 950 950 950 950 950 950 950 910 910 910
221 Łącznie ubytki 16799 16224 10704 9646 7019 7382 7019 6577 6577 6645 6645 6145 6145 6286