Selecting values from each column separately-CodePudding

I need to write a script which sums values from each column (each column is a separate day). In addition, I want to separate the values into planned (blue color) and unplanned (red color). In the HTML code, I found that the unplanned values have a class name as "colBox cal-unplanned" and the planned values have a class name as "colBox cal-planned".

My code:

import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = 'http://gpi.tge.pl/zestawienie-ubytkow' 
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

# Here I tried to convert the data into a dataframe, but then you don't know which values are planned and which are unplanned

table = soup.find_all('table')
df = pd.read_html(str(table),header=2)[0]

# Here the values are correct, but they are collected from the whole table 
sum = 0
for tr in soup.find_all('td', class_='colBox cal-unplanned'):
    val = int(tr.text)
    sum  = val
print(sum)

for tr in soup.find_all('td', class_='colBox cal-planned'):
    print(tr.text)

And here's my question. How can I select values from each column separately

CodePudding user response：

So if I understood right, you want to work on single columns of your dataframe? You could try to use this df['column_name'] to access a certain column of the df and then filter this column for the value you want to use like

df['column_name'] == filter_value

But then again I'm not sure I get your problem. This helped me heaps with dataframe value selection.

CodePudding user response：

Not sure there's a better way, but you can iterate through the table and store the planned and unplanned into separate values under the key of the column name. Then sum up those values and Then use that dictionary to convert to a dataframe.

But you're right, you lose that attribute in parsing it with .read_html().

This works, but not sure how robust it is for your situation.

import pandas as pd
import requests
from bs4 import BeautifulSoup

URL = 'http://gpi.tge.pl/zestawienie-ubytkow' 
page = requests.get(URL)
soup = BeautifulSoup(page.content, 'html.parser')

table = soup.find('table')

data = {}
headers = [x.text.strip() for x in table.find_all('tr')[2].find_all('th')]
for header in headers:
    data[header] = {'planned':[],
                    'unplanned':[]}

rows = table.find_all('tr')[3:]
for row in rows:
    tds = row.find_all('td')[3:len(headers) 3]
    for idx, value in enumerate(tds):
        if value.has_attr("class"):
            if 'cal-planned' in value['class']:
                data[headers[idx]]['planned'].append(int(value.text.strip()))
            elif 'cal-unplanned' in value['class']:
                data[headers[idx]]['unplanned'].append(int(value.text.strip()))


sum_of_columns = {}
for col, values in data.items():
    planned_sum = sum(values['planned'])
    unplanned_sum = sum(values['unplanned'])
    
    sum_of_columns[col] = {'planned':planned_sum,
                           'unplanned':unplanned_sum}

   
df = pd.DataFrame.from_dict(sum_of_columns,orient="columns" )

Output:

print(df.to_string())
           Cz 14  Pt 15  So 16  N 17  Pn 18  Wt 19  Śr 20  Cz 21  Pt 22  So 23  N 24  Pn 25  Wt 26  Śr 27
planned     8808   8301   7750  6863   6069   6199   6069   5627   5627   5695  5695   5235   5235   5376
unplanned   2320   2020   2313  2783    950    950    950    950    950    950   950    910    910    910

CodePudding user response：

Not sure if this is necessarily an issue for bs4, because I think the information is already in the DataFrame as a sum.

How to access?

Take a look at the tail() of your dataframe:

df.tail(3)

Example

import pandas as pd

URL = 'http://gpi.tge.pl/zestawienie-ubytkow' 

df = pd.read_html(URL,header=2)[0]
df.tail(3).iloc[:,2:]

Output

    Moc Osiągalna (MW)  Cz 14   Pt 15   So 16   N 17    Pn 18   Wt 19   Śr 20   Cz 21   Pt 22   So 23   N 24    Pn 25   Wt 26   Śr 27
219 Planowane           11279   10604   8391    6863    6069    6432    6069    5627    5627    5695    5695    5235    5235    5376
220 Nieplanowane        5520    5620    2313    2783    950 950 950 950 950 950 950 910 910 910
221 Łącznie ubytki      16799   16224   10704   9646    7019    7382    7019    6577    6577    6645    6645    6145    6145    6286