Can't access site programmatically

Time:12-16

I'm trying to get a list of shutdowns from dtek-kem.com.ua/ua/shutdowns. But when I send a GET request via Python, I get the response "Request unsuccessful. Incapsula incident ID: ...". I also know that this site uses Imperva security.

Sending a request using python aiohttp:

method='GET'
Host: www.dtek-kem.com.ua
accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/avif,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
accept-encoding: gzip, deflate, br
accept-language: en,ru;q=0.9,uk;q=0.8,en-US;q=0.7
user-agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/108.0.0.0 Safari/537.36
cache-control: max-age=0
sec-ch-ua: "Not?A_Brand";v="8", "Chromium";v="108", "Google Chrome";v="108"
sec-ch-ua-mobile: ?0
sec-ch-ua-platform: "Windows"
sec-fetch-dest: document
sec-fetch-mode: navigate
sec-fetch-site: same-origin
sec-fetch-user: ?1
upgrade-insecure-requests: 1

I get the following response:

https://www.dtek-kem.com.ua/ua/shutdowns [200 OK]
Content-Type: text/html
Cache-Control: no-cache, no-store
Connection: close
Content-Length: 899
X-Iinfo: 4-43048402-0 0NNN RT(1670585645218 54) q(0 -1 -1 -1) r(0 -1) B12(4,316,0) U2
Strict-Transport-Security: max-age=31536000; includeSubDomains
Set-Cookie: incap_ses_287_2224657=4b9AWuO2/2fTOuVPWqH7Ay0dk2MAAAAAtnXLv3 84L80QP1nTKP8Fg==; Domain=dtek-kem.com.ua; Path=/; SameSite=None; Secure
Set-Cookie: visid_incap_2224657=OOVTSrqKRCeH0QB7kzrgIC0dk2MAAAAAQUIPAAAAAAB47Nowjvq7LxL76cUkJG0a; Domain=dtek-kem.com.ua; expires=Fri, 08 Dec 2023 22:17:56 GMT; HttpOnly; Path=/; SameSite=None; Secure

and html content:

<html style="height:100%">
 <head>
  <meta content="NOINDEX, NOFOLLOW" name="ROBOTS"/>
  <meta content="telephone=no" name="format-detection"/>
  <meta content="initial-scale=1.0" name="viewport"/>
  <meta content="IE=edge,chrome=1" http-equiv="X-UA-Compatible"/>
  <script async="" src="/Physicken-Like-my-Hath-I-haue-ster-Banq-All-bids">
  </script>
 </head>
 <body style="margin:0px;height:100%">
  <iframe frameborder="0" height="100%" id="main-iframe" marginheight="0px" marginwidth="0px" src="/_Incapsula_Resource?SWUDNSAI=31&amp;xinfo=4-43048402-0 0NNN RT(1670585645218 54) q(0 -1 -1 -1) r(0 -1) B12(4,316,0) U2&amp;incident_id=287000410527500428-206407667178998340&amp;edet=12&amp;cinfo=04000000&amp;rpinfo=0&amp;cts=swfgpEczXy9hSsxHaaLf43gsGYhnGBhKA1jABnA0Ljuov3FUOG0mGjfE6li1tAg6&amp;mth=GET" width="100%">
   Request unsuccessful. Incapsula incident ID: 287000410527500428-206407667178998340
  </iframe>
 </body>
</html>

I copied the request headers exactly from the browser's Network tab, taking them from the first request sent to the server when I open the site. Even so, the server responds to my script differently than it does to the browser. Doesn't the server receive absolutely identical requests? Response to the browser request:

access-control-allow-credentials: true
access-control-allow-credentials: true
access-control-allow-headers: DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type
access-control-allow-headers: DNT,X-CustomHeader,Keep-Alive,User-Agent,X-Requested-With,If-Modified-Since,Cache-Control,Content-Type
access-control-allow-methods: GET, POST, OPTIONS
access-control-allow-methods: GET, POST, OPTIONS
access-control-allow-origin: https://admin.dtek-kem.com.ua
cache-control: no-store, no-cache, must-revalidate
cache-control: max-age=900
cache-control: public, max-age=900
cache-control: no-store, no-cache, must-revalidate, proxy-revalidate, max-age=0
content-encoding: gzip
content-type: text/html; charset=UTF-8
date: Fri, 09 Dec 2022 12:02:38 GMT
expect-ct: enforce; max-age=3600
expect-ct: enforce; max-age=3600
expires: Thu, 19 Nov 1981 08:52:00 GMT
pragma: no-cache
referrer-policy: strict-origin-when-cross-origin
server: nginx
path=/; secure; secure; HttpOnly
status: 200
httpVersion: http/2.0
cookies: [{'name': 'dtek-kem', 'value': '0mspqled433d6pq7t9q9ttcjos'}, {'name': '_csrf-dtek-kem', 'value': '0957f055f621ade8b7c6a5136201e0081a1579972aa33443a65646c44afeb161a:2:{i:0;s:14:"_csrf-dtek-kem";i:1;s:32:"aJodoGWonH3u7fdI7jVzex4n6yBPZ9qX";}'}, {'name': 'Domain', 'value': 'dtek-kem.com.ua'}, {'name': 'incap_wrt_356', 'value': '3iOTYwAAAAA3Gkt0FwAI5AIQxJuq1AEYicrMnAYgAijdx8ycBknxuwb65PIpngUwOmGF xE='}]
content: {'size': 635168, 'mimeType': 'text/html'}

Am I getting into a big topic like "bypassing a firewall", or am I missing something?

Answer:

Requests

Requests works fine if you pass the "incap_ses_1612_2224657" cookie to the session:

import requests
import urllib.parse
from bs4 import BeautifulSoup as bs

url = r'https://www.dtek-kem.com.ua'
s = requests.Session()
s.cookies['incap_ses_1612_2224657'] = 'oRiXXtkFuiaomXJJnfleFu98mGMAAAAACfnEff2NJ ZJhjCB4Sr2Zw=='
r = s.get(urllib.parse.urljoin(url, 'ua/shutdowns'))
soup = bs(r.content, 'lxml')

So it's not a big topic like "bypassing a firewall"; the site is fairly accessible. Furthermore, the reCAPTCHA is bypassed in the browser by simply refreshing the page with F5. The cookie can be taken from there and used for as long as the session is active.
I don't yet know how to obtain it with requests alone; sometimes it gets the full set of cookies on its own. The headers don't really matter.
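
Because the block page in the question arrives with a 200 OK status, checking the status code is not enough to tell whether the cookie worked. A minimal sketch of a body check follows; the helper name is hypothetical and the two markers are simply the strings visible in the block page quoted in the question, so other variants of that page may need different markers.

```python
def looks_like_incapsula_block(html: str) -> bool:
    """Return True if the body resembles the block page quoted in the question."""
    # Both markers are taken from the HTML shown in the question.
    markers = ("_Incapsula_Resource", "Incapsula incident ID")
    return any(marker in html for marker in markers)


blocked = '<iframe src="/_Incapsula_Resource?SWUDNSAI=31">Request unsuccessful.</iframe>'
normal = '<html><head><meta name="csrf-token" content="x"></head></html>'
print(looks_like_incapsula_block(blocked))  # True
print(looks_like_incapsula_block(normal))   # False
```

With the session from the snippet above, this could be called as looks_like_incapsula_block(r.text) before parsing.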

Make a table

Now, how would we prepare a table without rendering the page and without tools like Scrapy, dryscrape, requests_html and other cool but resource-heavy libraries?
In certain cases those would be helpful, but here the data can be acquired with BeautifulSoup or even with re alone. We need just a single <script> element from the webpage that contains all the needed information.

Get the table data

import re
import json

d = soup.find_all(lambda tag: tag.name == 'script' and not tag.attrs)[-1].decode_contents()
d_parsed = {}
for i in re.findall(r'(?<=DisconSchedule\.)(\w+)(?:\s=\s)(.+)',d):
    d_parsed[i[0]] = json.loads(i[1])
d = d_parsed

Now the d variable contains a dictionary with the street names, the current day of the week, and the table values, which form a kind of 3-dimensional table that will need some further parsing.
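
To make the parsing step concrete, here is a small sketch that runs the same kind of pattern over a made-up script body. The sample text is an assumption: only the DisconSchedule prefix and the key names streets and currentWeekDayIndex come from this answer, and the real script is much larger.

```python
import json
import re

# Made-up stand-in for the contents of the <script> element.
sample_script = '''
DisconSchedule.streets = ["вул. Газопровідна"]
DisconSchedule.currentWeekDayIndex = 2
'''

parsed = {}
# Group 1 is the property name, group 2 is the JSON literal after " = ".
for name, value in re.findall(r'(?<=DisconSchedule\.)(\w+)(?:\s=\s)(.+)', sample_script):
    parsed[name] = json.loads(value)

print(parsed)
```

Each assignment must fit on one line for this to work, because "." does not match a newline by default.
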
But first we'll need to get house information with a post request:

csrf = soup.find('meta', {'name': 'csrf-token'})['content']
headers = {
    'X-CSRF-Token': csrf,
    'Content-Type': 'application/x-www-form-urlencoded; charset=UTF-8'
}
body = 'method=getHomeNum&data[0][name]=street&data[0][value]=' + d['streets'][193]
r = s.post(urllib.parse.urljoin(url, '/ua/ajax'), body.encode('utf-8'), headers=headers)
house = json.loads(r.content)['data']['20']
house
Output:
{'sub_type': 'Застосування стабілізаційних графіків',
 'start_date': '1670926920',
 'end_date': '16:00 13.12.2022',
 'type': '2',
 'sub_type_reason': ['1']}

Here we definitely need some headers: specify the content type and pass the CSRF token. The cookies are already in the session. The body of this query contains a street name; d['streets'][193] is 'вул. Газопровідна'.
The response has some useful information that is rendered on the site in a div with a yellow background above the table, so it is worth keeping.
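
As an alternative to concatenating the body by hand, the form fields can be percent-encoded with the standard library. This is only a sketch: build_home_num_body is a hypothetical helper, and it assumes the server decodes percent-encoded brackets like any standard form parser, which has not been checked against this site.

```python
from urllib.parse import parse_qsl, urlencode


def build_home_num_body(street: str) -> str:
    """Encode the getHomeNum form fields as application/x-www-form-urlencoded."""
    fields = [
        ("method", "getHomeNum"),
        ("data[0][name]", "street"),
        ("data[0][value]", street),
    ]
    return urlencode(fields)


body = build_home_num_body("вул. Газопровідна")
print(body)
# Decoding the body again returns the original field names and values.
print(parse_qsl(body))
```

The resulting string can be passed to s.post() in place of the hand-built body, with the same headers.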

But what we are looking for is "sub_type_reason". This is the 3rd dimension I was talking about. It is shown to the right of the house number and stands for 'Група' (group) 1 / 2 / 3. There might be more groups at some point.

For this particular address "вул. Газопровідна 20" we'll be using group 1.
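
The house record shown above mixes formats, so a short sketch of reading it may help. It assumes that 'start_date' is a Unix timestamp in seconds and that 'end_date' follows the 'HH:MM DD.MM.YYYY' pattern; both are inferred from the single sample above, not from any documentation.

```python
from datetime import datetime, timezone

# The record exactly as printed in the output above.
house = {
    'sub_type': 'Застосування стабілізаційних графіків',
    'start_date': '1670926920',
    'end_date': '16:00 13.12.2022',
    'type': '2',
    'sub_type_reason': ['1'],
}

# The first entry of sub_type_reason is the group used to index the schedule.
group = house['sub_type_reason'][0]
# Assumption: start_date is epoch seconds; converted here to UTC.
start = datetime.fromtimestamp(int(house['start_date']), tz=timezone.utc)
# Assumption: end_date is a naive local time in 'HH:MM DD.MM.YYYY' form.
end = datetime.strptime(house['end_date'], '%H:%M %d.%m.%Y')

print(group, start.isoformat(), end.isoformat())
```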

Build a table

I'll be using pandas for this. We'll be making some further modifications, so pandas is a good fit here.

import pandas as pd

gr = house['sub_type_reason'][0]
# One column per weekday (1-7), one row per hour (0-23)
df = pd.DataFrame({int(k): list(d['preset']['data'][gr][k].values()) for k in d['preset']['days'].keys()})
df
Output:

    1       2       3       4       5       6       7
0   no      maybe   no      no      maybe   no      no
1   no      maybe   yes     no      maybe   yes     no
2   no      maybe   yes     no      maybe   yes     no
3   no      no      maybe   no      no      maybe   no
4   yes     no      maybe   yes     no      maybe   yes
5   yes     no      maybe   yes     no      maybe   yes
6   maybe   no      no      maybe   no      no      maybe
7   maybe   yes     no      maybe   yes     no      maybe
8   maybe   yes     no      maybe   yes     no      maybe
9   no      maybe   no      no      maybe   no      no
10  no      maybe   yes     no      maybe   yes     no
11  no      maybe   yes     no      maybe   yes     no
12  no      no      maybe   no      no      maybe   no
13  yes     no      maybe   yes     no      maybe   yes
14  yes     no      maybe   yes     no      maybe   yes
15  maybe   no      no      maybe   no      no      maybe
16  maybe   yes     no      maybe   yes     no      maybe
17  maybe   yes     no      maybe   yes     no      maybe
18  no      maybe   no      no      maybe   no      no
19  no      maybe   yes     no      maybe   yes     no
20  no      maybe   yes     no      maybe   yes     no
21  no      no      maybe   no      no      maybe   no
22  yes     no      maybe   yes     no      maybe   yes
23  yes     no      maybe   yes     no      maybe   yes

Okay, great!
Basically, this is the same table you see on the website, but without the electricity icons and transposed, as in the mobile version.
d['preset']['time_type']:

{'yes': 'Світло є', 'maybe': 'Можливо відключення', 'no': 'Світла немає'}

Modify a table

Judging by your screenshot, this is what you want to get. As far as I understand, it's about collapsing consecutive 'yes' and 'maybe' values into one row per time period, where the periods of different rows may overlap.
That's challenging, but it can be done.

from operator import itemgetter
from itertools import groupby

row = ['']*len(df.columns)
df = df.replace(['no'],'').replace(['yes','maybe'],True)
collapsed_df = pd.DataFrame(columns=df.columns)
for col_ix, col in enumerate(df.columns):
    for k,g in groupby(enumerate(df.groupby(df[col], axis=0).get_group(True)[col].index), lambda x: x[0]-x[1]):
        intervals = list(map(itemgetter(1), g))
        interval = pd.Interval(intervals[0], intervals[-1] + 1, closed='both')
        if interval not in collapsed_df.index:
            collapsed_df.loc[interval] = list(row)
        collapsed_df.loc[interval].iloc[col_ix] = True
df = collapsed_df.sort_index()
df
Output:
            1       2       3       4       5       6       7
[0, 3]              True                    True        
[1, 6]                      True                    True    
[4, 9]      True                    True                    True
[7, 12]             True                    True        
[10, 15]                    True                    True    
[13, 18]    True                    True                    True
[16, 21]            True                    True        
[19, 24]                    True                    True    
[22, 24]    True                    True                    True

I'm not going to describe in detail the magic behind collapsing the columns, as the answer would become too long. And I'm quite sure that this piece of code can be improved.
In a few words, I iterate through each column to find groups of consecutive True values and collapse their indices. The collapsed indices are cast to intervals, and a True value is added to the row with the corresponding interval. A row is created with empty values on its first appearance.
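
The core of that trick can be shown without pandas. The sketch below uses the common idiom of grouping by the difference between a value and its position, which stays constant within a run of consecutive numbers. The helper name is hypothetical, and the sample hours are the 'yes' and 'maybe' rows of column 1 in the table printed earlier.

```python
from itertools import groupby


def collapse_hours(hours):
    """Collapse a sorted list of hours into (start, end) runs, end exclusive."""
    runs = []
    # hour - position is constant inside a run of consecutive hours.
    for _, group in groupby(enumerate(hours), key=lambda pair: pair[1] - pair[0]):
        run = [hour for _, hour in group]
        runs.append((run[0], run[-1] + 1))
    return runs


monday = [4, 5, 6, 7, 8, 13, 14, 15, 16, 17, 22, 23]
print(collapse_hours(monday))  # [(4, 9), (13, 18), (22, 24)]
```

These three runs correspond to the rows [4, 9], [13, 18] and [22, 24] that are marked True for column 1 in the collapsed table.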

Anyway, done.
It has the same layout as your screenshot, but the data is different because we're on a different day and the data has changed since then.
Now what is left is to cast the index values that stand for hour intervals to hour strings, rename the columns and prettify the table to match your screenshot.

Final touch

  • download images and encode them to base64
  • replace True values with <img> tag and binary source
  • cast index to string type time periods
  • assign column names
  • put an index name, here I use df.columns.name as otherwise, by naming index, table head will have two rows
  • style the table
    • collapse table, add gray border and change font-size
    • color the header background, show text as black
    • put a line separating 'Години' from week names as shown on your screenshot
    • add border between columns, change cells size
    • adjust font-weight
    • make current weekday bold
    • change icons size
    • set background color for filled cells
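
As a small illustration of the first two bullets, base64 bytes can be decoded to text directly instead of being cleaned up from their str() representation. The helper names are hypothetical, and the 'image/png' default is an assumption based on the .png file names used in this answer.

```python
from base64 import b64encode


def to_data_uri(raw: bytes, mime: str = 'image/png') -> str:
    """Build a data URI from raw image bytes."""
    encoded = b64encode(raw).decode('ascii')  # bytes -> plain ASCII text
    return 'data:{};base64,{}'.format(mime, encoded)


def img_tag(raw: bytes) -> str:
    """Wrap the data URI in an <img> tag."""
    return '<img src="{}">'.format(to_data_uri(raw))


print(img_tag(b'abc'))  # <img src="data:image/png;base64,YWJj">
```
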
from base64 import b64encode

img = {
    'maybe': b64encode(s.get(urllib.parse.urljoin(url,'media/page/maybe-electricity.png')).content),
    'no': b64encode(s.get(urllib.parse.urljoin(url,'media/page/no-electricity.png')).content)
}
# Embed the icon as a base64 data URI in place of every True value
df = df.replace(True, '<img src="data:image/webp;base64,' + re.sub(r"^b'|'$",'',str(img['no'])) + '"></img>')

df.index = ['{:02d}:00 – {:02d}:00'.format(i.left, i.right) for i in df.index]
df.columns = ['Пн','Вт','Ср','Чт','Пт','Сб','Нд']
df.columns.name = 'Години'

styled_df = df.style.set_table_styles([
    {'selector': '',
    'props': [
        ('border-collapse', 'collapse'),
        ('border', '1px solid #cfcfcf'),
        ('font-size', '20px')
    ]},
    {'selector': 'thead tr',
    'props': [
        ('background-color', '#ffe500'),
        ('color', 'black'),
        ('height', '70px')
    ]},
    {'selector': 'thead tr th:first-child',
    'props': [
        ('border', '1px solid #cfcfcf'),
        ('width', '240px')
    ]},
    {'selector': 'td',
    'props': [
        ('border-left', '1px solid #cfcfcf'),
        ('text-align', 'center'),
        ('width', '95px'),
        ('height', '56px')
    ]},
    {'selector': 'td, th',
    'props': [
        ('font-weight', 'lighter')
    ]},
    # +1 because the first header cell holds the 'Години' label
    {'selector': 'thead tr th:nth-child({})'.format(d['currentWeekDayIndex'] + 1),
    'props': [
        ('font-weight', 'bold')
    ]},
    {'selector': 'img',
    'props': [
        ('height', '23px'),
        ('width', '21px')
    ]},
    {'selector': 'td:has(> img)',
    'props': [
        ('background-color', '#f4f4f4')
    ]}
])

styled_df.to_html(escape=False, border=0, encoding='utf-8')
Output:

const image_bin = " B7Z 54C4igPGugfxIuQZfcw9Va7j6dggBJ FHefLj2YDYX7cprINPfBv2bh I95KIsSe3zbnnzJTwJ/gk0hs9BiCR3HzMUcDDyFgQJGP6pAdOFxvAi4R8YNIEzd7QZKwER 1BVhEoDIZLyIkYCjhzYMSVr0IgEn4EgyfAf/4MEbGpfQJDk7gNtl9AH/x4EScY92G4Jv3UjoKSlEpzkLDCQTHuybRI8gGjtB5KqAu2SIDsCBCoFBNojoZOBMIZSeNGu9jAMN0ZfpHoBVcwW 6EHXwYe bk7WgBhakVAyWwjwauMKiQBgVlJwNf/AAqQBQRmI6GfgQIsAYE XO2Pe1xPwlw7msA/Ot1JzyhIOA3lQQHSKDDMgl05xH9fn/7iUqODzzx09qAmlLxmhIDVr/jLVpVraIi8i8PoelonalhNoMj Vt98YEbTZgtwvVmnILMP6JBqajYSzHmdUiwBCSP/17AEh/sHteYNLAE4HWZlgJuS4OFyre6ymSwgdFgQ0Vnpzxj9DiVnwIiA6y5EoiUBh7RXuTveBwIMATL5fwUJrg/JLhBhCJDbARKUkBftnj5bJAnQ2AGSkICd8jY3V0iNAKmZ2i1iJGC738vd0QEwIQq4VFmQBJgSHC7NdyGCDqXwRf7953z6aAmUIgHrc3k fWh/5d/ewcTP4C7up0tvcEH GPcIXl7kP6IqhbUYClvq9Usbi/0Gaaude0iDA3s1WJfUrqyjgEMg0pSEiIRIXQyrufBPqtBQF2Ai5g1NSFAX4CPPDWlLaCACTPSIoSlBVcDNyjEqAkq0JGhHgOh8QUOCsoBr8QmTtARlAcZCBCG5IXB6bSrKAvhDYJncEDrCNxFVATgCcDvAW8kNTQlt7ATHJje0JKgJuNk8ITMtuaEhQTECOhaIFO1 enJDWoKiAPIMMCQ3tusUlJSgJoC4CHKh3RPKi0lQE0BZBGGaa4eT1JSQoBgB9ZpAkdTM3gKTWAkqGaGb/f7K3Vmc6WU42SGF/iS4R/i0IqBO7TtMaoqlvLiRoCSgehHk4eqF9MFnjgQlAcZOfXbQ7mW 9nYXqgQlAZOHQD9ok3GbGVVQJKgImLIIctju96AB6kpotBOknNyQoI4E8WEwLIIMzJ2Mvmk4uUE7vCDFpCEy7DM2sDHCO7khybhIwArZCgcqFQTcczBo639hndyQZlhCcfPHvXCtNRO0xVki8wC7mQOtc74cwsbu8DdP/wD8xrK5i7NwCwAAAABJRU5ErkJggg=="
var images = document.getElementsByTagName("img")
for (var i = 0; i < images.length; i++) {
    images[i].src = image_bin;
}
#T_b04e1  {
  border-collapse: collapse;
  border: 1px solid #cfcfcf;
  font-size: 20px;
}
#T_b04e1 thead tr {
  background-color: #ffe500;
  color: black;
  height: 70px;
}
#T_b04e1 thead tr th:first-child {
  border: 1px solid #cfcfcf;
  width: 240px;
}
#T_b04e1 td {
  border-left: 1px solid #cfcfcf;
  text-align: center;
  width: 95px;
  height: 56px;
}
#T_b04e1 td {
  font-weight: lighter;
}
#T_b04e1  th {
  font-weight: lighter;
}
#T_b04e1 thead tr th:nth-child(3) {
  font-weight: bold;
}
#T_b04e1 img {
  height: 23px;
  width: 21px;
}
#T_b04e1 td:has(> img) {
  background-color: #f4f4f4;
}
<table id="T_b04e1">
  <thead>
    <tr>
      <th  >Години</th>
      <th id="T_b04e1_level0_col0"  >Пн</th>
      <th id="T_b04e1_level0_col1"  >Вт</th>
      <th id="T_b04e1_level0_col2"  >Ср</th>
      <th id="T_b04e1_level0_col3"  >Чт</th>
      <th id="T_b04e1_level0_col4"  >Пт</th>
      <th id="T_b04e1_level0_col5"  >Сб</th>
      <th id="T_b04e1_level0_col6"  >Нд</th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <th id="T_b04e1_level0_row0"  >00:00 – 03:00</th>
      <td id="T_b04e1_row0_col0"  ></td>
      <td id="T_b04e1_row0_col1"  ><img></img></td>
      <td id="T_b04e1_row0_col2"  ></td>
      <td id="T_b04e1_row0_col3"  ></td>
      <td id="T_b04e1_row0_col4"  ><img></img></td>
      <td id="T_b04e1_row0_col5"  ></td>
      <td id="T_b04e1_row0_col6"  ></td>
    </tr>
    <tr>
      <th id="T_b04e1_level0_row1"  >01:00 – 06:00</th>
      <td id="T_b04e1_row1_col0"  ></td>
      <td id="T_b04e1_row1_col1"  ></td>
      <td id="T_b04e1_row1_col2"  ><img></img></td>
      <td id="T_b04e1_row1_col3"  ></td>
      <td id="T_b04e1_row1_col4"  ></td>
      <td id="T_b04e1_row1_col5"  ><img></img></td>
      <td id="T_b04e1_row1_col6"  ></td>
    </tr>
    <tr>
      <th id="T_b04e1_level0_row2"  >04:00 – 09:00</th>
      <td id="T_b04e1_row2_col0"  ><img></img></td>
      <td id="T_b04e1_row2_col1"  ></td>
      <td id="T_b04e1_row2_col2"  ></td>
      <td id="T_b04e1_row2_col3"  ><img></img></td>
      <td id="T_b04e1_row2_col4"  ></td>
      <td id="T_b04e1_row2_col5"  ></td>
      <td id="T_b04e1_row2_col6"  ><img></img></td>
    </tr>
    <tr>
      <th id="T_b04e1_level0_row3"  >07:00 – 12:00</th>
      <td id="T_b04e1_row3_col0"  ></td>
      <td id="T_b04e1_row3_col1"  ><img></img></td>
      <td id="T_b04e1_row3_col2"  ></td>
      <td id="T_b04e1_row3_col3"  ></td>
      <td id="T_b04e1_row3_col4"  ><img></img></td>
      <td id="T_b04e1_row3_col5"  ></td>
      <td id="T_b04e1_row3_col6"  ></td>
    </tr>
    <tr>
      <th id="T_b04e1_level0_row4"  >10:00 – 15:00</th>
      <td id="T_b04e1_row4_col0"  ></td>
      <td id="T_b04e1_row4_col1"  ></td>
      <td id="T_b04e1_row4_col2"  ><img></img></td>
      <td id="T_b04e1_row4_col3"  ></td>
      <td id="T_b04e1_row4_col4"  ></td>
      <td id="T_b04e1_row4_col5"  ><img></img></td>
      <td id="T_b04e1_row4_col6"  ></td>
    </tr>
    <tr>
      <th id="T_b04e1_level0_row5"  >13:00 – 18:00</th>
      <td id="T_b04e1_row5_col0"  ><img></img></td>
      <td id="T_b04e1_row5_col1"  ></td>
      <td id="T_b04e1_row5_col2"  ></td>
      <td id="T_b04e1_row5_col3"  ><img></img></td>
      <td id="T_b04e1_row5_col4"  ></td>
      <td id="T_b04e1_row5_col5"  ></td>
      <td id="T_b04e1_row5_col6"  ><img></img></td>
    </tr>
    <tr>
      <th id="T_b04e1_level0_row6"  >16:00 – 21:00</th>
      <td id="T_b04e1_row6_col0"  ></td>
      <td id="T_b04e1_row6_col1"  ><img></img></td>
      <td id="T_b04e1_row6_col2"  ></td>
      <td id="T_b04e1_row6_col3"  ></td>
      <td id="T_b04e1_row6_col4"  ><img></img></td>
      <td id="T_b04e1_row6_col5"  ></td>
      <td id="T_b04e1_row6_col6"  ></td>
    </tr>
    <tr>
      <th id="T_b04e1_level0_row7"  >19:00 – 24:00</th>
      <td id="T_b04e1_row7_col0"  ></td>
      <td id="T_b04e1_row7_col1"  ></td>
      <td id="T_b04e1_row7_col2"  ><img></img></td>
      <td id="T_b04e1_row7_col3"  ></td>
      <td id="T_b04e1_row7_col4"  ></td>
      <td id="T_b04e1_row7_col5"  ><img></img></td>
      <td id="T_b04e1_row7_col6"  ></td>
    </tr>
    <tr>
      <th id="T_b04e1_level0_row8"  >22:00 – 24:00</th>
      <td id="T_b04e1_row8_col0"  ><img></img></td>
      <td id="T_b04e1_row8_col1"  ></td>
      <td id="T_b04e1_row8_col2"  ></td>
      <td id="T_b04e1_row8_col3"  ><img></img></td>
      <td id="T_b04e1_row8_col4"  ></td>
      <td id="T_b04e1_row8_col5"  ></td>
      <td id="T_b04e1_row8_col6"  ><img></img></td>
    </tr>
  </tbody>
</table>

The output is a copy-paste of what styled_df.to_html() returned, so it is fully generated.
I only added a small piece of JavaScript that assigns the repeated image binary to each <img src="">, to save characters in this answer. This is the only thing done manually in making the snippet; you can automate it with a regex or other means if you need to.
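
One way that automation might look is sketched below. The helper is hypothetical; it assumes every image tag has exactly the form <img src="data:..."> produced by the replace step above, and it returns the stripped HTML together with the distinct sources so they can be emitted once.

```python
import re

# Matches only the opening tag, so a following </img> is left in place.
IMG_WITH_DATA_URI = re.compile(r'<img src="(data:[^"]+)">')


def strip_inline_images(html: str):
    """Replace inline data-URI images with bare <img> tags."""
    sources = set(IMG_WITH_DATA_URI.findall(html))
    stripped = IMG_WITH_DATA_URI.sub('<img>', html)
    return stripped, sources


cell = '<td><img src="data:image/png;base64,QUJD"></img></td>'
stripped, sources = strip_inline_images(cell * 2)
print(stripped)  # <td><img></img></td><td><img></img></td>
print(sources)   # {'data:image/png;base64,QUJD'}
```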

Output can be saved to a file by adding buf:

styled_df.to_html(buf='lovely_table.html', escape=False, border=0, encoding='utf-8')

You may now play with the column collapsing and do it separately for 'yes' and 'maybe' to get different results that suit your needs.
