Webscrape a SPAN using Python and BeautifulSoup


Hi everyone, I would like to scrape the value N.01, but to no avail. I'm a self-taught newbie, so any help solving this problem would be appreciated :)

This is the HTML part that I am interested in:

    <div>

            <div id="14001" onclick="dunSelected(this.id)">

                <span id="kod-dun">N.01</span>
                <span id="nama-dun">BULOH KASAP</span>
            </div>
     </div>

This is my code:

from bs4 import BeautifulSoup
import requests

html_text = requests.get('https://mysprsemak.spr.gov.my/semakan/keputusan/pru-dun').text
soup = BeautifulSoup(html_text,'lxml')
dun = soup.find('div', class_='justify-between font-semibold')
a = dun.find('span', attrs={'id': 'kod-dun'})
b = dun.find('span', attrs={'class': ''})

print(a)
print(b)

The result is this:

<span id="kod-dun"><!--Kod DUN--></span>
<span id="kod-dun"><!--Kod DUN--></span>

CodePudding user response:

You cannot do this directly with Beautiful Soup, as the N.01 value is not contained in the HTML you get back from https://mysprsemak.spr.gov.my/semakan/keputusan/pru-dun. The N.01 value is downloaded using an AJAX request, and then inserted into the DOM using JavaScript.

$.ajax({
    type: "POST",
    url: 'https://mysprsemak.spr.gov.my/semakan/keputusan/senaraiDunPruDun',
    dataType: "json",
    data: formData,
    success: function (data) {
        if(data.data == true) {
            $("#parlimenDunSection").removeClass('hidden');
            $("#tab-dun-web").empty();

            $.each(data['dun'], function (key, value) {
                // Output the DUN (original comment: "Keluarkan Dun")
                $("#tab-dun-web").append('<div><div id="' + value.AppendKod + '" onclick="dunSelected(this.id)"><span id="kod-dun">N.' + value.KodBahagianPilihanRaya + '</span><span id="nama-dun">' + value.NamaBahagianPilihanRaya + '</span></div></div>');
            });
        }
    },
});

Note that the HTML added using the append method contains the following snippet:

<span id="kod-dun">N.' + value.KodBahagianPilihanRaya + '</span>

There are two basic routes you can take to scrape this:

  1. Use the API at https://mysprsemak.spr.gov.my/semakan/keputusan/senaraiDunPruDun directly. This is the simplest way in theory, but is made more complicated due to the API requiring XSRF tokens before it will return the data to you. See e.g. this post for how to explore the API using the developer tools in your browser.
  2. Use Selenium, possibly with Selenium IDE, to control an actual browser using code. This has the advantage that the browser will run the JavaScript for you, so you can fetch the value from the main page after the JavaScript has finished running. However, this approach takes a bit more work to set up, and the resulting scripts tend to be slower and more error-prone. (You need to script the browser interactions necessary to get the data, such as clicking buttons and opening menus, and these scripts are prone to synchronisation problems where the script tries to interact with the browser before the browser is ready. Also, these scripts tend to break more easily if the web page changes.)
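To give a feel for the XSRF-token step in route 1, here is a minimal sketch of pulling a token out of an inline script with a regular expression. The HTML string is an invented stand-in for the real page (the site does embed a `var formData = {...};` object, as the second answer below shows, but the token value here is made up):

```python
import json
import re

# Invented stand-in for the page source; the real site embeds a similar
# "var formData = {...};" object in an inline <script> tag.
page = """
<script>
    var formData = {"_token": "abc123", "other": "x"};
</script>
"""

# Pull the object literal out of the script and parse it as JSON.
match = re.search(r'var formData = ({.*?});', page, flags=re.DOTALL)
form_data = json.loads(match.group(1))
token = form_data['_token']
print(token)  # abc123
```

On the live page the object may span multiple lines and hold more keys than shown here, which is why the DOTALL flag and a non-greedy match are used.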

CodePudding user response:

As stated, the best way is the API. And as stated, it's made a little more complicated in that you need to provide a token and a few other parameters. The good news is, you can grab that info from the site itself.

Note: you'll need to pip install choice to use the user-input option I provide:

import requests
from bs4 import BeautifulSoup
import re 
import json
import choice
import pandas as pd

s = requests.Session()

url = 'https://mysprsemak.spr.gov.my/semakan/keputusan/pru-dun'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
response = s.get(url, headers=headers).text
cookies = s.cookies.get_dict()
# Rebuild the session cookies as a single header string for the POST.
cookieStr = ''
for k, v in cookies.items():
    cookieStr += f'{k}={v};'

headers.update({'Cookie':cookieStr})
soup = BeautifulSoup(response, 'html.parser')
options = soup.find_all('option')[1:]

optionsDict = {each.text: [each['value'],each['data-negeri']] for each in options}
getInput = choice.Menu(list(optionsDict.keys())).ask()


# The page embeds the XSRF token in an inline "var formData = {...};" script.
# Keep only the first key/value pair of that object and re-close the brace.
jsonStr = re.search('var formData = ({.*});', response, flags=re.DOTALL).group(1).split(',')[0] + '}'
token = json.loads(jsonStr)['_token']

postURL = 'https://mysprsemak.spr.gov.my/semakan/keputusan/senaraiDunPruDun'

payload = {
    '_token':token,
    'PilihanrayaId':optionsDict[getInput][0],
    'RefNegeriId':optionsDict[getInput][1]}


jsonData = s.post(postURL, headers=headers, data=payload).json()
df = pd.json_normalize(jsonData['dun'])

Output:

Make a choice:
 0: PRU DEWAN NEGERI JOHOR KE-15
 1: PRU DUN SARAWAK KE-12 (2021)
 2: PRU DUN MELAKA KE-15 (2021)
 3: PRU DUN SABAH KALI KE-16 (2020)
 4: PRU DUN SARAWAK KE-11 (2016)

Enter number or name; return for next page

? 0

Dataframe:

print(df)
                   BahagianPilihanRayaId  ... AppendKod
0   956EC2DB-2952-414E-835D-831611009C27  ...     14001
1   956EC2DB-EAD5-4656-B9A5-3DB0AA626BC1  ...     14002
2   956EC2DC-18ED-4711-BCD1-4D2CFB473EE1  ...     14103
3   956EC2DB-5E73-47FB-9F23-4A7F6EE26EDC  ...     14104
4   956EC2DB-DEC7-45F9-9F60-9B398D678D2F  ...     14205
5   956EC2DC-5776-4062-8E43-E8F18454DAC1  ...     14206
6   956EC2DC-B0FD-45D3-90A2-F1F8C4CA31B2  ...     14307
7   956EC2DC-D419-4801-A11D-65C04F7CF7FC  ...     14308
8   956EC2DB-F683-4300-B9A0-7679C676DCCF  ...     14409
9   956EC2DA-827A-42DF-8D34-3842D49113FE  ...     14410
10  956EC2DB-CE81-4EB4-A6A4-D003EF5F845F  ...     14411
11  956EC2DC-0DF7-45A9-B57D-28F252EB4349  ...     14512
12  956EC2DA-9C70-4DB1-A67F-BB5D0DC7AB49  ...     14513
13  956EC2DC-01E6-4D8E-A9A0-930EE420CE04  ...     14514
14  956EC2DC-630F-4EBF-9C1B-8F93FFAEDA96  ...     14615
15  956EC2DA-D04A-4E9E-BBA6-F5487168BB4C  ...     14616
16  956EC2DB-6B1F-4EFC-9D19-16A84738E680  ...     14717
17  956EC2DB-35E7-4C14-AA60-F0B79B6E7CB2  ...     14718
18  956EC2DB-927C-4705-A347-F0D81D5D3BE4  ...     14819
19  956EC2DA-EB67-4359-9FC1-EF749DD6E296  ...     14820
20  956EC2DC-A4FC-42F8-9FB8-7669D120EC5F  ...     14921
21  956EC2DA-C381-47C8-9014-25EFCD8B064C  ...     14922
22  956EC2DC-EC01-4D68-8171-8BE026CCBF83  ...     15023
23  956EC2DC-6E20-4EB1-9C47-88D9B72E5889  ...     15024
24  956EC2DB-523C-4998-AC02-33F602E07305  ...     15025
25  956EC2DC-C7DC-47EC-BDD8-73A1440788BE  ...     15126
26  956EC2DB-84E2-4BB6-97FD-C996983A76B1  ...     15127
27  956EC2DA-59D1-420C-8155-80F2156FB446  ...     15228
28  956EC2DA-70E6-40D8-A3A3-E8E4FBC45E40  ...     15229
29  956EC2DC-4799-4624-9C07-18D1DAC6B0F7  ...     15330
30  956EC2DD-20B8-4F28-90B9-1519628EF751  ...     15331
31  956EC2DA-4BFD-4909-8D93-8A2207FD247B  ...     15432
32  956EC2DA-B4A3-49BD-8CC0-BDCE81B7071E  ...     15433
33  956EC2DA-A8EE-4244-9052-9D14ECDD1A8C  ...     15534
34  956EC2DC-3B33-4627-BFB5-DAF87988909A  ...     15535
35  956EC2DC-3009-44FA-874D-C575599210DC  ...     15636
36  956EC2DC-BC61-4202-962A-66415635FDDE  ...     15637
37  956EC2DC-24AE-41BC-AD3E-7D48D8E752D3  ...     15738
38  956EC2DB-02E6-439A-9E97-B18F65EB9432  ...     15739
39  956EC2DC-928E-4837-A049-18BFD698A8B2  ...     15840
40  956EC2DB-782A-4E8E-ADE3-07A9AC786F9E  ...     15841
41  956EC2DC-8620-4CD0-8225-60DB217E5EB1  ...     15942
42  956EC2DA-F787-4965-A099-E29794CBA951  ...     15943
43  956EC2DB-BF64-4B83-9240-49701039EA2A  ...     16044
44  956EC2DC-7A25-4870-9276-598C7041CD27  ...     16045
45  956EC2DB-1238-45A6-A935-469825659736  ...     16146
46  956EC2DD-06B6-42E1-9BB8-FC300EF40688  ...     16147
47  956EC2DC-DFD2-45A8-8B15-5C831DD0E14C  ...     16248
48  956EC2DA-DE6C-4A14-93CA-BB757A5085F7  ...     16249
49  956EC2DA-901D-465F-B795-7CF2D807E394  ...     16350
50  956EC2DB-42DB-4949-B4DE-025EAAB779B9  ...     16351
51  956EC2DB-A09B-4A7A-A122-15208F9D0155  ...     16352
52  956EC2DA-6599-4283-8DC2-1496100717E3  ...     16453
53  956EC2DB-1DD6-4659-AF97-D448477A546C  ...     16454
54  956EC2DB-AF15-4069-86A4-035F2E3A7ED4  ...     16555
55  956EC2DC-F814-4B4C-A65F-79B75AD25C62  ...     16556

[56 rows x 4 columns]