Hi everyone, I would like to scrape the N.01 value, but to no avail. I'm a self-taught newbie, so any help with this problem would be appreciated :)
This is the HTML part that I am interested in:
<div class="justify-between font-semibold">
<div id="14001" onclick="dunSelected(this.id)">
<span id="kod-dun">N.01</span>
<span id="nama-dun">BULOH KASAP</span>
</div>
</div>
This is my code:
from bs4 import BeautifulSoup
import requests
html_text = requests.get('https://mysprsemak.spr.gov.my/semakan/keputusan/pru-dun').text
soup = BeautifulSoup(html_text,'lxml')
dun = soup.find('div', class_='justify-between font-semibold')
a = dun.find('span', attrs={'id': 'kod-dun'})
b = dun.find('span', attrs={'class': ''})
print(a)
print(b)
The result is this:
<span id="kod-dun"><!--Kod DUN--></span>
<span id="kod-dun"><!--Kod DUN--></span>
CodePudding user response:
You cannot do this directly with Beautiful Soup, because the N.01 value is not contained in the HTML you get back from https://mysprsemak.spr.gov.my/semakan/keputusan/pru-dun. The value is downloaded with an AJAX request and then inserted into the DOM by JavaScript:
$.ajax({
    type: "POST",
    url: 'https://mysprsemak.spr.gov.my/semakan/keputusan/senaraiDunPruDun',
    dataType: "json",
    data: formData,
    success: function (data) {
        if (data.data == true) {
            $("#parlimenDunSection").removeClass('hidden');
            $("#tab-dun-web").empty();
            $.each(data['dun'], function (key, value) {
                // Output the DUN entries ("Keluarkan Dun")
                $("#tab-dun-web").append('<div class="justify-between font-semibold"><div id="' + value.AppendKod + '" onclick="dunSelected(this.id)"><span id="kod-dun">N.' + value.KodBahagianPilihanRaya + '</span><span id="nama-dun">' + value.NamaBahagianPilihanRaya + '</span></div></div>');
            });
        }
    },
});
Note that the HTML added using the append method contains the following snippet:
<span id="kod-dun">N.' + value.KodBahagianPilihanRaya + '</span>
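In Python terms, that browser-side concatenation amounts to the following (the record here is a made-up sample shaped like one entry of the AJAX response; only the field names come from the JavaScript above):

```python
# Made-up sample record shaped like one entry of the response's "dun" list;
# the field names are taken from the JavaScript, the values are invented.
value = {
    "AppendKod": "14001",
    "KodBahagianPilihanRaya": "01",
    "NamaBahagianPilihanRaya": "BULOH KASAP",
}

# Mirror of the jQuery string concatenation that builds the markup
html = (
    '<div><div id="' + value["AppendKod"] + '" onclick="dunSelected(this.id)">'
    '<span id="kod-dun">N.' + value["KodBahagianPilihanRaya"] + '</span>'
    '<span id="nama-dun">' + value["NamaBahagianPilihanRaya"] + '</span>'
    '</div></div>'
)
print(html)
```

The N.01 you are after only exists once this concatenation has run in the browser, which is why the raw HTML shows an empty span.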
There are two basic routes you can take to scrape this:
- Use the API at https://mysprsemak.spr.gov.my/semakan/keputusan/senaraiDunPruDun directly. This is the simplest way in theory, but is made more complicated due to the API requiring XSRF tokens before it will return the data to you. See e.g. this post for how to explore the API using the developer tools in your browser.
- Use Selenium, possibly with Selenium IDE, to control an actual browser using code. This has the advantage that the browser will run the JavaScript for you, so you can fetch the value from the main page after the JavaScript has finished running. However, this approach takes a bit more work to set up, and the resulting scripts tend to be slower and more error-prone. (You need to script the browser interactions necessary to get the data, such as clicking buttons and opening menus, and these scripts are prone to synchronisation problems where the script tries to interact with the browser before the browser is ready. Also, these scripts tend to break more easily if the web page changes.)
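To give a feel for route 1: the token the API wants is embedded in the page's inline JavaScript, so it can be dug out with a regular expression. A minimal sketch, using a hypothetical formData blob in place of the real page source:

```python
import re
import json

# Hypothetical stand-in for the page source; the real page embeds a larger
# formData object, of which only the _token field matters here.
page_source = '''
<script>
var formData = {"_token":"abc123",
"PilihanrayaId":"",
"RefNegeriId":""};
</script>
'''

# Capture the formData object, keep only its first key/value pair,
# and re-close the brace so the fragment parses as JSON.
json_str = re.search(r'var formData = ({.*});', page_source, flags=re.DOTALL).group(1)
json_str = json_str.split(',')[0] + '}'
token = json.loads(json_str)['_token']
print(token)  # abc123
```

The answer below uses exactly this trick against the live page.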
CodePudding user response:
As stated, the best way is the API. And as stated, it is made a little more complicated in that you need to provide a token and a few other parameters. The good news is that you can grab that info from the site itself.
Note, you'll need to pip install choice to implement the user input option I provide:
import requests
from bs4 import BeautifulSoup
import re
import json
import choice
import pandas as pd

s = requests.Session()
url = 'https://mysprsemak.spr.gov.my/semakan/keputusan/pru-dun'
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/96.0.4664.110 Safari/537.36'}
response = s.get(url, headers=headers).text

# Carry the session cookies over to the POST request
cookies = s.cookies.get_dict()
cookieStr = ''
for k, v in cookies.items():
    cookieStr += f'{k}={v};'
headers.update({'Cookie': cookieStr})

# Build a menu of elections from the page's <option> tags
soup = BeautifulSoup(response, 'html.parser')
options = soup.find_all('option')[1:]
optionsDict = {each.text: [each['value'], each['data-negeri']] for each in options}
getInput = choice.Menu(list(optionsDict.keys())).ask()

# Pull the XSRF token out of the inline formData JavaScript object
jsonStr = re.search('var formData = ({.*});', response, flags=re.DOTALL).group(1).split(',')[0] + '}'
token = json.loads(jsonStr)['_token']

postURL = 'https://mysprsemak.spr.gov.my/semakan/keputusan/senaraiDunPruDun'
payload = {
    '_token': token,
    'PilihanrayaId': optionsDict[getInput][0],
    'RefNegeriId': optionsDict[getInput][1]}
jsonData = s.post(postURL, headers=headers, data=payload).json()
df = pd.json_normalize(jsonData['dun'])
Output:
Make a choice:
0: PRU DEWAN NEGERI JOHOR KE-15
1: PRU DUN SARAWAK KE-12 (2021)
2: PRU DUN MELAKA KE-15 (2021)
3: PRU DUN SABAH KALI KE-16 (2020)
4: PRU DUN SARAWAK KE-11 (2016)
Enter number or name; return for next page
? 0
Dataframe:
print(df)
BahagianPilihanRayaId ... AppendKod
0 956EC2DB-2952-414E-835D-831611009C27 ... 14001
1 956EC2DB-EAD5-4656-B9A5-3DB0AA626BC1 ... 14002
2 956EC2DC-18ED-4711-BCD1-4D2CFB473EE1 ... 14103
3 956EC2DB-5E73-47FB-9F23-4A7F6EE26EDC ... 14104
4 956EC2DB-DEC7-45F9-9F60-9B398D678D2F ... 14205
5 956EC2DC-5776-4062-8E43-E8F18454DAC1 ... 14206
6 956EC2DC-B0FD-45D3-90A2-F1F8C4CA31B2 ... 14307
7 956EC2DC-D419-4801-A11D-65C04F7CF7FC ... 14308
8 956EC2DB-F683-4300-B9A0-7679C676DCCF ... 14409
9 956EC2DA-827A-42DF-8D34-3842D49113FE ... 14410
10 956EC2DB-CE81-4EB4-A6A4-D003EF5F845F ... 14411
11 956EC2DC-0DF7-45A9-B57D-28F252EB4349 ... 14512
12 956EC2DA-9C70-4DB1-A67F-BB5D0DC7AB49 ... 14513
13 956EC2DC-01E6-4D8E-A9A0-930EE420CE04 ... 14514
14 956EC2DC-630F-4EBF-9C1B-8F93FFAEDA96 ... 14615
15 956EC2DA-D04A-4E9E-BBA6-F5487168BB4C ... 14616
16 956EC2DB-6B1F-4EFC-9D19-16A84738E680 ... 14717
17 956EC2DB-35E7-4C14-AA60-F0B79B6E7CB2 ... 14718
18 956EC2DB-927C-4705-A347-F0D81D5D3BE4 ... 14819
19 956EC2DA-EB67-4359-9FC1-EF749DD6E296 ... 14820
20 956EC2DC-A4FC-42F8-9FB8-7669D120EC5F ... 14921
21 956EC2DA-C381-47C8-9014-25EFCD8B064C ... 14922
22 956EC2DC-EC01-4D68-8171-8BE026CCBF83 ... 15023
23 956EC2DC-6E20-4EB1-9C47-88D9B72E5889 ... 15024
24 956EC2DB-523C-4998-AC02-33F602E07305 ... 15025
25 956EC2DC-C7DC-47EC-BDD8-73A1440788BE ... 15126
26 956EC2DB-84E2-4BB6-97FD-C996983A76B1 ... 15127
27 956EC2DA-59D1-420C-8155-80F2156FB446 ... 15228
28 956EC2DA-70E6-40D8-A3A3-E8E4FBC45E40 ... 15229
29 956EC2DC-4799-4624-9C07-18D1DAC6B0F7 ... 15330
30 956EC2DD-20B8-4F28-90B9-1519628EF751 ... 15331
31 956EC2DA-4BFD-4909-8D93-8A2207FD247B ... 15432
32 956EC2DA-B4A3-49BD-8CC0-BDCE81B7071E ... 15433
33 956EC2DA-A8EE-4244-9052-9D14ECDD1A8C ... 15534
34 956EC2DC-3B33-4627-BFB5-DAF87988909A ... 15535
35 956EC2DC-3009-44FA-874D-C575599210DC ... 15636
36 956EC2DC-BC61-4202-962A-66415635FDDE ... 15637
37 956EC2DC-24AE-41BC-AD3E-7D48D8E752D3 ... 15738
38 956EC2DB-02E6-439A-9E97-B18F65EB9432 ... 15739
39 956EC2DC-928E-4837-A049-18BFD698A8B2 ... 15840
40 956EC2DB-782A-4E8E-ADE3-07A9AC786F9E ... 15841
41 956EC2DC-8620-4CD0-8225-60DB217E5EB1 ... 15942
42 956EC2DA-F787-4965-A099-E29794CBA951 ... 15943
43 956EC2DB-BF64-4B83-9240-49701039EA2A ... 16044
44 956EC2DC-7A25-4870-9276-598C7041CD27 ... 16045
45 956EC2DB-1238-45A6-A935-469825659736 ... 16146
46 956EC2DD-06B6-42E1-9BB8-FC300EF40688 ... 16147
47 956EC2DC-DFD2-45A8-8B15-5C831DD0E14C ... 16248
48 956EC2DA-DE6C-4A14-93CA-BB757A5085F7 ... 16249
49 956EC2DA-901D-465F-B795-7CF2D807E394 ... 16350
50 956EC2DB-42DB-4949-B4DE-025EAAB779B9 ... 16351
51 956EC2DB-A09B-4A7A-A122-15208F9D0155 ... 16352
52 956EC2DA-6599-4283-8DC2-1496100717E3 ... 16453
53 956EC2DB-1DD6-4659-AF97-D448477A546C ... 16454
54 956EC2DB-AF15-4069-86A4-035F2E3A7ED4 ... 16555
55 956EC2DC-F814-4B4C-A65F-79B75AD25C62 ... 16556
[56 rows x 4 columns]
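As a final step for the original question: the N.xx labels are not stored in the API response, but they are trivial to rebuild from the KodBahagianPilihanRaya column. A self-contained sketch with a mocked response (field names come from the site's JavaScript; the values, and the second seat name, are invented placeholders):

```python
import pandas as pd

# Mocked API response shaped like the real JSON; in the full script above,
# jsonData comes from the authenticated POST request instead.
jsonData = {
    "dun": [
        {"AppendKod": "14001", "KodBahagianPilihanRaya": "01",
         "NamaBahagianPilihanRaya": "BULOH KASAP"},
        {"AppendKod": "14002", "KodBahagianPilihanRaya": "02",
         "NamaBahagianPilihanRaya": "EXAMPLE SEAT"},
    ]
}

df = pd.json_normalize(jsonData["dun"])

# Reattach the "N." prefix that the page's JavaScript adds client-side
df["kod_dun"] = "N." + df["KodBahagianPilihanRaya"]
print(df[["kod_dun", "NamaBahagianPilihanRaya"]])
```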