I'm trying to scrape data from the website "https://quranromanurdu.com/chapter/1". I want only the text content of the element with id contentpara, returned in JSON format. The code below gives me the HTML content, but I want to convert it to JSON. I tried to convert it, but I'm getting an error. Could somebody please help me fix it?
Python code:
import requests
from bs4 import BeautifulSoup
import json
import codecs
URL = "https://quranromanurdu.com/chapter/1"
page = requests.get(URL)
soup = BeautifulSoup(page.content, "html.parser")
table = soup.findAll('div',attrs={"id":"contentpara"})
data0 = json.loads(table)
print(data0)
Error:
line 24, in <module>
data0 = json.loads(table)
File "C:\Users\arbazalx\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 339, in loads
raise TypeError(f'the JSON object must be str, bytes or bytearray, '
TypeError: the JSON object must be str, bytes or bytearray, not ResultSet
CodePudding user response:
The error tells you what's wrong: findAll returns a ResultSet, which you need to convert to a str before passing it to json.loads. This answer should help: https://stackoverflow.com/a/20969389/16716191. Also, findAll was renamed to find_all.
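As a minimal sketch of that conversion, assuming the page has a single div with id contentpara. The inline HTML here is a made-up stand-in for the real page so the example runs without a network request, and the "content" key is just an illustrative name:

```python
import json
from bs4 import BeautifulSoup

# Hypothetical stand-in for page.content; the real page markup may differ.
html = '<div id="contentpara"><p>1. first line</p><p>2. second line</p></div>'

soup = BeautifulSoup(html, "html.parser")

# find() returns a single Tag (or None), unlike find_all(), which returns a ResultSet.
div = soup.find("div", attrs={"id": "contentpara"})

# get_text() yields a plain str. json.dumps() serialises Python data to JSON text,
# whereas json.loads() goes the other way and parses JSON text.
content = div.get_text(separator="\n", strip=True) if div is not None else ""
payload = json.dumps({"content": content}, ensure_ascii=False)
print(payload)
```

Note that the original code called json.loads, which parses JSON; producing JSON from scraped text is the job of json.dumps.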
CodePudding user response:
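You can do it like this:
... your code ...
table = soup.find_all('div', attrs={"id": "contentpara"})
# Split the text of the first match into lines and drop the empty ones
values = list(filter(None, table[0].text.split('\n')))
# Skip the first line (presumably a heading) and remove non-breaking spaces
values = list(filter(None, [value.replace("\xa0", "") for value in values[1:]]))
d = {}
for item in values:
    # Split "1. text" at the first dot into the verse number and its text
    key, value = item.split('.', maxsplit=1)
    d[key] = value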
Output:
{'1': ' Allah ke naam se jo Rehman o Raheem hai.',
'2': ' Tareef Allah hi ke liye hai jo tamaam qayinaat ka Rubb hai.',
'3': ' Rehman aur Raheem hai.',
'4': ' Roz e jaza ka maalik hai.',
'5': ' Hum teri hi ibadat karte hain aur tujh hi se madad maangte hain.',
'6': ' Humein seedha raasta dikha.',
'7': ' Un logon ka raasta jinpar tu nay inam farmaya, jo maatoob nahin huey (na unka jinpar tera gazab hota raha) , jo bhatke huey nahin hain.'}
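Since the question asks for JSON rather than a Python dict, here is a hedged sketch of the last step using only the standard library. The helper name verses_to_json is hypothetical, and the sample string is a shortened stand-in for the text of the contentpara div. Unlike the code above, it strips the leading spaces and skips any line that does not start with a number:

```python
import json

def verses_to_json(text):
    """Turn lines such as '1. some text' into a JSON object keyed by verse number."""
    verses = {}
    for line in text.split("\n"):
        # Replace non-breaking spaces and trim surrounding whitespace.
        line = line.replace("\xa0", " ").strip()
        number, sep, body = line.partition(".")
        # Skip blank lines, headings, and anything that does not start with a number.
        if not sep or not number.strip().isdigit():
            continue
        verses[number.strip()] = body.strip()
    # ensure_ascii=False keeps any non-ASCII characters readable in the output.
    return json.dumps(verses, ensure_ascii=False, indent=2)

# Shortened sample resembling the scraped text (assumed layout).
sample = (
    "Heading\n"
    "1.\xa0Allah ke naam se jo Rehman o Raheem hai.\n"
    "\n"
    "2. Tareef Allah hi ke liye hai jo tamaam qayinaat ka Rubb hai."
)
print(verses_to_json(sample))
```

With the real page, you would pass table[0].text to the helper instead of the sample string.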