I'm new to Scrapy and have been trying to crawl the table data from https://www.citypopulation.de/en/southkorea/busan/admin/, but one record from the table is missing.
I'm able to crawl the rest of the records with no issue, for example:
<tbody >
<tr itemscope="" itemtype="http://schema.org/AdministrativeArea" onclick="javascript:sym('21080')"><td id="i21080" data-wiki="Buk District, Busan" data-wd="Q50394" data-area="39.726" data-density="7052.7362"><a href="javascript:sym('21080')"><span itemprop="name">Buk-gu</span></a> [<span itemprop="name">North Distrikt</span>]</td><td >City District</td><td ><span itemprop="name">북구</span></td><td >329,336</td><td >302,141</td><td >299,182</td><td >280,177</td><td ><a itemprop="url" href="/en/southkorea/busan/admin/21080__buk_gu/">→</a></td></tr>
</tbody>
The row is skipped when there is no link inside its last <td >, for example:
<tbody >
<tr><td >Busan</td><td >Metropolitan City</td><td ><span itemprop="name">부산광역시</span></td><td >3,523,582</td><td >3,414,950</td><td >3,448,737</td><td >3,349,016</td><td ></td></tr>
</tbody>
Code:
import scrapy
import scrapy.linkextractors
import scrapy.spiders

class WebsiteItem(scrapy.Item):
    item_name = scrapy.Field()
    item_status = scrapy.Field()

class WebsiteSpider(scrapy.spiders.CrawlSpider):
    name = "posts"
    start_urls = ["https://www.citypopulation.de/en/southkorea/"]

    rules = (
        scrapy.spiders.Rule(scrapy.linkextractors.LinkExtractor(restrict_css="div#prov_div > ul > li > a"), follow=True),
        scrapy.spiders.Rule(scrapy.linkextractors.LinkExtractor(restrict_css="table#tl > tbody > tr > td"), callback="parse"),
    )

    def parse(self, response):
        website_item = WebsiteItem()
        website_item['item_name'] = response.css("td.rname span::text").get()
        website_item['item_status'] = response.css("td.rstatus::text").get()
        return website_item
I assume this is because the rule only follows extracted links, but I have no idea how to solve it while still looping through every record in the table.
rules = (
    scrapy.spiders.Rule(scrapy.linkextractors.LinkExtractor(restrict_css="div#prov_div > ul > li > a"), follow=True),
    scrapy.spiders.Rule(scrapy.linkextractors.LinkExtractor(restrict_css="table#tl > tbody > tr > td"), callback="parse"),
)
I'd appreciate it if anyone could point out what I am missing here.
CodePudding user response:
This is one way to get those name/status pairs:
import pandas as pd
import scrapy

class SkSpider(scrapy.Spider):
    name = 'sk'
    allowed_domains = ['citypopulation.de']
    start_urls = ["https://www.citypopulation.de/en/southkorea/busan/admin/"]

    def parse(self, response):
        # Let pandas parse the first HTML table; every row is kept,
        # whether or not it contains a link
        df = pd.read_html(response.text)[0]
        for _, row in df.iterrows():
            yield {
                'name': row['Name'],
                'status': row['Status']
            }
Run it with scrapy crawl sk -o sk_areas.json, and it will produce a JSON file with this structure:
[
{"name": "Buk-gu [North Distrikt]", "status": "City District"},
{"name": "Deokcheon 1-dong", "status": "Quarter"},
{"name": "Deokcheon 2-dong", "status": "Quarter"},
{"name": "Deokcheon 3-dong", "status": "Quarter"},
{"name": "Geumgok-dong", "status": "Quarter"},
{"name": "Gupo 1-dong", "status": "Quarter"},
{"name": "Gupo 2-dong", "status": "Quarter"},
{"name": "Gupo 3-dong", "status": "Quarter"},
[...]
{"name": "Yeonsan 6-dong", "status": "Quarter"},
{"name": "Yeonsan 8-dong", "status": "Quarter"},
{"name": "Yeonsan 9-dong", "status": "Quarter"},
{"name": "Busan", "status": "Metropolitan City"}
]
As you can see, it will include Busan as well.