I am trying to extract data from a site that has terrible HTML formatting: all the info I want is in the same div, separated only by line breaks. I am new to web scraping in general, so please bear with me.
https://wsldata.com/directory/record.cfm?LibID=48
In order to get the parts I need, I use:
details_raw = response.xpath('/html/body/div/table/tbody/tr/td/div/div/text()').getall()
returns
['\r\n',
'\r\n',
'\r\n',
'\r\n \r\n ',
'\r\n\t\t\t',
'\r\n ',
'\r\n ',
'\r\n ',
'\r\n\t\t\t\r\n\t\t\t',
'\r\n \r\n ',
'\r\n\t\t\tDirector',
'\r\n Ext: 5442',
'\r\n ',
'\r\n ',
'\r\n\t\t\t\r\n\t\t\t',
'\r\n \r\n ',
'\r\n\t\t\tAssistant Library Director',
'\r\n Ext: 5433',
'\r\n ',
'\r\n ',
'\r\n\t\t\t\r\n\t\t\t',
'\r\n \r\n ',
'\r\n\t\t\tYouth Services Librarian',
'\r\n ',
'\r\n ',
'\r\n ',
'\r\n\t\t\t\r\n\t\t\t',
'\r\n \r\n ',
'\r\n\t\t\tTechnical Services Librarian',
'\r\n Ext: 2558',
'\r\n ',
'\r\n ',
'\r\n\t\t\t\r\n\t\t\t',
'\r\n \r\n ',
'\r\n\t\t\tOutreach Librarian',
'\r\n ',
'\r\n ',
'\r\n ',
'\r\n\t\t\t\r\n\t\t\t',
'\r\n \r\n ',
'\r\n\t\t\tFoundation Executive Director',
'\r\n Ext: 5456',
'\r\n ',
'\r\n ',
'\r\n\t\t\t\r\n\t\t\t',
'\r\n \r\n',
'\r\n',
' \xa0|\xa0 ',
'\r\n']
I have managed to bring that into the desired format using the following code:
import scrapy
import re


class LibspiderSpider(scrapy.Spider):
    name = 'libspider'
    allowed_domains = ['wsldata.com']
    start_urls = ['https://wsldata.com/directory/record.cfm?LibID=48']
    # Note that start_urls contains multiple links, I just simplified it here to reduce clutter

    def parse(self, response):
        details_raw = response.xpath('/html/body/div/table/tbody/tr/td/div/div/text()').getall()
        details_clean = []
        titles = []
        details = []
        # Strip the tabs, carriage returns, newlines, runs of spaces and the non-breaking-space separator.
        for detail in details_raw:
            detail = re.sub(r'\t', '', detail)
            detail = re.sub(r'\n', '', detail)
            detail = re.sub(r'\r', '', detail)
            detail = re.sub(r' {2,}', '', detail)
            detail = re.sub(r' \xa0|\xa0 ', '', detail)
            detail = re.sub(r'\|', '', detail)
            detail = re.sub(r' E', 'E', detail)
            if detail == '':
                pass
            elif detail == '|':
                pass
            else:
                details_clean.append(detail)
                if detail[0:3] != 'Ext':
                    titles.append(detail)
        # Insert a '-' placeholder wherever a title is not followed by an 'Ext: ...' entry.
        for r in range(len(details_clean)):
            if r == 0:
                details.append(details_clean[r])
            else:
                if details_clean[r-1][0:3] != 'Ext' and details_clean[r][0:3] != 'Ext':
                    details.append('-')
                    details.append(details_clean[r])
                else:
                    details.append(details_clean[r])
        # Pair every title with the entry that follows it.
        output = []
        for t in range(len(details)//2):
            info = {
                "title": details[t*2],
                "phone": details[t*2 + 1],
            }
            output.append(info)
The block of code after the response.xpath line cleans the raw input into a nicer output. When I test the code outside of Scrapy, feeding it the weird list I showed at the top of the post, I get:
[{'title': 'Director', 'phone': 'Ext: 5442'}, {'title': 'Assistant Library Director', 'phone': 'Ext: 5433'}, {'title': 'Youth Services Librarian', 'phone': '-'}, {'title': 'Technical Services Librarian', 'phone': 'Ext: 2558'}, {'title': 'Outreach Librarian', 'phone': '-'}, {'title': 'FoundationExecutive Director', 'phone': 'Ext: 5456'}]
When I try to put this code into Scrapy's parse(), the log doesn't show any items scraped and I obviously end up with an empty JSON file.
yield is not present in the code above because I have tried multiple ways of adding it and none of them worked. Am I missing a connection between Scrapy's response and yield, or is what I am trying to do not possible, so that I should just extract the weird list and process it outside of Scrapy, like so:
def parse(self, response):
    details_raw = response.xpath('/html/body/div/table/tbody/tr/td/div/div/text()').getall()
    yield {
        'details_in': details_raw
    }
which extracts:
[
{"details_in": ["\r\n", "\r\n", "\r\n", "\r\n \r\n ", "\r\n\t\t\t", "\r\n ", "\r\n ", "\r\n ", "\r\n\t\t\t\r\n\t\t\t", "\r\n \r\n ", "\r\n\t\t\tDirector", "\r\n Ext: 5442", "\r\n ", "\r\n ", "\r\n\t\t\t\r\n\t\t\t", "\r\n \r\n ", "\r\n\t\t\tAssistant Library Director", "\r\n Ext: 5433", "\r\n ", "\r\n ", "\r\n\t\t\t\r\n\t\t\t", "\r\n \r\n ", "\r\n\t\t\tYouth Services Librarian", "\r\n ", "\r\n ", "\r\n ", "\r\n\t\t\t\r\n\t\t\t", "\r\n \r\n ", "\r\n\t\t\tTechnical Services Librarian", "\r\n Ext: 2558", "\r\n ", "\r\n ", "\r\n\t\t\t\r\n\t\t\t", "\r\n \r\n ", "\r\n\t\t\tOutreach Librarian", "\r\n ", "\r\n ", "\r\n ", "\r\n\t\t\t\r\n\t\t\t", "\r\n \r\n ", "\r\n\t\t\tFoundation Executive Director", "\r\n Ext: 5456", "\r\n ", "\r\n ", "\r\n\t\t\t\r\n\t\t\t", "\r\n \r\n", "\r\n", " \u00a0|\u00a0 ", "\r\n"]},
{"details_in": ["\r\n", "\r\n", "\r\n", "\r\n \r\n ", "\r\n\t\t\tBranch Librarian", "\r\n ", "\r\n ", "\r\n ", "\r\n\t\t\t\r\n\t\t\t", "\r\n \r\n", "\r\n", " \u00a0|\u00a0 ", "\r\n"]},
...
...
]
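For what it's worth, the first version never hands anything back to Scrapy: output is built inside parse() but never yielded, so no items are scraped. Below is a minimal, untested sketch of how the pairing loop could yield the items instead; clean_details() is a hypothetical helper standing in for the cleanup code shown in the question.
import scrapy


class LibspiderSpider(scrapy.Spider):
    name = 'libspider'
    allowed_domains = ['wsldata.com']
    start_urls = ['https://wsldata.com/directory/record.cfm?LibID=48']

    def parse(self, response):
        details_raw = response.xpath('/html/body/div/table/tbody/tr/td/div/div/text()').getall()
        # clean_details() is a hypothetical helper that holds the cleanup code from the question
        # and returns the alternating [title, phone, title, phone, ...] list.
        details = self.clean_details(details_raw)
        for t in range(len(details) // 2):
            # Each dict yielded here becomes one scraped item in Scrapy's output.
            yield {
                "title": details[t * 2],
                "phone": details[t * 2 + 1],
            }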
CodePudding user response:
If you want to remove those whitespace-only entries from the list you can use this (instead of regex):
>>> lst=['\r\n',
... '\r\n',
... '\r\n',
... '\r\n \r\n ',
... '\r\n\t\t\t',
... '\r\n ',
... '\r\n ',
... '\r\n ',
... '\r\n\t\t\t\r\n\t\t\t',
... '\r\n \r\n ',
... '\r\n\t\t\tDirector',
... '\r\n Ext: 5442',
... '\r\n ',
... '\r\n ',
... '\r\n\t\t\t\r\n\t\t\t',
... '\r\n \r\n ',
... '\r\n\t\t\tAssistant Library Director',
... '\r\n Ext: 5433',
... '\r\n ',
... '\r\n ',
... '\r\n\t\t\t\r\n\t\t\t',
... '\r\n \r\n ',
... '\r\n\t\t\tYouth Services Librarian',
... '\r\n ',
... '\r\n ',
... '\r\n ',
... '\r\n\t\t\t\r\n\t\t\t',
... '\r\n \r\n ',
... '\r\n\t\t\tTechnical Services Librarian',
... '\r\n Ext: 2558',
... '\r\n ',
... '\r\n ',
... '\r\n\t\t\t\r\n\t\t\t',
... '\r\n \r\n ',
... '\r\n\t\t\tOutreach Librarian',
... '\r\n ',
... '\r\n ',
... '\r\n ',
... '\r\n\t\t\t\r\n\t\t\t',
... '\r\n \r\n ',
... '\r\n\t\t\tFoundation Executive Director',
... '\r\n Ext: 5456',
... '\r\n ',
... '\r\n ',
... '\r\n\t\t\t\r\n\t\t\t',
... '\r\n \r\n',
... '\r\n',
... ' \xa0|\xa0 ',
... '\r\n']
>>> newlst = [i.strip() for i in lst if i.strip()]
>>> newlst
['Director', 'Ext: 5442', 'Assistant Library Director', 'Ext: 5433', 'Youth Services Librarian', 'Technical Services Librarian', 'Ext: 2558', 'Outreach Librarian', 'Foundation Executive Director', 'Ext: 5456', '|']
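Note that dropping the blank entries also drops the placeholder positions, so pairing the cleaned list purely by index would mis-align any title that has no extension (e.g. Youth Services Librarian above). One way to pair them up, sketched here on the assumption that every extension entry starts with 'Ext':
def pair_titles(entries):
    # Pair each title with the 'Ext: ...' entry that follows it, or '-' if there is none.
    items = []
    for i, entry in enumerate(entries):
        if entry.startswith('Ext') or entry == '|':
            continue  # extensions belong to the preceding title; '|' is the trailing separator
        nxt = entries[i + 1] if i + 1 < len(entries) else ''
        items.append({'title': entry, 'phone': nxt if nxt.startswith('Ext') else '-'})
    return items

# pair_titles(newlst) gives:
# [{'title': 'Director', 'phone': 'Ext: 5442'},
#  {'title': 'Assistant Library Director', 'phone': 'Ext: 5433'},
#  {'title': 'Youth Services Librarian', 'phone': '-'},
#  ...]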
You can achieve the result you want by using the correct xpath selectors:
import scrapy


class LibspiderSpider(scrapy.Spider):
    name = 'libspider'
    allowed_domains = ['wsldata.com']
    start_urls = ['https://wsldata.com/directory/record.cfm?LibID=48']

    def parse(self, response):
        details_raw = response.xpath('//div[@]//div[@style="margin:16px 8px;"]')
        if details_raw:
            # Skip the last matched div.
            details_raw = details_raw[:-1]
        for detail in details_raw:
            item = dict()
            item['title'] = detail.xpath('./following-sibling::br[1]/following::text()').get(default='').strip()
            item['phone'] = detail.xpath('./following-sibling::br[2]/following::text()').get(default='-').strip()
            yield item
The xpath selectors look like this because, as you said, it's:
a site that has terrible html formatting
I'm sure you can find other xpath selectors that fit your needs, but these aren't terrible =).
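Either way, keep in mind that items only show up in the output file if parse() actually yields them and the crawl is run with a feed export, for example (output.json is just an example file name):
scrapy crawl libspider -o output.json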