Scrapy - Remove comma and whitespace from getall() results

Is there an effective way to remove the commas from the results returned by getall()? As an example, the data I'm trying to retrieve is in this format:

<div>
Text 1
<br>
Text 2
<br>
Text 3
</div>

My current selector for this is:

response.xpath("//div//text()").getall()

This does get the correct data, but it comes out as:

Text 1,
Text 2,
Text 3

instead of

Text 1
Text 2
Text 3

I understand that the results are treated as a list, which is the reason for the commas. Is there a direct function to remove them without affecting the commas in the text itself?

CodePudding user response：

As long as you are sure that any comma appearing at the end of each text fragment is unwanted, you can use rstrip, which strips only the characters you specify from the right end of a string.

This is a built-in method on all strings in Python. Note that getall() always returns a list of strings (an empty list when nothing matches), so the None check below is only a safeguard. get(), however, returns None when it can't find the element on the parsed page, and calling rstrip on None raises an AttributeError, so I suggest some minimal error checking before applying it.

For example:

string_list = response.xpath("//div//text()").getall()
clean_list = [i.rstrip(',') if i is not None else i for i in string_list]

You should do the same if you are using get() as well.

elem = response.xpath(...).get()
if elem:
    elem = elem.rstrip(',')
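
As a rough sketch of the cleanup above without a live response: the list below stands in for what getall() might return for the markup in the question (the fragment values and the helper name clean_fragments are assumptions for illustration). It strips whitespace, drops empty fragments, and removes a trailing comma only where one is really part of the text.

```python
def clean_fragments(fragments):
    """Strip whitespace and any trailing comma from each text fragment."""
    cleaned = []
    for fragment in fragments:
        if fragment is None:  # defensive; getall() itself yields only strings
            continue
        text = fragment.strip().rstrip(",")
        if text:  # skip whitespace-only nodes
            cleaned.append(text)
    return cleaned


# Hypothetical stand-in for response.xpath("//div//text()").getall()
fragments = ["\nText 1\n", "\nText 2,\n", "\n", "\nText 3\n"]
print(clean_fragments(fragments))
```

Commas inside a fragment are left alone, since rstrip only looks at the right end of the string.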

Update:

I think the issue you are facing could be more easily fixed with a different strategy for crawling the page.

For example:

import scrapy


class ASpider(scrapy.Spider):
    name = 'temp'
    start_urls = ['https://www.citiworldprivileges.com/hk-hong_kong/hotel_dining/10_off-340391']

    def parse(self, response):
        tnc = response.xpath('//div[@]//text()').getall()
        tnc = [i.strip() for i in tnc]
        yield {'terms': tnc}

And this is the csv output:

terms
"1. Advance booking is required.,
2. Offer is not applicable on Public Holidays and Eves, Valentine’s day, Mother’s Day, Father’s Day, Mid-Autumn Festival, Winter Solstice, Christmas Day and Eve.,
3. Offer is valid for dine-in at Café@WM only.,
4. Offer is not applicable for cigarette, bottle wine, special promotion, special occasion package and happy hour period.,
5. Offer is not applicable to 10% service charge, the 10% service charge is based on original price.,
6. Present valid Citi Credit Card/Debit Card/Corporate Card upon billing, void if altered or copied.,
7. Offer cannot be used in conjunction with other promotional offers or VIP discount.,
8. In case of any disputes, WM Hotel reserves the right of the final decision.,
9. The offers apply to holders of Citi Credit Cards and Debit Cards issued by Citibank (Hong Kong) Limited and other Citi entities.,
10. General terms and conditions apply. Please visit citibank.hk/yro2022tnc for details."

Alternate method:

def parse(self, response):
    for i in response.xpath('//div[@]//text()').getall():
        yield {"term": i.strip()}

csv output:

term
1. Advance booking is required.
"2. Offer is not applicable on Public Holidays and Eves, Valentine’s day, Mother’s Day, Father’s Day, Mid-Autumn Festival, Winter Solstice, Christmas Day and Eve."
3. Offer is valid for dine-in at Café@WM only.
"4. Offer is not applicable for cigarette, bottle wine, special promotion, special occasion package and happy hour period."
"5. Offer is not applicable to 10% service charge, the 10% service charge is based on original price."
"6. Present valid Citi Credit Card/Debit Card/Corporate Card upon billing, void if altered or copied."
7. Offer cannot be used in conjunction with other promotional offers or VIP discount.
"8. In case of any disputes, WM Hotel reserves the right of the final decision."
9. The offers apply to holders of Citi Credit Cards and Debit Cards issued by Citibank (Hong Kong) Limited and other Citi entities.
10. General terms and conditions apply. Please visit citibank.hk/yro2022tnc for details.
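
To see why the per-item output above quotes some rows and not others, here is a small sketch using Python's standard csv module rather than Scrapy's exporter (an assumption made so that it runs on its own; the helper name terms_to_csv and the sample terms are illustrative). A field is quoted only when it contains the delimiter, so commas inside a term stay part of the text.

```python
import csv
import io


def terms_to_csv(terms):
    """Write one CSV row per term, mirroring one yielded item per term."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["term"])  # header row
    for term in terms:
        writer.writerow([term.strip()])
    return buffer.getvalue()


sample_terms = [
    " 1. Advance booking is required. ",
    " 8. In case of any disputes, WM Hotel reserves the right of the final decision. ",
]
print(terms_to_csv(sample_terms))
```

Any CSV reader will unquote those rows again, so the quotes never end up in the parsed values.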

One last alternative would be to replace the commas in the webpage text with a different delimiter, such as ';' or '_'. Then the only commas left will be those separating the lines in the output.

For example:

def parse(self, response):
    terms = response.xpath('//div[@]//text()').getall()
    yield {"terms": [i.strip().replace(',', '_') for i in terms]}

And the csv output is:

terms
"1. Advance booking is required.,2. Offer is not applicable on Public Holidays and Eves_ Valentine’s day_ Mother’s Day_ Father’s Day_ Mid-Autumn Festival_ Winter Solstice_ Christmas Day and Eve.,3. Offer is valid for dine-in at Café@WM only.,4. Offer is not applicable for cigarette_ bottle wine_ special promotion_ special occasion package and happy hour period.,5. Offer is not applicable to 10% service charge_ the 10% service charge is based on original price.,6. Present valid Citi Credit Card/Debit Card/Corporate Card upon billing_ void if altered or copied.,7. Offer cannot be used in conjunction with other promotional offers or VIP discount.,8. In case of any disputes_ WM Hotel reserves the right of the final decision.,9. The offers apply to holders of Citi Credit Cards and Debit Cards issued by Citibank (Hong Kong) Limited and other Citi entities.,10. General terms and conditions apply. Please visit citibank.hk/yro2022tnc for details."

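A minimal sketch of that substitution on plain strings instead of a live response (the helper name join_terms and the sample terms are assumptions). As far as I know, Scrapy's CSV exporter joins list values with a comma by default, which the sketch imitates with ','.join. Once the inner commas are replaced, splitting the cell on ',' gives back one piece per term.

```python
def join_terms(terms, inner_delimiter="_"):
    """Replace commas inside each term, then join the terms with commas."""
    cleaned = [t.strip().replace(",", inner_delimiter) for t in terms]
    return ",".join(cleaned)


sample_terms = [
    "1. Advance booking is required.",
    "8. In case of any disputes, WM Hotel reserves the right of the final decision.",
]
cell = join_terms(sample_terms)
print(cell)
print(len(cell.split(",")))  # one piece per original term
```

The trade-off is that the original punctuation is altered, so this only suits cases where the text is post-processed anyway.
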
CodePudding user response：

I'm just going to leave the solution I used in case someone needs it:

tc = response.xpath("//div//text()").getall()  # XPath selector; returns a list of strings
tcl = "".join(tc)  # join the list into a single string
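
For comparison, a small sketch of the join on a stand-in list (the fragment values are assumed from the markup in the question). An empty separator keeps whatever whitespace the fragments carry, whereas stripping each fragment and joining with a newline gives the one-item-per-line form the question asked for.

```python
# Hypothetical stand-in for response.xpath("//div//text()").getall()
tc = ["\nText 1\n", "\nText 2\n", "\nText 3\n"]

# Joining with an empty string keeps the original newlines between items.
joined_raw = "".join(tc)

# Stripping first and joining with "\n" yields exactly one item per line.
joined_lines = "\n".join(t.strip() for t in tc if t.strip())

print(repr(joined_raw))
print(joined_lines)
```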