My scraping code skips new lines - Scrapy

Time:12-18

I have this code to scrape review text from IMDB. I want to retrieve the entire text of each review, but the result stops at the first line break. For example, given this review:

Saw an early screening tonight in Denver.

I don't know where to begin. So I will start at the weakest link. The acting. Still great, but any passable actor could have been given any of the major roles and done a great job.

the code only retrieves:

Saw an early screening tonight in Denver.

Here is my code:

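import numpy as np
import pandas as pd
from scrapy.selector import Selector
from selenium.webdriver.common.by import By
from tqdm import tqdm

# `driver` is an already-initialized Selenium WebDriver with the reviews page loaded.
reviews = driver.find_elements(By.CSS_SELECTOR, 'div.review-container')
first_review = reviews[0]
sel2 = Selector(text = first_review.get_attribute('innerHTML'))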

rating_list = []
review_date_list = []
review_title_list = []
author_list = []
review_list = []

error_url_list = []
error_msg_list = []
reviews = driver.find_elements(By.CSS_SELECTOR, 'div.review-container')

for d in tqdm(reviews):
    try:
        # Parse each review container's HTML with a Scrapy Selector.
        sel2 = Selector(text = d.get_attribute('innerHTML'))

        # .get() and .extract_first() return None instead of raising when nothing
        # matches, so a default value replaces the per-field try/except blocks.
        rating = sel2.css('.rating-other-user-rating span::text').extract_first(default=np.nan)
        review = sel2.css('.text.show-more__control::text').get(default=np.nan)
        review_date = sel2.css('.review-date::text').extract_first(default=np.nan)
        author = sel2.css('.display-name-link a::text').extract_first(default=np.nan)
        review_title = sel2.css('a.title::text').extract_first(default=np.nan)

        rating_list.append(rating)
        review_date_list.append(review_date)
        review_title_list.append(review_title)
        author_list.append(author)
        review_list.append(review)

    except Exception as e:
        # `url` is assumed to be defined earlier as the page being scraped.
        error_url_list.append(url)
        error_msg_list.append(e)
review_df = pd.DataFrame({
    'review_date':review_date_list,
    'author':author_list,
    'rating':rating_list,
    'review_title':review_title_list,
    'review':review_list
    })

CodePudding user response:

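.get() returns only the first matching text node, and the review is split into several text nodes wherever the markup breaks the line (for example with <br> tags). Use .extract() (or its alias .getall()) instead of .get() to retrieve all text nodes as a list. Then you can use .join() to concatenate them into a single string.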

review = sel2.css('.text.show-more__control::text').extract()
review = ' '.join(review)

output:

'For a teenager today, Dunkirk must seem even more distant than the Boer War did to my generation growing up just after WW2. For some, Christopher Nolan's film may be the most they will know about the event. But it's enough in some ways because even if it doesn't show everything that happened, maybe it goes as close as a film could to letting you know how it felt. "Dunkirk" focuses on a number of characters who are inside the event, living it ....'
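To see why .get() stops at the first line, it can help to look at how a line break splits the text into separate nodes. The following is only a rough sketch using the standard library's html.parser rather than Scrapy; the TextNodeCollector class and the sample markup are made up for illustration, and the collector gathers all descendant text, so it only approximates what '::text' with .extract() returns.

```python
from html.parser import HTMLParser


class TextNodeCollector(HTMLParser):
    """Collect every non-empty text node found in the markup."""

    def __init__(self):
        super().__init__()
        self.nodes = []

    def handle_data(self, data):
        # Called once per run of text between tags; skip whitespace-only runs.
        if data.strip():
            self.nodes.append(data.strip())


# Hypothetical markup, modeled on the review quoted in the question.
html = (
    '<div class="text show-more__control">'
    "Saw an early screening tonight in Denver.<br><br>"
    "I don't know where to begin."
    "</div>"
)

collector = TextNodeCollector()
collector.feed(html)
collector.close()

# Taking only the first node mirrors what .get() does.
first_only = collector.nodes[0]

# Joining every node mirrors ' '.join(....extract()); "\n" keeps the line breaks.
full_review = "\n".join(collector.nodes)

print(first_only)
print(full_review)
```

Joining with "\n" instead of ' ' is a design choice: it preserves the reviewer's paragraph breaks, which may matter if the text is later split into sentences or paragraphs.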
