Extract data from Dell Community Forum for a specific date

I want to extract the username, post title, post time, and message content from a Dell Community Forum thread for a particular date and store it in an Excel file.

For example, URL: https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017

I want to extract the post title: "I am getting time sync errror and the last synced time shown as a day in 2015"

And the details (username, post time, message) of the comments posted on 10-25-2022 only:

  1. jraju, 04:20 AM, "This pc is desktop inspiron 3910 model . The dell supplied only this week."
  2. Mary G, 09:10 AM, "Try rebooting the computer and connecting to the internet again to see if that clears it up. Don't forget to run Windows Update to get all the necessary updates on a new computer."
  3. RoHe, 01:00 PM, "You might want to read Fix: Time synchronization failed on Windows 11. Totally ignore the part about downloading the software tool, and scroll down that same page to the part: How to manually sync time on a Windows 11 PC. NOTE: In step #6, if time.windows.com doesn't work, pick a different server from the drop-down menu on that screen."

No other comments should be included.

I'm very new to this.

So far I've only managed to extract the information (without usernames) and without the date filter:

import csv

import requests
from bs4 import BeautifulSoup

url = "https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017"

result = requests.get(url)
doc = BeautifulSoup(result.text, "html.parser")

###### time ######
time = doc.find_all('span', attrs={'class':'local-time'})
print(time)
##################

##### date #######
date = doc.find_all('span', attrs={'class':'local-date'})
print(date)
#################

#### message ######
article_text = ''
article = doc.find_all("div", {"class":"lia-message-body-content"})
for element in article:
    # Concatenate all text nodes from each message body
    article_text += '\n' + ''.join(element.find_all(text=True))

print(article_text)
##################
all_data = []
for t, d, m in zip(time, date, article):
    all_data.append([t.text, d.get_text(strip=True),m.get_text(strip=True, separator='\n')])

with open('data.csv', 'w', newline='', encoding="utf-8") as csvfile:
    writer = csv.writer(csvfile, delimiter=',', quotechar='"', quoting=csv.QUOTE_MINIMAL)
    for row in all_data:
        writer.writerow(row)

CodePudding user response:

It seems to me you have an issue with your selectors, and with the fact that you're searching for them in the global scope (the entire HTML body). My approach would be to narrow things down to 'components' and search inside them:

  1. Locate the div that holds all comments
  2. Search inside it for each comment container
  3. Get the username, date and comment info from each comment container

Here is how you can achieve this:

import requests
from bs4 import BeautifulSoup

url = "https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017"

result = requests.get(url)
soup = BeautifulSoup(result.text, "html.parser")

date = '10-25-2022'
comments = []

# 1. The div that holds all comments
comments_section = soup.find('div', {'class':'lia-component-message-list-detail-with-inline-editors'})
# 2. Each comment container inside it
comments_body = comments_section.find_all('div', {'class':'lia-linear-display-message-view'})
# 3. Pull the username, date and message from each container, keeping only the target date
for comment in comments_body:
    if date in comment.find('span',{'class':'local-date'}).text:
        comments.append({
            'name': comment.find('a',{'class':'lia-user-name-link'}).text,
            'date': comment.find('span',{'class':'local-date'}).text,
            'comment': comment.find('div',{'class':'lia-message-body-content'}).text,
        })

data = {
    "title": soup.find('div', {'class':'lia-message-subject'}).text,
    "comments": comments
}

print(data)

This script generates a JSON-like object that, pretty-printed, looks like this:

{
   "title":"\n\n\n\n\n\t\t\t\t\t\t\tI am getting time sync errror and the last synced time shown as a day in 2015\n\t\t\t\t\t\t\n\n\n\n",
   "comments":[
      {
         "name":"jraju",
         "date":"10-25-2022",
         "comment":"This pc is desktop inspiron 3910 model . The dell supplied only this week."
      },
      {
         "name":"Mary G",
         "date":"10-25-2022",
         "comment":"Try rebooting the computer and connecting to the internet again to see if that clears it up.\\xa0\nDon't forget to run Windows Update to get all the necessary updates on a new computer.\\xa0\n\\xa0"
      },
      {
         "name":"RoHe",
         "date":"10-25-2022",
         "comment":"You might want to read Fix: Time synchronization failed on Windows 11.\nTotally ignore the part about downloading the software tool, and scroll down that same page to the part: How to manually sync time on a Windows 11 PC.\nNOTE: In step #6, if time.windows.com doesn't work, pick a different server from the drop-down menu on that screen.\n\nRon\\xa0\\xa0 Forum Member since 2004\\xa0\\xa0 I'm not a Dell employee"
      }
   ]
}
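
Since your end goal was an Excel file, here is a minimal sketch of writing the comments out to .xlsx (this assumes pandas and openpyxl are installed; data is the dictionary built above):

import pandas as pd

# Flatten the scraped comments into rows; strip stray whitespace from the title
df = pd.DataFrame(data["comments"])
df.insert(0, "title", data["title"].strip())

df.to_excel("comments.xlsx", index=False)  # .xlsx output via the openpyxl engine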

As an engineer at WebScrapingAPI, I can also recommend our tool, which helps prevent detection and makes your scraper more reliable in the long term.

The only thing that needs to change for it to work is the URL you're requesting. In this case, the targeted website becomes a parameter of our API's endpoint. Everything else stays the same.

The url variable would then become:

url = 'https://api.webscrapingapi.com/v1?api_key=<YOUR_API_KEY>&url=https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017'
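
One caveat: the # in the target URL (and anything after it) would be treated as a fragment of the API URL and never sent to the server. Percent-encoding the target before embedding it avoids that. A sketch using the standard library (whether the API accepts an encoded url parameter this way is an assumption):

from urllib.parse import quote

target = "https://www.dell.com/community/Inspiron-Desktops/I-am-getting-time-sync-errror-and-the-last-synced-time-shown-as/m-p/8290678#M36017"
# Encode everything, including : / #, so the target survives as a query value
url = ("https://api.webscrapingapi.com/v1?api_key=<YOUR_API_KEY>&url="
       + quote(target, safe=""))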

CodePudding user response:

You can get the usernames by targeting the lia-component-message-view-widget-author-username class:

username = doc.find_all('span', attrs={'class':'lia-component-message-view-widget-author-username'})

and then include it in all_data, using an if to filter by date:

all_data = []
for u, t, d, m in zip(username, time, date, article):
    # The date text appears to carry one stray leading character, hence [1:]
    if d.get_text(strip=True)[1:] == '10-25-2022':
        all_data.append([
            u.get_text(strip=True),
            t.text,
            d.get_text(strip=True),
            m.get_text(strip=True, separator='\n')
        ])

Btw, a list comprehension is a bit faster than appending in a loop, and you can define and fill all_data in one statement:

all_data = [[
        u.get_text(strip=True), 
        t.text, 
        d.get_text(strip=True),
        m.get_text(strip=True, separator='\n')
    ] for u, t, d, m in zip(username, time, date, article)
    if d.get_text(strip=True)[1:] == '10-25-2022'
]
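
If the stray leading character in the date text ever varies, a slightly more robust filter is to pull the date out with a regular expression before comparing. A sketch, assuming the visible text always contains an MM-DD-YYYY date:

import re
from datetime import datetime

TARGET_DATE = datetime(2022, 10, 25).date()

def comment_date(span_text):
    # Extract an MM-DD-YYYY date from the span text, ignoring stray characters
    match = re.search(r"\d{2}-\d{2}-\d{4}", span_text)
    return datetime.strptime(match.group(), "%m-%d-%Y").date() if match else None

# Then filter with: if comment_date(d.get_text(strip=True)) == TARGET_DATE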