How do I scrape only image posts from 9gag

Time:05-28

I want to scrape the first image post and blacklist its URL, so that the next search skips URLs that were already used and moves on to the next image post. I tried the following to find the first image, but it doesn't work.

driver = webdriver.Chrome()
driver.get('https://9gag.com/funny')
time.sleep(2)
driver.find_element(By.XPATH, value='//*[@id="qc-cmp2-ui"]/div[2]/div/button[1]/span').click()
time.sleep(2)
gagpost = driver.find_element(By.CSS_SELECTOR,value=".image-post img")
gagpostsurl = gagpost.get_attribute('src')
gagposttitle = gagpost.get_attribute('alt')
print(gagpostsurl)
print(gagposttitle)

error:
Traceback (most recent call last):
  File "C:\Users\klaus\PycharmProjects\testTEST\main.py", line 37, in <module>
    gagposttitle = gagpost.find_element(By,value='img').get_attribute('alt')
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webelement.py", line 763, in find_element
    return self._execute(Command.FIND_CHILD_ELEMENT,
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webelement.py", line 740, in _execute
    return self.parent.execute(command, params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\webdriver.py", line 428, in execute
    response = self.command_executor.execute(driver_command, params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\remote_connection.py", line 345, in execute
    data = utils.dump_json(params)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\site-packages\selenium\webdriver\remote\utils.py", line 23, in dump_json
    return json.dumps(json_struct)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\__init__.py", line 231, in dumps
    return _default_encoder.encode(obj)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 199, in encode
    chunks = self.iterencode(o, _one_shot=True)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 257, in iterencode
    return _iterencode(o, 0)
  File "C:\Users\klaus\AppData\Local\Programs\Python\Python310\lib\json\encoder.py", line 179, in default
    raise TypeError(f'Object of type {o.__class__.__name__} '
TypeError: Object of type type is not JSON serializable

Process finished with exit code 1

I also tried this; sometimes it worked and sometimes it didn't.

driver = webdriver.Chrome()

driver.get('https://9gag.com/funny')
time.sleep(2)
driver.find_element(By.XPATH, value='//*[@id="qc-cmp2-ui"]/div[2]/div/button[1]/span').click()
time.sleep(2)
gagpost = driver.find_element(By.CSS_SELECTOR,value=".image-post img")
gagpostsurl = gagpost.get_attribute('src')
gagposttitle = gagpost.get_attribute('alt')
print(gagpostsurl)
print(gagposttitle)

I would appreciate any help.

CodePudding user response:
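
The traceback in the question seems to come from `gagpost.find_element(By,value='img')`: the `By` class itself is passed as the locator strategy instead of one of its members such as `By.TAG_NAME`, so Selenium ends up trying to JSON-encode a class. Here is a minimal sketch of that failure. It uses a stand-in `By` class rather than Selenium's own, and the `try_serialize` helper and the payload shape are assumptions made for illustration:

```python
import json


class By:
    """Stand-in for selenium.webdriver.common.by.By, for illustration only."""
    TAG_NAME = "tag name"


def try_serialize(strategy):
    """Return the JSON payload, or the error message if serialization fails."""
    try:
        return json.dumps({"using": strategy, "value": "img"})
    except TypeError as exc:
        return str(exc)


# Passing the class itself reproduces the message from the traceback.
print(try_serialize(By))
# Passing one of its string members serializes without trouble.
print(try_serialize(By.TAG_NAME))
```

In the question's code, the likely fix for that line is `gagpost.find_element(By.TAG_NAME, 'img')`, assuming `gagpost` was the post container at that point.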

You can achieve this like so:

from selenium.common.exceptions import NoSuchElementException
from selenium.webdriver.common.by import By
...
# Get the feed element
feed = driver.find_element(By.CSS_SELECTOR, "div.main-wrap section#list-view-2")
# Get the streams from the feed
streams = feed.find_elements(By.CLASS_NAME, "list-stream")
# Debug number of streams
print(f"Streams: {len(streams)}")
# Iterate over each stream
for stream in streams:
    # Find articles within the stream; these are the 'posts'
    articles = stream.find_elements(By.TAG_NAME, "article")
    # Debug number of articles
    print(f"Articles: {len(articles)}")
    # Iterate over each article
    for article in articles:
        # Try/except here because some articles are adverts, these are skipped
        try:
            # Find the article title
            title = article.find_element(By.CSS_SELECTOR, "header > a")
        except NoSuchElementException:
            continue
        # Print the article title
        print(f"Title: {title.text}")

This prints out

Streams: 1
Articles: 3
Title: Hahahahaha Git Gud
Title: How to impress your guests

This doesn't print every post on the page because the posts are loaded lazily: they are fetched from the server as you scroll. To load them, you will need to add scrolling to the code above. Luckily, Python Selenium's documentation has an example for this particular case. You can also refer to a previous answer of mine for how the implementation might look.
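
Here is a minimal sketch of such scrolling, assuming the feed increases the document height as new posts arrive. The helper name `scroll_to_load_posts` and its parameters are hypothetical; the only Selenium call it relies on is `driver.execute_script`:

```python
import time


def scroll_to_load_posts(driver, pause_seconds=2.0, max_scrolls=10):
    """Scroll to the bottom repeatedly until the page height stops growing.

    Returns the number of scrolls performed. max_scrolls caps the loop,
    because an endless feed may never stop growing.
    """
    last_height = driver.execute_script("return document.body.scrollHeight")
    scrolls = 0
    while scrolls < max_scrolls:
        # Jump to the current bottom of the page to trigger lazy loading.
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        # Give the page time to fetch and render the next batch of posts.
        time.sleep(pause_seconds)
        new_height = driver.execute_script("return document.body.scrollHeight")
        scrolls += 1
        if new_height == last_height:
            # Nothing new was appended, so stop scrolling.
            break
        last_height = new_height
    return scrolls
```

Calling `scroll_to_load_posts(driver)` before locating the feed element should leave more articles in the DOM for the loops above to visit. A fixed pause is the simplest choice here, though an explicit wait may be more reliable.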

I have only added enough code to get the title. You can extract the rest of the information you need from the article variable inside the inner loop.
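
For the blacklist part of the question, one possible approach is to keep a set of URLs that were already used and return the first candidate that is not in it. The sketch below is hedged accordingly: `first_unseen_post` is a hypothetical helper, and the example URLs and titles are placeholders rather than real 9gag data:

```python
def first_unseen_post(posts, seen_urls):
    """Return the first (url, title) pair whose URL is not blacklisted.

    The chosen URL is added to seen_urls so the next call skips it.
    Returns None when every candidate has already been used.
    """
    for url, title in posts:
        # Skip entries without a URL as well as URLs used before.
        if url and url not in seen_urls:
            seen_urls.add(url)
            return url, title
    return None


# Placeholder data standing in for values scraped from the page.
posts = [
    ("https://example.com/a.jpg", "First post"),
    ("https://example.com/b.jpg", "Second post"),
]
seen_urls = set()
print(first_unseen_post(posts, seen_urls))  # the first pair
print(first_unseen_post(posts, seen_urls))  # the second pair
print(first_unseen_post(posts, seen_urls))  # None, everything was used
```

The `posts` list could be built inside the inner loop from `img.get_attribute('src')` and `img.get_attribute('alt')` for each `img` returned by `article.find_elements(By.CSS_SELECTOR, ".image-post img")`, assuming the selector from the question still matches 9gag's markup. The set lives only in memory, so it would have to be saved to a file to survive between runs.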
