How to create a properly formatted JSON file with BeautifulSoup scraped information?

Time:12-05

When I run this script...

from bs4 import BeautifulSoup
import requests
import json

url = "https://stock.adobe.com/search?k=interstellar movie"
page = requests.get(url).text
soup = BeautifulSoup(page, "lxml")

jfile = open('images.json', 'w')

title = soup.find('em', class_='gravel-text').text

for images in soup.find_all('div', class_='thumb-frame'):
    image = images.a['href']

    j = [{'title':title, 'image':image}]

    jstring = json.dumps(j)

    jfile.write(jstring)

jfile.close()

The problem I'm having is that it writes a separate root element for every image, and each one repeats the title.

Formatted output...

[
   {
      "title":"interstellar movie",
      "image":"https://stock.adobe.com/images/gargantua-galaxy-design-graphic-3d-illustration-red-wormhole-or-black-hole-shine-in-space-inspiration-from-interstellar-movie-night-sky-background/90184980"
   }
][
   {
      "title":"interstellar movie",
      "image":"https://stock.adobe.com/images/wanderlust-explorer-discovering-icelandic-natural-wonders/395993532"
   }
][
   {
      "title":"interstellar movie",
      "image":"https://stock.adobe.com/images/panoramic-beautiful-night-sky-and-star-abstract-background-elements-of-this-image-furnished-by-nasa/223412156"
   }
]

I've tried moving the j variable outside the loop and writing the title there, but I can't figure out how to combine the looped variable with it. I'm trying to output the title only once, together with the list of links, so that the output looks like this...

[{
    "title": "interstellar movie", 
    "image": [
        "https://stock.adobe.com/images/gargantua-galaxy-design-graphic-3d-illustration-red-wormhole-or-black-hole-shine-in-space-inspiration-from-interstellar-movie-night-sky-background/90184980",
        "https://stock.adobe.com/images/wanderlust-explorer-discovering-icelandic-natural-wonders/395993532",
        "https://stock.adobe.com/images/panoramic-beautiful-night-sky-and-star-abstract-background-elements-of-this-image-furnished-by-nasa/223412156"
    ]
}]

Thanks in advance :)

CodePudding user response:

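To fix the issue, create j once before the loop with an empty list under the 'image' key, append each URL to that list inside the loop, and convert j to JSON a single time after the loop has finished. Your original code calls json.dumps on every iteration, which is why the file ends up with one root element per image. Here is an example of how your code could be modified: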

from bs4 import BeautifulSoup
import requests
import json

url = "https://stock.adobe.com/search?k=interstellar movie"
page = requests.get(url).text
soup = BeautifulSoup(page, "lxml")

title = soup.find('em', class_='gravel-text').text

# One root element: the title plus an empty list that will hold the image URLs
j = [{'title': title, 'image': []}]

# Loop through the thumbnails and append each URL to the list
for images in soup.find_all('div', class_='thumb-frame'):
    image = images.a['href']
    j[0]['image'].append(image)

# Serialize once, after the loop, so the file contains a single JSON document
with open('images.json', 'w') as jfile:
    json.dump(j, jfile, indent=4)
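
If you want to check the difference without hitting the site, here is a small offline sketch using only the standard library. The two links are copied from your output and stand in for the scraped values. The helper names dump_per_item, dump_once and is_valid_json are made up for this illustration and are not part of any library. It mimics the original per-iteration json.dumps call and the single call after the loop, then checks which result json.loads accepts.

```python
import json

# Stand-ins for the scraped values so the sketch runs without a network call.
title = "interstellar movie"
links = [
    "https://stock.adobe.com/images/wanderlust-explorer-discovering-icelandic-natural-wonders/395993532",
    "https://stock.adobe.com/images/panoramic-beautiful-night-sky-and-star-abstract-background-elements-of-this-image-furnished-by-nasa/223412156",
]


def dump_per_item(title, links):
    """Mimic the original loop: one json.dumps call per link, concatenated."""
    return "".join(json.dumps([{"title": title, "image": link}]) for link in links)


def dump_once(title, links):
    """Collect every link first, then serialize a single root element."""
    return json.dumps([{"title": title, "image": list(links)}], indent=4)


def is_valid_json(text):
    """Return True if text parses as exactly one JSON document."""
    try:
        json.loads(text)
    except json.JSONDecodeError:
        return False
    return True


broken = dump_per_item(title, links)
fixed = dump_once(title, links)

print(is_valid_json(broken))  # False: "[...][...]" is two documents back to back
print(is_valid_json(fixed))   # True: one list holding one object
```

Note that the per-item version only breaks once there are two or more links. With a single result it happens to produce valid JSON, which can hide the bug during a quick test.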