How to save a list of lists as a JSON file in Python?

I'm trying to parse data from a website using BeautifulSoup in Python. I managed to pull the data, and now I want to save it to a JSON file, but with the code I wrote it is saved as follows:

json file

[
    {
        "collocation": "\nabove average",
        "meaning": "more than average, esp. in amount, age, height, weight etc. "
    },
    {
        "collocation": "\nabsolutely necessary",
        "meaning": "totally or completely necessary"
    },
    {
        "collocation": "\nabuse drugs",
        "meaning": "to use drugs in a way that's harmful to yourself or others"
    },
    {
        "collocation": "\nabuse of power",
        "meaning": "the harmful or unethical use of power"
    },
    {
        "collocation": "\naccept (a) defeat",
        "meaning": "to accept the fact that you didn't win a game, match, contest, election, etc."
    },

my code:

import requests
from bs4 import BeautifulSoup
from selenium import webdriver
import pandas as pd
import json


url = "https://www.englishclub.com/ref/Collocations/"

mylist = [
        "A",
        "B",
        "C",
        "D",
        "E",
        "F",
        "G",
        "H",
        "I",
        "J",
        "K",
        "L",
        "M",
        "N",
        "O",
        "P",
        "Q",
        "R",
        "S",
        "T",
        "U",
        "V",
        "W"
]


list = []


for i in range(23):
    # NOTE: `headers` is not defined in the code shown; it must be set beforehand
    result = requests.get(url + mylist[i] + "/", headers=headers)
    doc = BeautifulSoup(result.text, "html.parser")
    collocations = doc.find_all(class_="linklisting")

    for tag in collocations:
            case = {
                    "collocation": tag.a.string,
                    "meaning": tag.div.string
            }
            list.append(case)


with open('data.json', 'w', encoding='utf-8') as f:

    json.dump(list, f, ensure_ascii=False, indent=4)

However, I want a separate list for each letter, for example one list for A and another for B, so that I can easily find the collocations that start with a given letter and use them. How can I do that? Also, as you can see in the JSON file, there is always a \n at the beginning of each collocation. How can I remove it?

CodePudding user response：

import requests
from bs4 import BeautifulSoup
import pandas as pd
import json


url = "https://www.englishclub.com/ref/Collocations/"

mylist = [
        "A",
        "B",
        "C",
        "D",
        "E",
        "F",
        "G",
        "H",
        "I",
        "J",
        "K",
        "L",
        "M",
        "N",
        "O",
        "P",
        "Q",
        "R",
        "S",
        "T",
        "U",
        "V",
        "W"
]

# A dictionary keyed by letter suits your needs better than a flat list.
# It is named `data` rather than `list` so the built-in list type is not shadowed.
data = {}

# Just for quick testing the range is set to 4; use len(mylist) to fetch every letter.
for i in range(4):
    data[mylist[i]] = []  # empty list that will hold this letter's collocations

    result = requests.get(url + mylist[i] + "/")
    doc = BeautifulSoup(result.text, "html.parser")
    collocations = doc.find_all(class_="linklisting")

    for tag in collocations:
        case = {
            "collocation": tag.a.string.replace("\n", ""),  # remove the leading "\n"
            "meaning": tag.div.string
        }
        data[mylist[i]].append(case)  # add the collocation to its letter's list


with open('data.json', 'w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

I have commented the parts I changed. The dictionary now holds one list per letter, so later you can fetch a letter's collocations by its key without worrying about indexes.

This is the output:

{
    "A": [
        {
            "collocation": "above average",
            "meaning": "more than average, esp. in amount, age, height, weight etc. "
        },
        {
            "collocation": "absolutely necessary",
            "meaning": "totally or completely necessary"
        }
    ],
    "B": [
        {
            "collocation": "back pay",
            "meaning": "money a worker earned in the past but hasn't been paid yet  "
        },
        {
            "collocation": "back road",
            "meaning": "a small country road "
        },
        {
            "collocation": "back street",
            "meaning": "a street in a town or city that's away from major roads or central areas"
        }
    ],
    "C": [
        {
            "collocation": "call a meeting",
            "meaning": "to order or invite people to hold a meeting"
        },
        {
            "collocation": "call a name",
            "meaning": "to say somebody's name loudly"
        },
        {
            "collocation": "call a strike",
            "meaning": "to decide that workers will protest by not going to work "
        }
    ],
    "D": [
        {
            "collocation": "daily life",
            "meaning": "life as experienced from day to day"
        },
        {
            "collocation": "dead ahead",
            "meaning": "straight ahead"
        },
        {
            "collocation": "dead body",
            "meaning": "corpse, or the body of someone who's died"
        }
    ]
}
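
If you already have the flat list from the question saved, you could also regroup it afterwards instead of scraping again. The following is only a minimal sketch: the function name group_by_letter is made up for illustration, and it assumes every record has "collocation" and "meaning" string values like the ones shown in the question.

```python
import json
from collections import defaultdict


def group_by_letter(entries):
    """Group flat collocation records by their upper-cased first letter.

    Hypothetical helper, not part of the original answer.
    """
    grouped = defaultdict(list)
    for entry in entries:
        # strip() removes the leading "\n" and any trailing spaces
        collocation = entry["collocation"].strip()
        meaning = entry["meaning"].strip()
        if not collocation:
            continue  # skip records with no usable text
        grouped[collocation[0].upper()].append(
            {"collocation": collocation, "meaning": meaning}
        )
    return dict(grouped)


# Sample records in the same shape as the flat output in the question
flat = [
    {"collocation": "\nabove average",
     "meaning": "more than average, esp. in amount, age, height, weight etc. "},
    {"collocation": "\nabuse drugs",
     "meaning": "to use drugs in a way that's harmful to yourself or others"},
    {"collocation": "\nback pay",
     "meaning": "money a worker earned in the past but hasn't been paid yet  "},
]

grouped = group_by_letter(flat)

# Look up one letter by key, no index arithmetic needed
print(json.dumps(grouped["A"], ensure_ascii=False, indent=4))
print(len(grouped["B"]))
```

Using strip() here instead of replace("\n", "") is a design choice: it also trims the trailing spaces that some of the meanings carry.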

CodePudding user response：

In your loop, after you define doc, try the following:

for col in doc.select('div.linklisting'):
    print(col.select_one('h3 a').text.strip(), "--", col.select_one('div.linkdescription').text)

For the letter B, for example, it should output:

back pay -- money a worker earned in the past but hasn't been paid yet  
back road -- a small country road 
back street -- a street in a town or city that's away from major roads or central areas

And so on. You can write the output elements to a CSV file, load them into a DataFrame, or store them however you like.
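
As a rough sketch of the CSV route, using only the standard library: the helper names grouped_to_rows and write_csv and the column names are assumptions made for illustration, and the input is assumed to be a dictionary of lists keyed by letter, as in the other answer.

```python
import csv
import io


def grouped_to_rows(grouped):
    """Flatten {letter: [records]} into (letter, collocation, meaning) rows."""
    rows = []
    for letter, records in grouped.items():
        for record in records:
            rows.append(
                (letter, record["collocation"].strip(), record["meaning"].strip())
            )
    return rows


def write_csv(rows, handle):
    """Write a header line plus the rows to any open text handle."""
    writer = csv.writer(handle)
    writer.writerow(["letter", "collocation", "meaning"])
    writer.writerows(rows)


# Sample data in the per-letter shape
sample = {
    "B": [
        {"collocation": "back road", "meaning": "a small country road "},
        {"collocation": "back street",
         "meaning": "a street in a town or city that's away from major roads or central areas"},
    ]
}

# An in-memory buffer keeps the example self-contained
buffer = io.StringIO()
write_csv(grouped_to_rows(sample), buffer)
print(buffer.getvalue())
```

To write a real file, pass a handle opened with open("collocations.csv", "w", newline="", encoding="utf-8") instead of the buffer; the file name is just an example.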