Remove duplicate entries in two different JSON files

I have a large amount of data spread across many JSON files, but they contain many duplicate entries, which are tedious to find by hand. Is there a way to remove them all in one go?

How can I remove duplicate entries from two different JSON files using Python?

First File

[
    {
        "client_id": "1343314236",
        "username": "Nandhu_y_z",
        "name": "nadhu Nandhan"
    },
    {
        "client_id": "1725943170",
        "username": "Nodsfne",
        "name": "Konnengal Z"
    },
    {
        "client_id": "1725943170",
        "username": "Nodsfne",
        "name": "Konnengal Z"
    }
]

Second File


[
    {
        "client_id": "1343314236",
        "username": "Na1dhu_y_z",
        "name": "nadhu Nandhan"
    },
    {
        "client_id": "1725943170",
        "username": "Nodsfne",
        "name": "Konnengal Z"
    },
    {
        "client_id": "1725943170",
        "username": "Nodsfne",
        "name": "Konnengal Z"
    }
]

CodePudding user response：

I created a script to solve your issue. In one folder create a .py file with the following:

import os
import json

for file in os.listdir():
    # Only process JSON files, and skip output written by a previous run.
    if not file.endswith(".json") or file.endswith("_wo_duplicates.json"):
        continue
    with open(file, "r") as json_file:
        users = json.load(json_file)
    # Keep one entry per client_id; a later entry replaces an earlier one.
    unique_users = list({user['client_id']: user for user in users}.values())
    output_name = f"{os.path.splitext(file)[0]}_wo_duplicates.json"
    with open(output_name, "w") as json_with_no_repetition:
        json.dump(unique_users, json_with_no_repetition, indent=4)

Then, put all the files within the folder and run the script.
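
To see what the dictionary comprehension does, here is a minimal sketch that runs it on the records from the first file in the question. The helper name dedupe_by_key is hypothetical and used only for illustration. It assumes that client_id alone identifies a record, so two entries with the same client_id but different usernames would also be collapsed into one.

```python
import json

# Sample records copied from the first file in the question.
users = json.loads("""
[
    {"client_id": "1343314236", "username": "Nandhu_y_z", "name": "nadhu Nandhan"},
    {"client_id": "1725943170", "username": "Nodsfne", "name": "Konnengal Z"},
    {"client_id": "1725943170", "username": "Nodsfne", "name": "Konnengal Z"}
]
""")


def dedupe_by_key(records, key="client_id"):
    """Keep one record per key value; a later record replaces an earlier one."""
    # Hypothetical helper: dict keys are unique, so repeated keys collapse.
    return list({record[key]: record for record in records}.values())


unique_users = dedupe_by_key(users)
print(len(unique_users))  # prints 2
```

The order of the surviving records follows the first appearance of each client_id, because Python dictionaries preserve insertion order.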

As a reminder, you should also try it yourself first and ask questions only when you are stuck on a specific point.

CodePudding user response：

# Suppose we have some JSON files named with numbers from 1 to 500,
# ending with the extension .json, such as 1.json, 2.json, etc.
# The names do not have to be numbers;
# numbers are used here only for convenience.

import json
from pathlib import Path

def load_json(filename):
    # use a context manager so the file is closed after reading
    with open('{0}.json'.format(filename), 'r') as file:
        return json.load(file)

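def remove_duplicate(data1, data2):
    # keep entries of data2 that are neither in data1 nor repeated earlier in data2
    unique = []
    for data in data2:
        if data not in data1 and data not in unique:
            unique.append(data)
    return unique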

def overwrite(filename, data):
    with open('{0}.json'.format(filename), 'w') as file:
        json.dump(data , file)


# start by loading the first file
lists = load_json('1')

# if you also want to remove duplicate data in the first file
lists = [i for n, i in enumerate(lists) if i not in lists[n + 1:]]
# '1' is the filename
overwrite('1', lists)

# now do the same for all the other files
json_path = "./"
for file in Path(json_path).glob('*.json'):
    # get the filename without its extension
    name = file.stem
    # skip file 1, which was already handled above
    if name == '1':
        continue
    # remove entries that already appeared in earlier files
    data = remove_duplicate(lists, load_json(name))
    # then overwrite the file
    overwrite(name, data)
    # add the kept entries to lists to filter the next file
    lists += data
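
Because checking "data not in lists" scans the whole list each time, the approach above can get slow with many large files. Below is a possible alternative sketch that remembers what it has already seen in a set. The names record_key and dedupe_across_files are hypothetical. It assumes every file holds a JSON array of objects, and it treats two records as duplicates only when all of their fields match.

```python
import json
from pathlib import Path


def record_key(record):
    # Hypothetical helper: a canonical string form of a record,
    # so that the order of its fields does not matter.
    return json.dumps(record, sort_keys=True)


def dedupe_across_files(paths):
    """Rewrite each file in order, dropping records already seen
    in the same file or in an earlier one. Returns the number of
    unique records found overall."""
    seen = set()
    for path in paths:
        path = Path(path)
        records = json.loads(path.read_text(encoding="utf-8"))
        kept = []
        for record in records:
            key = record_key(record)
            if key not in seen:
                seen.add(key)
                kept.append(record)
        # Overwrite the file with only the records kept from it.
        path.write_text(json.dumps(kept, indent=4), encoding="utf-8")
    return len(seen)
```

For example, dedupe_across_files(sorted(Path("./").glob("*.json"))) would process every JSON file in the current folder in name order. It overwrites the files in place, so it is safer to try it on copies first. With the two files from the question, the two entries for client_id 1343314236 would both be kept under this definition, because their usernames differ.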