Deleting duplicates from List of dict elements (created from Twitter json objects)


I have downloaded Twitter user objects.

This is an example of one object:

{
    "id": 6253282,
    "id_str": "6253282",
    "name": "Twitter API",
    "screen_name": "TwitterAPI",
    "location": "San Francisco, CA",
    "profile_location": null,
    "description": "The Real Twitter API. Tweets about API changes, service issues and our Developer Platform. Don't get an answer? It's on my website.",
    "url": "https:\/\/t.co\/8IkCzCDr19",
    "entities": {
        "url": {
            "urls": [{
                "url": "https:\/\/t.co\/8IkCzCDr19",
                "expanded_url": "https:\/\/developer.twitter.com",
                "display_url": "developer.twitter.com",
                "indices": [
                    0,
                    23
                ]
            }]
        },
        "description": {
            "urls": []
        }
    },
    "protected": false,
    "followers_count": 6133636,
    "friends_count": 12,
    "listed_count": 12936,
    "created_at": "Wed May 23 06:01:13  0000 2007",
    "favourites_count": 31,
    "utc_offset": null,
    "time_zone": null,
    "geo_enabled": null,
    "verified": true,
    "statuses_count": 3656,
    "lang": null,
    "contributors_enabled": null,
    "is_translator": null,
    "is_translation_enabled": null,
    "profile_background_color": null,
    "profile_background_image_url": null,
    "profile_background_image_url_https": null,
    "profile_background_tile": null,
    "profile_image_url": null,
    "profile_image_url_https": "https:\/\/pbs.twimg.com\/profile_images\/942858479592554497\/BbazLO9L_normal.jpg",
    "profile_banner_url": null,
    "profile_link_color": null,
    "profile_sidebar_border_color": null,
    "profile_sidebar_fill_color": null,
    "profile_text_color": null,
    "profile_use_background_image": null,
    "has_extended_profile": null,
    "default_profile": false,
    "default_profile_image": false,
    "following": null,
    "follow_request_sent": null,
    "notifications": null,
    "translator_type": null
}

But somehow it has many duplicates; maybe the input file had duplicated values.

This is the pattern of the downloaded Twitter file, which I named rawjson:

{ user-object }{ user-object }{ user-object }

So I ended up with a 16 GB file of users with repeated values. I need to delete the duplicated users.

This is what I have done so far:

def twitterToListJsonMethodTwo(self, rawjson, twitterToListJson):
    # Delete the old output file if it exists (needs: import os)
    if os.path.exists(twitterToListJson):
        try:
            os.remove(twitterToListJson)
        except OSError:
            pass
    counter = 1
    objc = 1
    with open(rawjson, encoding='utf8') as fin, open(twitterToListJson, 'w', encoding='utf8') as fout:
        for line in fin:
            # A line of exactly "}{" (plus newline) separates two objects
            if line.find('}{') != -1 and len(line) == 3:
                objc = objc + 1
                fout.write(line.replace('}{', '},\n{'))
            else:
                fout.write(line)
            counter = counter + 1
            # print(counter)
        print("Process Complete: Twitter object to Total lines: ", counter)

        self.twitterToListJsonMethodOne(twitterToListJson)

and the output sample file now looks like this:

[
    {user-object},
    {user-object},
    {user-object} 
]

Each user-object is a dict, but I cannot find a way to remove the duplicates; all of the tutorials/solutions I have found are for small objects and small lists. I am not very good with Python, but I need an optimal solution, as the file size is too big and memory could be a problem.

Each user-object is like the one shown above, with a unique id and screen_name.

CodePudding user response:

To process huge JSON datasets, especially long lists of objects, it is better to use JSON streaming from https://github.com/daggaz/json-stream to read the user objects one by one, then add each one to your results only if that user has not been encountered before.

Example:

import json_stream

unique_users = []
seen_users = set()
with open('input.json') as f:
    js = json_stream.load(f)      # lazily streams the top-level list
    for us in js:
        user = dict(us.items())   # render the current object's fields
        if user['id'] not in seen_users:
            unique_users.append(user)
            seen_users.add(user['id'])

The reason for user = dict(us.items()) is that if we go looking for the id in the object via the stream, we can't backtrack to get the whole object any more. So we need to "render" out every user object first and then check its id.
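
If holding every unique user in memory is itself a concern for a 16 GB input, here is a variant sketch that writes each new user straight to an output file instead of keeping a list. It assumes json-stream's to_standard_types helper, which renders a streamed object, including its nested values, into plain Python dicts and lists; file names are placeholders:

import json
import json_stream

seen_users = set()
with open('input.json', encoding='utf8') as fin, \
        open('unique_users.json', 'w', encoding='utf8') as fout:
    fout.write('[\n')
    first = True
    for us in json_stream.load(fin):
        # Fully render the current object; dict(us.items()) alone would
        # leave nested values such as "entities" as transient stream objects.
        user = json_stream.to_standard_types(us)
        if user['id'] in seen_users:
            continue
        seen_users.add(user['id'])
        if not first:
            fout.write(',\n')
        json.dump(user, fout)
        first = False
    fout.write('\n]')

This keeps only the set of ids in memory, which stays small even for millions of users.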

CodePudding user response:

You could modify a merge sort to drop duplicates during the merge, giving O(n log n) overall.
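
A minimal sketch of that idea, assuming the users have first been flattened to one JSON object per line (JSON Lines) so the file can be sorted in chunks; the file names, chunk size, and helper name are illustrative only:

import heapq
import itertools
import json
import os
import tempfile

def dedupe_by_external_sort(infile, outfile, chunk_size=100_000):
    def user_id(line):
        return json.loads(line)['id']

    # Phase 1: sort chunks of the file by user id and spill each sorted
    # run to a temporary file. Memory use is bounded by chunk_size lines.
    runs = []
    with open(infile, encoding='utf8') as fin:
        while True:
            lines = [l if l.endswith('\n') else l + '\n'
                     for l in itertools.islice(fin, chunk_size)]
            if not lines:
                break
            lines.sort(key=user_id)
            tmp = tempfile.NamedTemporaryFile('w', delete=False, encoding='utf8')
            tmp.writelines(lines)
            tmp.close()
            runs.append(tmp.name)

    # Phase 2: k-way merge the sorted runs. Duplicates now arrive next to
    # each other, so a single "previous id" variable is enough to drop them.
    files = [open(name, encoding='utf8') for name in runs]
    try:
        last_id = None
        with open(outfile, 'w', encoding='utf8') as fout:
            for line in heapq.merge(*files, key=user_id):
                uid = user_id(line)
                if uid != last_id:
                    fout.write(line)
                    last_id = uid
    finally:
        for f in files:
            f.close()
        for name in runs:
            os.remove(name)

Peak memory is bounded by the chunk size rather than by the file size, at the cost of re-parsing each line to extract its id.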

CodePudding user response:

Use ijson as it is used here:
Create a set that will hold the item ids. If an item's id is already in the set, drop the item; otherwise, collect it.
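
A minimal sketch of that approach, assuming the file has already been converted to a single JSON array as in the question; file names are placeholders:

import json
import ijson

seen = set()
with open('twitterToListJson.json', 'rb') as fin, \
        open('unique_users.json', 'w', encoding='utf8') as fout:
    fout.write('[\n')
    first = True
    # ijson.items() streams the elements of the top-level array one by
    # one, so the whole 16 GB file is never held in memory at once.
    for user in ijson.items(fin, 'item'):
        if user['id'] not in seen:
            seen.add(user['id'])
            fout.write(('' if first else ',\n') + json.dumps(user))
            first = False
    fout.write('\n]')

One caveat: ijson parses non-integer numbers as decimal.Decimal, which json.dumps cannot serialize; the fields in these user objects are all integers, strings, booleans, or nulls, so that is not an issue here.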

CodePudding user response:

Convert each dictionary into a tuple of its items with the dict items() method, turning the list of dictionaries into a list of tuples. Because tuples are hashable, you can then run set() on them to get rid of duplicates. Sample code would be:

data = (tuple(d.items()) for d in twitter_data)

This solves the issue of duplicate dictionaries when they are identical in every key-value pair. Note that it only works if every value is itself hashable; nested dicts such as entities would need to be flattened or serialized first.
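
For flat dictionaries, a minimal sketch of the full round trip (twitter_data is assumed to be the already-loaded list of user dicts):

# Only valid while every value is hashable (no nested dicts).
unique_tuples = set(tuple(d.items()) for d in twitter_data)
unique_users = [dict(t) for t in unique_tuples]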
