Remove duplicates from List of dynamic objects

Time:12-02

Goal: remove duplicates within each deepest sub-list, and keep everything else.

The list contains multiple entries of the form dict -> dict -> list.

However, one sub-list may contain exactly the same sentence as another sub-list. Those repeats across different sub-lists need to be kept.

set() seems ideal, but I want it applied to the deepest sub-lists, not to the my_list object itself. The structure may change between runs and contain more deeply nested dicts and lists.


I've had many variations of this, but in reality my_list can have any structure.

Is what I want possible if the structure may differ from run to run?

Code:

my_list =  # ...

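for ele in my_list:
    if isinstance(ele, list):
        # note: this only rebinds the loop variable; my_list itself is unchanged
        ele = list(set(ele))
    elif isinstance(ele, dict):
        pass  # unclear how to go deeper from here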

my_list:

e.g. in the 1st PDF, ECON -> awards and ECON -> security each contain the same sentence twice.

[
    {
        "../data/gri/reports/GPIC_Sustainability_Report_2020__-_40_Years_of_Sustainable_Success.pdf": {
            "COMP": {
                "Behaviour": [
                    "we focus apply measures four elements safety culture systems processes skills knowledge individuals behaviours attitudes perception leadership"
                ]
            },
            "ECON": {
                "subsidies": [
                    "meanwhile main recent regulatory impact business significant phasing subsidies gas electricity prices expected continue next years well nationwide strategy allocates natural gas conservatively"
                ],
                "awards": [
                    "ensure robust security 100 readiness times participate international awards rospa bsc awards",
                    "ensure robust security 100 readiness times participate international awards rospa bsc awards"
                ],
                "security": [
                    "ensure robust security 100 readiness times participate international awards rospa bsc awards",
                    "ensure robust security 100 readiness times participate international awards rospa bsc awards"
                ]
            }
        }
    },
    {
        "../data/gri/reports/GRI_2018_Report.pdf": {
            "COMP": {
...

Desired List:

[
    {
        "../data/gri/reports/GPIC_Sustainability_Report_2020__-_40_Years_of_Sustainable_Success.pdf": {
            "COMP": {
                "Behaviour": [
                    "we focus apply measures four elements safety culture systems processes skills knowledge individuals behaviours attitudes perception leadership"
                ]
            },
            "ECON": {
                "subsidies": [
                    "meanwhile main recent regulatory impact business significant phasing subsidies gas electricity prices expected continue next years well nationwide strategy allocates natural gas conservatively"
                ],
                "awards": [
                    "ensure robust security 100 readiness times participate international awards rospa bsc awards"
                ],
                "security": [
                    "ensure robust security 100 readiness times participate international awards rospa bsc awards"
                ]
            }
        }
    },
    {
        "../data/gri/reports/GRI_2018_Report.pdf": {
            "COMP": {
...

Please let me know if I should clarify anything else.

CodePudding user response:

It sounds like the only duplicates you care about are those inside a list of strings, so we can make some assumptions:

  • It's only JSON (lists, dicts, strings and primitives)
  • If we fail to hash an object, then it can't be a duplicate
  • Order of deduped lists doesn't matter
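The last two assumptions can be checked in isolation. A rough sketch, using made-up sample values rather than the question's data: a set of strings collapses duplicates but does not promise any order, while building a set from dicts raises TypeError because dicts are unhashable.

```python
# Illustrative values only; not taken from the question's data.
strings = ["a", "b", "a"]
deduped = list(set(strings))      # duplicates removed, order not guaranteed
print(sorted(deduped))            # sorted only to get a stable display

dicts = [{"k": 1}, {"k": 1}]
try:
    set(dicts)
    hashable = True
except TypeError:
    # dicts (and lists) are unhashable, so set() cannot be built from them
    hashable = False
print(hashable)
```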

So let's use recursion:

def dedup(obj):
    if isinstance(obj, list):
        try:
            # We try to dedupe as if everything is hashable,
            # but this will fail for a list of dicts, so fallback
            # in that case.
            return list({dedup(x) for x in obj})
        except TypeError:
            return [dedup(x) for x in obj]
    elif isinstance(obj, dict):
        return {k: dedup(v) for k, v in obj.items()}
    else:
        # this is some kind of primitive (str/int/float/bool/None)
        return obj
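If the order of the sentences does matter after all, one possible variation is sketched below. It is not part of the answer above: the name dedup_ordered and the shortened sample data are hypothetical, and it relies on dict keys keeping insertion order (Python 3.7+). Duplicates are only removed within a single list, so the same sentence appearing in two different sub-lists is kept in both.

```python
def dedup_ordered(obj):
    """Recursively drop duplicates inside each list, keeping first-seen order."""
    if isinstance(obj, list):
        cleaned = [dedup_ordered(x) for x in obj]
        try:
            # dict keys are unique and keep insertion order
            return list(dict.fromkeys(cleaned))
        except TypeError:
            # unhashable items (dicts, lists): leave this list as it is
            return cleaned
    if isinstance(obj, dict):
        return {k: dedup_ordered(v) for k, v in obj.items()}
    # str/int/float/bool/None
    return obj


# Hypothetical, shortened stand-in for my_list
sample = [
    {
        "report.pdf": {
            "ECON": {
                "awards": ["s1", "s1", "s2"],
                "security": ["s1", "s1"],
            }
        }
    }
]
result = dedup_ordered(sample)
print(result)
```

Here "s1" is reduced to a single entry in each of awards and security, but it still appears in both, which matches the desired output in the question.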