Goal: remove duplicates from within the same deepest sub-list, and keep everything else.
The list contains multiple levels of nesting: dict -> dict -> list.
However, one sub-list may contain the exact same sentence as a different sub-list. Those duplicates across sub-lists need to be kept.
set() seems ideal, but I want it applied only to the deepest sub-lists, not to the my_list object itself. The structure may change and have deeper dicts and lists in different runs.
Is what I want possible, if the structure may be different?
Code:
I've had many variations of this, but in reality my_list can have any structure.
my_list = ...  # structure shown below

for ele in my_list:
    if isinstance(ele, list):
        ele = list(set(ele))  # only rebinds ele; my_list itself is unchanged
    elif isinstance(ele, dict):
        ...  # how do I keep descending into nested dicts and lists?
my_list:
e.g. 1st PDF -> ECON -> awards and 1st PDF -> ECON -> security contain the same duplicates.
[
{
"../data/gri/reports/GPIC_Sustainability_Report_2020__-_40_Years_of_Sustainable_Success.pdf": {
"COMP": {
"Behaviour": [
"we focus apply measures four elements safety culture systems processes skills knowledge individuals behaviours attitudes perception leadership"
]
},
"ECON": {
"subsidies": [
"meanwhile main recent regulatory impact business significant phasing subsidies gas electricity prices expected continue next years well nationwide strategy allocates natural gas conservatively"
],
"awards": [
"ensure robust security 100 readiness times participate international awards rospa bsc awards",
"ensure robust security 100 readiness times participate international awards rospa bsc awards"
],
"security": [
"ensure robust security 100 readiness times participate international awards rospa bsc awards",
"ensure robust security 100 readiness times participate international awards rospa bsc awards"
]
}
}
},
{
"../data/gri/reports/GRI_2018_Report.pdf": {
"COMP": {
...
Desired List:
[
{
"../data/gri/reports/GPIC_Sustainability_Report_2020__-_40_Years_of_Sustainable_Success.pdf": {
"COMP": {
"Behaviour": [
"we focus apply measures four elements safety culture systems processes skills knowledge individuals behaviours attitudes perception leadership"
]
},
"ECON": {
"subsidies": [
"meanwhile main recent regulatory impact business significant phasing subsidies gas electricity prices expected continue next years well nationwide strategy allocates natural gas conservatively"
],
"awards": [
"ensure robust security 100 readiness times participate international awards rospa bsc awards"
],
"security": [
"ensure robust security 100 readiness times participate international awards rospa bsc awards"
]
}
}
},
{
"../data/gri/reports/GRI_2018_Report.pdf": {
"COMP": {
...
Please let me know if I should clarify anything else.
CodePudding user response:
So it sounds like the only duplicates you care about occur within a list of strings, so we can make some assumptions:
- It's only JSON (lists, dicts, strings and primitives)
- If we fail to hash an object, then it can't be a duplicate
- Order of deduped lists doesn't matter
So let's use recursion:
def dedup(obj):
    if isinstance(obj, list):
        try:
            # Try to dedupe as if every element is hashable;
            # this fails for e.g. a list of dicts, so fall back
            # to keeping the list as-is in that case.
            return list({dedup(x) for x in obj})
        except TypeError:
            return [dedup(x) for x in obj]
    elif isinstance(obj, dict):
        return {k: dedup(v) for k, v in obj.items()}
    else:
        # Some kind of primitive (str/int/float/bool/None)
        return obj
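
For example, on a small structure shaped like the one in the question (a minimal sketch with made-up sample data, not the real my_list):

sample = [
    {
        "report.pdf": {
            "ECON": {
                "awards": ["same sentence", "same sentence"],
                "security": ["same sentence"],
            }
        }
    }
]

result = dedup(sample)
# "awards" collapses to ["same sentence"], while the copy under
# "security" survives, because each list is deduped independently.

One caveat: building the deduped list from a set means the surviving elements may come back in a different order, which is fine given the assumptions above. If order ever does matter, one option is to swap the set comprehension for list(dict.fromkeys(dedup(x) for x in obj)), since dicts preserve insertion order in Python 3.7+ and dict.fromkeys raises the same TypeError on unhashable elements, so the fallback still works.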