How to remove duplicate entries in list and merge non-duplicates

Time:06-24

I have a dict like this:

seriesdict={'series':[],'id':[]}

where series = the series name and id = the unique id associated with the book in calibre

This dict is appended for each series to a list

sorted_data=[]

As this list will be used to request data from the internet, I'd like to reduce the number of requests I have to make, both to save time and to reduce traffic on the site. I'd like to check each series only once and then move on to the next one.

I have already sorted the list by series, but I am struggling with how to check whether a series is already in the list and, if so, how to add the following ids to the entry that was added first.

This is what I've tried so far:

for entry in seriesdict:
    if entry['series'] not in sortedseriesdict['series']:
        sortedseriesdict['series']=entry['series']
        sortedseriesdict['ids']=entry['id']
        sorted_data.append(sortedseriesdict.copy())
    elif entry['series'] in sortedseriesdict['series']:
        sortedseriesdict['ids']=entry['id']
        sorted_data.append(sortedseriesdict.copy())

This iteration question seems similar, but I am unsure if it could help in my case, as the ids being added have to keep all old data as well.

This is a part of the list:

[{'index': 237, 'series': '5 Centimeters per Second', 'id': '13050'},
{'index': 303, 'series': '86 EIGHTY-SIX', 'id': '9809'},
{'index': 304, 'series': '86 EIGHTY-SIX', 'id': '13540'},
{'index': 305, 'series': '86 EIGHTY-SIX', 'id': '9289'},
{'index': 306, 'series': '86 EIGHTY-SIX', 'id': '13323'},
{'index': 307, 'series': '86 EIGHTY-SIX', 'id': '10783'},
{'index': 309, 'series': '86 EIGHTY-SIX', 'id': '12084'},
{'index': 310, 'series': '86 EIGHTY-SIX', 'id': '10943'},
{'index': 311, 'series': '86 EIGHTY-SIX', 'id': '9202'},
{'index': 2329, 'series': 'A Certain Magical Index', 'id': '12843'}]

I would like to create the seriesdict so that the sorted_data looks like this:

[{'series': '5 Centimeters per Second', 'ids': '13050'},
 {'series': '86 EIGHTY-SIX', 'ids': '9809, 13540, 9289, 13323, 10783, 12084, 10943, 9202'},
 {'series': 'A Certain Magical Index', 'ids': '12843'},
 ...
]

How can I do that, if it is possible?
Any answer is appreciated.

CodePudding user response:

Since you seem to be dealing with series data, I would suggest using the pandas library. It saves a lot of hassle and tinkering, so I will propose a solution with pandas. First we take your list of seriesdict entries (called series_dict here) and convert it to a pandas.DataFrame object.

import pandas as pd


series_dict = [
    {"index": 237, "series": "5 Centimeters per Second", "id": "13050"},
    {"index": 303, "series": "86 EIGHTY-SIX", "id": "9809"},
    {"index": 304, "series": "86 EIGHTY-SIX", "id": "13540"},
    {"index": 305, "series": "86 EIGHTY-SIX", "id": "9289"},
    {"index": 306, "series": "86 EIGHTY-SIX", "id": "13323"},
    {"index": 307, "series": "86 EIGHTY-SIX", "id": "10783"},
    {"index": 309, "series": "86 EIGHTY-SIX", "id": "12084"},
    {"index": 310, "series": "86 EIGHTY-SIX", "id": "10943"},
    {"index": 311, "series": "86 EIGHTY-SIX", "id": "9202"},
    {"index": 2329, "series": "A Certain Magical Index", "id": "12843"},
]

df = pd.DataFrame(series_dict)

Now df contains all the data we need, and we can start reshaping it as you wish. We group the data by series, take the id column of the result, and apply a function that joins the column values with ", ". Resetting the index then turns the grouped result back into a regular dataframe with series and id columns.

df = df.groupby("series")["id"].apply(", ".join).reset_index()

Now if we print the result with:

print(df)

we get

                     series                                                 id
0  5 Centimeters per Second                                              13050
1             86 EIGHTY-SIX  9809, 13540, 9289, 13323, 10783, 12084, 10943,...
2   A Certain Magical Index                                              12843
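
If the ids are needed one by one later (for example, to loop over them after a single request per series), a variation is to keep them as a list rather than a joined string. The following is only a sketch under that assumption; the names rows, frame and grouped are made up for the example, and the sample data is a shortened copy of the list from the question.

```python
import pandas as pd

# Shortened sample of the rows from the question.
rows = [
    {"index": 237, "series": "5 Centimeters per Second", "id": "13050"},
    {"index": 303, "series": "86 EIGHTY-SIX", "id": "9809"},
    {"index": 304, "series": "86 EIGHTY-SIX", "id": "13540"},
    {"index": 2329, "series": "A Certain Magical Index", "id": "12843"},
]

frame = pd.DataFrame(rows)

# Collect the ids of each series into a Python list and call the column "ids".
grouped = frame.groupby("series")["id"].agg(list).reset_index(name="ids")

print(grouped)
```

Each row of grouped then holds one series name and a list of its ids, so the ids do not have to be split apart again afterwards.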

If you really want to have the data in the structure you proposed,

my_data = df.to_dict(orient="records")

would return

[{'series': '5 Centimeters per Second', 'id': '13050'},
 {'series': '86 EIGHTY-SIX', 'id': '9809, 13540, 9289, 13323, 10783, 12084, 10943, 9202'},
 {'series': 'A Certain Magical Index', 'id': '12843'}]
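
If you would rather not depend on pandas, the same merge can probably be done with a plain dict. This is a minimal sketch, not a tested drop-in for your script: the function name merge_ids_by_series and the variable rows are hypothetical, and it assumes every entry has 'series' and 'id' keys with string ids, as in the sample from the question.

```python
def merge_ids_by_series(rows):
    """Merge entries that share a series name into one entry with joined ids."""
    merged = {}
    for row in rows:
        # setdefault creates the list the first time a series name is seen;
        # later entries for the same series append to that same list.
        merged.setdefault(row["series"], []).append(row["id"])
    # Dicts keep insertion order (Python 3.7+), so the series stay in the
    # order in which they first appeared in the input.
    return [
        {"series": series, "ids": ", ".join(ids)}
        for series, ids in merged.items()
    ]


# Shortened sample of the rows from the question.
rows = [
    {"index": 237, "series": "5 Centimeters per Second", "id": "13050"},
    {"index": 303, "series": "86 EIGHTY-SIX", "id": "9809"},
    {"index": 304, "series": "86 EIGHTY-SIX", "id": "13540"},
    {"index": 2329, "series": "A Certain Magical Index", "id": "12843"},
]

sorted_data = merge_ids_by_series(rows)
print(sorted_data)
```

Because the lookup goes through a dict, this does not rely on the list being sorted by series beforehand, and each series ends up in sorted_data exactly once.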