python dictionary check for duplicates based on date

Time:03-02

I am looping over a directory and reading some JSON files. From each file I parse out 4 keys, and then I create a CSV file with all the parsed data.

Some entries turn out to be duplicated, so I want to eliminate duplicates, keeping the newer entry by date, and then rewrite the CSV. I am not sure how to implement this.

For example:

import csv
import json
import os
from datetime import datetime


def mdy_to_ymd(d):
    # parse a date such as "Mar 02 2021" into a datetime object that can be compared
    return datetime.strptime(d, '%b %d %Y')


def date_converter(date):  # convert the date to readable string for csv
    return datetime.strptime(date, '%b %d %Y').strftime('%d/%m/%Y')


def csv_generator(path):  # build the rows for the csv
    ffresult = []
    duplicate_dict = {}
    for file in os.listdir(path):  # iterate through the directory with the files
        fresult = []
        with open(os.path.join(path, file), "r") as result:  # open the json file
            templates = json.load(result)
        hostname_str = file.split(".")
        site_code_str = file[:5]
        datetime_str3 = mdy_to_ymd(datetime_str2)  # convert the date to comparable data
        duplicate_dict[hostname_str[0]] = datetime_str3
        # ?? I am creating a dictionary with the hostname as key and the date as value,
        # but it does not work: when the same hostname appears again, the existing key
        # is simply overwritten. That leaves no duplicates, but it does not guarantee
        # that the entry kept is the newest one by date.
        fresult.append(site_code_str)
        fresult.append(hostname_str[0])
        fresult.append(templates["execution_status"])
        fresult.append(date_converter(datetime_str2))
        fresult.append(templates["protocol_name"])
        fresult.append(templates["protocol_version"])
        ffresult.append(fresult)
    return ffresult


# I append the values I need into 2 lists (one per file, one for all files)
with open("jsondicts.csv", "w", newline="") as dst:
    writetoit = csv.writer(dst)
    writetoit.writerows(csv_generator(directory))
# this is how I write to the csv, so right now the csv contains duplicate values

I want only one row per hostname, and for each hostname it should be the newest one by date, together with its other parsed data (protocol name, site code, etc.).
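A minimal sketch of that requirement using only the standard library, assuming each row has the same layout as `fresult` above (hostname at index 1, date at index 3 in `dd/mm/YYYY` form). The function name `keep_newest`, its parameters, and the sample rows are hypothetical and only illustrate the idea of comparing dates before overwriting a dictionary entry:

```python
from datetime import datetime


def keep_newest(rows, key_index=1, date_index=3, date_format="%d/%m/%Y"):
    """Return one row per key, keeping the row with the latest date."""
    newest = {}  # maps hostname -> (parsed date, full row)
    for row in rows:
        key = row[key_index]
        row_date = datetime.strptime(row[date_index], date_format)
        # Only replace the stored row when this one is strictly newer.
        if key not in newest or row_date > newest[key][0]:
            newest[key] = (row_date, row)
    return [row for _, row in newest.values()]


# Made-up sample rows: site code, hostname, status, date, protocol name, version
rows = [
    ["SITE1", "host-a", "ok", "15/03/2021", "proto", "1.1"],
    ["SITE1", "host-a", "ok", "01/02/2021", "proto", "1.0"],
    ["SITE2", "host-b", "failed", "20/01/2021", "proto", "1.0"],
]
unique_rows = keep_newest(rows)
```

The result could then be passed to `csv.writer(...).writerows(...)` in place of the unfiltered list, so the CSV is written once with the duplicates already removed.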

CodePudding user response:

This solves it, though I had to use the pandas library:

result_pan_xls = result_pan.sort_values(by="Execution_Date").drop_duplicates(subset="HOSTNAME", keep="last")
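One caveat worth illustrating: if `Execution_Date` holds strings such as `dd/mm/YYYY`, sorting is alphabetical rather than chronological, so the row kept as "last" may not be the newest. The sketch below assumes that date format (taken from `date_converter` in the question) and uses a made-up DataFrame; the `Protocol_Version` column name and the sample values are hypothetical.

```python
import pandas as pd

# Made-up sample data; only HOSTNAME and Execution_Date come from the answer above.
result_pan = pd.DataFrame({
    "HOSTNAME": ["host-a", "host-a", "host-b"],
    "Execution_Date": ["15/03/2021", "01/04/2021", "20/01/2021"],
    "Protocol_Version": ["1.0", "1.1", "1.0"],
})

# Parse the strings into real dates so that sorting is chronological.
result_pan["Execution_Date"] = pd.to_datetime(
    result_pan["Execution_Date"], format="%d/%m/%Y"
)

# Sort oldest to newest, then keep the last (newest) row for each hostname.
newest = result_pan.sort_values(by="Execution_Date").drop_duplicates(
    subset="HOSTNAME", keep="last"
)
```

Without the `pd.to_datetime` step, the string "15/03/2021" would sort after "01/04/2021", and the older row for `host-a` would be the one kept.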