I am looping over a directory with a for loop and reading some JSON files. From each file I parse out 4 keys, and then I create a CSV file with all the parsed data.
It happens that I have duplicate entries, so I want to eliminate the duplicates, keeping the newer entry based on date, and then rewrite the CSV. I'm not sure how to implement this.
E.g.:
import csv
import json
import os
import time
from datetime import datetime

def mdy_to_ymd(d):
    # convert the date into a comparable struct_time
    cor_date = datetime.strptime(d, '%b %d %Y').strftime('%d/%m/%Y')
    return time.strptime(cor_date, "%d/%m/%Y")

def date_converter(date):  # convert the date to a readable string for the csv
    return datetime.strptime(date, '%b %d %Y').strftime('%d/%m/%Y')

def csv_generator(path):  # creating the csv rows
    list_json = []
    ffresult = []
    duplicate_dict = {}
    for file in os.listdir(path):  # iterating through the directory with the files
        fresult = []
        with open(f"{path}/{file}", "r") as result:  # opening the json file
            templates = json.load(result)
        hostname_str = file.split(".")
        site_code_str = file[:5]
        # datetime_str2 (the raw date string) is not defined in this snippet
        datetime_str3 = mdy_to_ymd(datetime_str2)  # converting the date to comparable data
        duplicate_dict[hostname_str[0]] = datetime_str3
        # ?? I am creating a dictionary with the hostname as key and the date as value,
        # but it doesn't work: when the same hostname appears again it only overwrites
        # the existing key, so there are no duplicates, but nothing guarantees that
        # the entry kept is the newest one based on date
        fresult.append(site_code_str)
        fresult.append(hostname_str[0])
        fresult.append(templates["execution_status"])
        fresult.append(date_converter(datetime_str2))
        fresult.append(templates["protocol_name"])
        fresult.append(templates["protocol_version"])
        ffresult.append(fresult)
        # I append the values I need into 2 lists
    return ffresult

with open("jsondicts.csv", "w", newline="") as dst:
    writetoit = csv.writer(dst)
    writetoit.writerows(csv_generator(directory))
    # this is how I write to the csv, so right now I have duplicate values in the csv
I want only unique rows based on hostname, and for each hostname only the newest one based on the date, together with the other parsed data (protocol name, site code, etc.).
CodePudding user response:
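This solved it; I had to use the pandas library, though. Sort by date, then keep only the last (newest) row per hostname. This assumes Execution_Date sorts chronologically, e.g. it holds datetimes rather than dd/mm/YYYY strings:
result_pan_xls = result_pan.sort_values(by="Execution_Date").drop_duplicates(subset="HOSTNAME", keep="last")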
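For comparison, the same "keep the newest row per hostname" step can probably be done without pandas, using a plain dict keyed by hostname. This is only a rough sketch, not the code from the question: the helper name keep_newest, its parameters and the sample rows are made up for illustration. It assumes each row is laid out as in the question (hostname at index 1, date at index 3 in dd/mm/YYYY form). The dates are parsed before comparing, because dd/mm/YYYY strings do not sort chronologically as plain text.

```python
from datetime import datetime


def keep_newest(rows, key_index=1, date_index=3, date_format="%d/%m/%Y"):
    """Return one row per key, keeping the row with the latest date.

    Hypothetical helper: the index defaults assume rows shaped like
    [site_code, hostname, status, date, protocol_name, protocol_version].
    """
    newest = {}
    for row in rows:
        key = row[key_index]
        when = datetime.strptime(row[date_index], date_format)
        # Replace the stored row only when this one is strictly newer.
        if key not in newest or when > newest[key][0]:
            newest[key] = (when, row)
    return [row for _, row in newest.values()]


# Made-up sample rows, not data from the question.
rows = [
    ["SITE1", "host-a", "OK", "05/01/2023", "proto", "1.0"],
    ["SITE1", "host-a", "OK", "20/03/2023", "proto", "1.1"],
    ["SITE2", "host-b", "FAILED", "11/02/2023", "proto", "1.0"],
    ["SITE1", "host-a", "OK", "28/02/2023", "proto", "1.0"],
]
unique_rows = keep_newest(rows)
```

With something like this, the rows would be filtered before writing, e.g. writetoit.writerows(keep_newest(csv_generator(directory))). The difference from the dictionary attempt in the question is that the stored entry is only replaced after comparing dates, instead of being overwritten by whichever file happens to be read last.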