Sort pandas data frame based on values assigned in separate dictionary

Time:04-07

I have Excel tables of biographic events ranging from birth to death. In between, there are all kinds of professional and personal events, such as university graduation, marriage, or retirement. The preferred course of action would be to sort those by date for each person in the data set, but as these are historical data, we have no exact dates for most events.

My plan therefore is to assign each event a chronological value, from "1" for birth to "100" for death, and to rank the other events in between so that fuzzy timelines can be reconstructed.

For the event coding, I have created a dictionary:


event_value_dict={"Sonstiges":0, 
                  "Geburt":1, 
                  "Taufe":2, 
                  "Primäre Bildungsstation":3, 
                  "Privatunterricht":3,
                  "Rezeption":4, # not sure whether this relates to university studies
                  "Zulassung":9, # before university studies, or e.g. also admission to an exam?
                  "Immatrikulation":10,
                  "Studium":11,
                  "Prüfungsverfahren":11,
                  "Graduation":12,
                  "Praktikum":13,
                  "Promotion":14,
                  "Wohnsitznahme": 20,
                  "Reise":20, # events with code "20" can occur several times in mid-life
                  "Nobilitierung":20,
                  "Aufnahme":20,
                  "Aufschwörung":20,
                  "Eheschließung":20,
                  "Funktionsausübung":20,
                  "erfolglose Bewerbung":20,
                  "Rejektion":20,
                  "Aufenthalt":20,
                  "mittelbare Nobilitierung":20,
                  " Privilegierung":20,
                  "Wappenbesserung":20,
                  "Introduktion":30, # relating to what?
                  "Mitgliedschaft":30,
                  "Gesandtschaft":30, # presumably not for very young persons?
                  "Präsentation":30, # not sure what this is...
                  "Vokation":39, # call to a university post?
                  "Ernennung":40,
                  "Amtseinführung":41,
                  "Vereidigung":41,
                  "Amtsantritt":42,
                  "Beförderung":44, # how often are people promoted on average?
                  "Ehrung":45, # presumably for persons from mid-life onwards?
                  "Entlassung":50,
                  "Suspendierung":50,
                  "Absetzung":50,
                  "Resignation":50,
                  "Rücktritt":50,
                  "Pensionierung":90,
                  "Pension":91,
                  "Tod":100}

Is there a way to apply these values to a data frame when sorting by column without adding a new column or overwriting the old one? As we refine our research, these values will most likely change and new ones might be added, so I am looking for the most flexible system possible.

My current script correctly identifies unique names in the data sets and gives me all the events connected with each person (res_df). For sorting, I am currently using:

res_sorted=res_df.sort_values(by =['event_type', 'event_start']) # sort by event, then date

At the moment, the event names are sorted alphabetically. How can I (temporarily) replace them with the values from the above dictionary for sorting?

I think that the map function might be a good solution (cf. https://kanoki.org/2019/04/06/pandas-map-dictionary-values-with-dataframe-columns/), but perhaps users here have other ideas.

CodePudding user response:

Maybe it's a naive idea, but I would duplicate your event_type column and use your dictionary to replace the event names in the copy with the numbers:

res_df = res_df.replace({"event_type_duplicated_column": event_value_dict})  # replace() returns a new frame, so assign it

And sort by this new column with the replaced values.
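If the extra column should not remain in the result, the same idea can be sketched with a helper column that only exists for the duration of the sort. The sample data and the column name `_event_rank` below are made up for illustration; `event_type` and `event_start` follow the question, and the dictionary is shortened.

```python
import pandas as pd

# Shortened dictionary and invented sample rows, for illustration only
event_value_dict = {"Geburt": 1, "Studium": 11, "Eheschließung": 20, "Tod": 100}
res_df = pd.DataFrame({
    "person": ["A", "A", "A", "A"],
    "event_type": ["Tod", "Eheschließung", "Geburt", "Studium"],
    "event_start": [1790, 1750, 1720, 1740],
})

res_sorted = (
    res_df
    # hypothetical helper column holding the dictionary value of each event
    .assign(_event_rank=res_df["event_type"].map(event_value_dict))
    # sort by dictionary value first, then by date
    .sort_values(["_event_rank", "event_start"])
    # drop the helper so the result has the original columns only
    .drop(columns="_event_rank")
)
print(res_sorted)
```

Because `assign` returns a new frame, `res_df` itself is left unchanged, and the dictionary can be edited freely between runs.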

CodePudding user response:

Assuming the dict is ordered as it is now...

I really like the answer found here.

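import numpy as np

# Treat the column as categorical so it can carry a custom sort order
df['event_value'] = df['event_value'].astype('category')
# Labels present in the data but missing from the dict go to the end
not_in_list = df['event_value'].cat.categories.difference(list(event_value_dict))
df['event_value'] = df['event_value'].cat.set_categories(np.hstack((list(event_value_dict), not_in_list)), ordered=True)

# Sorting now follows the dict's key order, not the alphabet
df = df.sort_values('event_value')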
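Note that the categorical approach orders events by their position in the dict, so events that share a value (such as the many entries with 20) are not treated as ties. A possible alternative that uses the numbers themselves, without adding or overwriting a column, is the `key` argument of `sort_values` (available in pandas 1.1 and later, as far as I know). This is only a sketch: the sample data and the helper name `event_rank` are invented, and the dictionary is shortened.

```python
import pandas as pd

# Shortened dictionary and invented sample rows, for illustration only
event_value_dict = {"Sonstiges": 0, "Geburt": 1, "Reise": 20,
                    "Eheschließung": 20, "Tod": 100}
res_df = pd.DataFrame({
    "event_type": ["Tod", "Reise", "Eheschließung", "Geburt"],
    "event_start": [1790, 1760, 1750, 1720],
})

def event_rank(col):
    """Hypothetical key function: translate event names, pass other columns through."""
    # sort_values calls the key once per sort column, so check the column name
    if col.name == "event_type":
        return col.map(event_value_dict)
    return col

# The data in event_type stays untouched; the mapping is used for ordering only
res_sorted = res_df.sort_values(by=["event_type", "event_start"], key=event_rank)
print(res_sorted)
```

Here "Reise" and "Eheschließung" both map to 20, so their order is decided by `event_start`. Event names missing from the dictionary map to NaN and should end up at the end with the default `na_position="last"`.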