I have a tree Locations
which is has a Continent -> Country -> Location hierarchy. Now I have a dataframe which per row has a list entries of this tree.
How can I replace the entries of the list per row by the leaf's its tree.
My creativity in apply
or map
and possible a lambda
function is lacking.
Minimal example;
import pandas as pd
Locations = {
'Europe':
{'Germany': ['Berlin'],
'France': ['Paris','Bordeaux']},
'Asia':
{'China': ['Hong Kong'],
'Indonesia': ['Jakarta']},
'North America':
{'United States':['New York','Washington']}}
df = pd.DataFrame({'Persons': ['A', 'B'], 'Locations': [
['North America','United States','Asia','France'],
['North America','Asia','Europe','Germany']]})
df = df.apply(...)?
df = df.map(...)?
# How to end up with:
pd.DataFrame({'Persons': ['A', 'B'], 'Locations': [
['New York','Washington','Hong Kong','Jakarta','Paris','Bordeaux'],
['New York','Washington','Hong Kong','Jakarta','Paris','Bordeaux','Berlin']]})
# Note the order of the locations doesn't matter is also OK
pd.DataFrame({'Persons': ['A', 'B'], 'Locations': [
['Jakarta','Washington','Hong Kong','Paris','New York','Bordeaux'],
['Jakarta','Berlin','Washington','Hong Kong','Paris','New York','Bordeaux']]})
CodePudding user response:
You do not really need the apply
method. You can start by changing the structure of your Locations
dictionary in order to map the actual values to your exploded data frame. Then, just combine several explode
, drop_duplicates
and groupby
statements with different aggregation logics to produce your desired result.
Code:
import pandas as pd
from collections import defaultdict
from itertools import chain
Locations = {
'Europe':{'Germany': ['Berlin'], 'France': ['Paris','Bordeaux']},
'Asia': {'China': ['Hong Kong'], 'Indonesia': ['Jakarta']},
'North America': {'United States': ['New York','Washington']}
}
df = pd.DataFrame({'Persons':['A', 'B'], 'Locations': [['North America','United States','Asia','France'], ['North America','Asia','Europe']]})
mapping_locs = defaultdict(list)
for key, val in Locations.items():
mapping_locs[key] = list(chain.from_iterable(list(val.values())))
for lkey, lval in val.items():
mapping_locs[lkey] = lval
(
df.assign(
mapped_locations=(
df.explode("Locations")["Locations"].map(mapping_locs).reset_index()
.explode("Locations").drop_duplicates(subset=["index", "Locations"])
.groupby(level=0).agg({"index": "first", "Locations": list})
.groupby("index").apply(lambda x: list(chain.from_iterable(x["Locations"])))
)
)
)
Output:
Persons Locations mapped_locations
0 A [North America, United States, Asia, France] [New York, Washington, Hong Kong, Jakarta, Par...
1 B [North America, Asia, Europe] [New York, Washington, Hong Kong, Jakarta, Ber...