Replace list of items in Pandas dataframe by tree leafs-CodePudding

I have a tree Locations which is has a Continent -> Country -> Location hierarchy. Now I have a dataframe which per row has a list entries of this tree.

How can I replace the entries of the list per row by the leaf's its tree.

My creativity in apply or map and possible a lambda function is lacking.

Minimal example;

import pandas as pd

Locations = {
            'Europe':
                {'Germany':      ['Berlin'],
                 'France':       ['Paris','Bordeaux']},
            'Asia':
                {'China':        ['Hong Kong'],
                 'Indonesia':    ['Jakarta']},
            'North America':
                {'United States':['New York','Washington']}}

df = pd.DataFrame({'Persons': ['A', 'B'], 'Locations': [
    ['North America','United States','Asia','France'], 
    ['North America','Asia','Europe','Germany']]})

df = df.apply(...)?
df = df.map(...)?

# How to end up with:
pd.DataFrame({'Persons': ['A', 'B'], 'Locations': [
    ['New York','Washington','Hong Kong','Jakarta','Paris','Bordeaux'], 
    ['New York','Washington','Hong Kong','Jakarta','Paris','Bordeaux','Berlin']]})

# Note the order of the locations doesn't matter is also OK
pd.DataFrame({'Persons': ['A', 'B'], 'Locations': [
    ['Jakarta','Washington','Hong Kong','Paris','New York','Bordeaux'],
    ['Jakarta','Berlin','Washington','Hong Kong','Paris','New York','Bordeaux']]})

CodePudding user response：

You do not really need the apply method. You can start by changing the structure of your Locations dictionary in order to map the actual values to your exploded data frame. Then, just combine several explode, drop_duplicates and groupby statements with different aggregation logics to produce your desired result.

Code:

import pandas as pd
from collections import defaultdict
from itertools import chain

Locations = {
    'Europe':{'Germany': ['Berlin'], 'France': ['Paris','Bordeaux']},
    'Asia': {'China': ['Hong Kong'], 'Indonesia': ['Jakarta']},
    'North America': {'United States': ['New York','Washington']}
}

df = pd.DataFrame({'Persons':['A', 'B'], 'Locations': [['North America','United States','Asia','France'], ['North America','Asia','Europe']]})

mapping_locs = defaultdict(list)

for key, val in Locations.items():
    mapping_locs[key] = list(chain.from_iterable(list(val.values())))
    for lkey, lval in val.items():
        mapping_locs[lkey] = lval
    

(
    df.assign(
        mapped_locations=(
            df.explode("Locations")["Locations"].map(mapping_locs).reset_index()
            .explode("Locations").drop_duplicates(subset=["index", "Locations"])
            .groupby(level=0).agg({"index": "first", "Locations": list})
            .groupby("index").apply(lambda x: list(chain.from_iterable(x["Locations"])))
        )
    )
)

Output:

    Persons  Locations                                     mapped_locations
0   A        [North America, United States, Asia, France]  [New York, Washington, Hong Kong, Jakarta, Par...
1   B        [North America, Asia, Europe]                 [New York, Washington, Hong Kong, Jakarta, Ber...