I am looking for a more pythonic/efficient (and shorter) way to factorize (= enumerate unique instances of) the row-wise set difference in an aggregated dataframe. The tables below illustrate the manipulation, which should not be too complex:
product | product_group | location |
---|---|---|
10 | 1 | RoW |
11 | 1 | US |
12 | 1 | CA |
13 | 2 | RoW |
14 | 2 | JP |
15 | 2 | US |
16 | 3 | FR |
17 | 3 | BE |
18 | 4 | RoW |
19 | 4 | US |
20 | 4 | CA |
should be transformed, using list_locations = ['US', 'CA', 'JP', 'BE', 'FR']
into a table of the form
product_group | location_list | rest_of_world_location_list | rest_of_world_index |
---|---|---|---|
1 | RoW, US, CA | JP, BE, FR | RoW_1 |
2 | RoW, JP, US | CA, BE, FR | RoW_2 |
4 | RoW, US, CA | JP, BE, FR | RoW_1 |
such that every product group has a column rest_of_world_location_list that lists all the items from list_locations that are not part of that product group. Column rest_of_world_index is simply the factorization of rest_of_world_location_list.
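By "factorization" I mean what pd.factorize does: one integer code per unique value, assigned in order of first appearance. A minimal sketch using the rest-of-world tuples from the expected table above:
import pandas as pd

codes, uniques = pd.factorize(pd.Series([('JP', 'BE', 'FR'), ('CA', 'BE', 'FR'), ('JP', 'BE', 'FR')]))
print(codes)  # [0 1 0] -> RoW_1, RoW_2, RoW_1 after adding 1 and prefixing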
MWE input data:
df = pd.DataFrame(
{
"product": [10,11,12,13,14,15,16,17,18,19,20],
"product_group": [1,1,1,2,2,2,3,3,4,4,4],
"location": ['RoW', 'US', 'CA', 'RoW', 'JP', 'US', 'FR', 'BE', 'RoW', 'US', 'CA']
}
)
list_locations = ['US', 'CA', 'JP', 'BE', 'FR']
My attempt (works, but likely too complicated):
activity_with_rest_of_world_location = pd.DataFrame(df[df['location'] == 'RoW']['product_group'])
activity_with_rest_of_world_location['index'] = activity_with_rest_of_world_location.index
df_rest_of_world = df[df['product_group'].isin(activity_with_rest_of_world_location['product_group'])]
df_rest_of_world = df_rest_of_world.drop(activity_with_rest_of_world_location['index'])
df_rest_of_world_agg = pd.DataFrame(
data = df_rest_of_world.groupby('product_group')['location'].apply(tuple))
df_rest_of_world_agg.reset_index(inplace = True)
df_rest_of_world_agg = df_rest_of_world_agg.merge(
right = activity_with_rest_of_world_location,
how = 'left',
on = 'product_group'
)
df_rest_of_world_agg.set_index(keys = 'index', inplace = True)
df_rest_of_world_agg['location_rest_of_world'] = df_rest_of_world_agg.apply(
lambda row: tuple(set(list_locations) - set(row['location'])),
axis = 1
)
df_rest_of_world_agg = df_rest_of_world_agg.dropna(subset ='location_rest_of_world')
df_rest_of_world_agg['location'] = pd.factorize(df_rest_of_world_agg['location_rest_of_world'])[0]
df_rest_of_world_agg['location'] = 'RoW_' + df_rest_of_world_agg['location'].astype(str)
CodePudding user response:
IIUC, you can use a single pipeline with 3 steps:
world = set(list_locations)
(df.groupby('product_group', as_index=False)
# aggregate locations as a string and the rest of the world from a set difference
.agg(**{'location_list': ('location', ', '.join),
'rest_of_world_location_list': ('location', lambda l: ', '.join(sorted(world.difference(l))))
})
# keep only the groups whose location_list contains RoW
.loc[lambda d: d['location_list'].str.contains('RoW')]
# add the RoW_ index from category codes
.assign(rest_of_world_index=lambda d: 'RoW_' + d['location_list'].astype('category').cat.codes.add(1).astype(str)
)
)
output:
product_group location_list rest_of_world_location_list rest_of_world_index
0 1 RoW, US, CA BE, FR, JP RoW_2
1 2 RoW, JP, US BE, CA, FR RoW_1
3 4 RoW, US, CA BE, FR, JP RoW_2
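Note that rest_of_world_index above comes from alphabetically ordered category codes, which is why product_group 1 ends up as RoW_2. If the numbering should instead follow order of first appearance, as in the expected table in the question, pd.factorize can replace the category codes in the last step; a sketch under that assumption, reusing pd, df and list_locations from the MWE:
world = set(list_locations)
out = (df.groupby('product_group', as_index=False)
         .agg(**{'location_list': ('location', ', '.join),
                 'rest_of_world_location_list': ('location', lambda l: ', '.join(sorted(world.difference(l))))})
         .loc[lambda d: d['location_list'].str.contains('RoW')]
         # pd.factorize numbers unique rest-of-world lists by first appearance,
         # so product_group 1 gets RoW_1
         .assign(rest_of_world_index=lambda d: 'RoW_'
                 + pd.Series(pd.factorize(d['rest_of_world_location_list'])[0] + 1, index=d.index).astype(str))
      )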
CodePudding user response:
Solution with sets - create sets per group, filter out rows without RoW, get the differences with join, and last use factorize with frozensets:
list_io_countries = ['US', 'CA', 'JP', 'BE', 'FR']
s = set(list_io_countries)
df = df.groupby(df['product_group'])['location'].agg(set).reset_index(name='location_list')
df = (df[['RoW' in x for x in df['location_list']]]
.assign(rest_of_world_location_list = lambda x: x['location_list'].apply(lambda x: ','.join(s - x)),
rest_of_world_index = lambda x: pd.factorize(x['location_list'].apply(lambda x: frozenset(x - set(['RoW']))))[0] + 1,
location_list = lambda x: x['location_list'].agg(','.join)
)
.assign(rest_of_world_index = lambda x: 'RoW_' + x['rest_of_world_index'].astype(str)))
print (df)
product_group location_list rest_of_world_location_list rest_of_world_index
0 1 RoW,CA,US JP,BE,FR RoW_1
1 2 RoW,JP,US CA,BE,FR RoW_2
3 4 RoW,CA,US JP,BE,FR RoW_1
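One caveat with the set-based version: ','.join over a plain set (as in location_list and rest_of_world_location_list above, e.g. RoW,CA,US) follows set iteration order, which is not guaranteed. If a reproducible string is wanted, sorting before joining is a small tweak; a sketch with group_locations standing in for one group's aggregated set:
s = set(['US', 'CA', 'JP', 'BE', 'FR'])
group_locations = {'RoW', 'CA', 'US'}   # e.g. the set built for product_group 1

print(','.join(s - group_locations))          # order follows set iteration, not guaranteed
print(','.join(sorted(s - group_locations)))  # deterministic: BE,FR,JP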