I have 2 DFs that have different sizes, something like:
DF1:
Metric Market BSegment MSegment Category Num Den Period
KPI1 GA PS Comm BF 20 100 202201
KPI1 CE PR Cons BF 70 128 202201
KPI2 NWE PR Cons On 12 389 202203
.
.
.
KPI20 ISE PS Cons On 80 288 202204
KPI30 GA PS Comm BF 32 85 202204
DF2:
Metric Market BSegment MSegment Category StartPeriod EndPeriod Goal
KPI1 * PS Comm BF 202201 202205 5
KPI1 * PR Cons BF 202201 202205 10
KPI2 NWE PR * On 202112 202203 60
.
.
.
KPI2 ISE PS * On 202112 202203 40
KPI3 GA PS Comm BF 202201 202206 22
KPI3 CE PS Comm BF 202201 202206 26
KPI3 NWE PS Comm BF 202201 202206 20
I am trying to get a new column in DF1 where I bring the values of the 'Goal' column in DF2 when the other columns are matchig. The main problem is that in DF2, in order to not repeat the same line multiple times, when a goal is applied, for example, to all markets instead of a line per market it states '*'.
Also there could be metrics in DF1 that are not appearing in DF2 as there are no goals.
The logic would be something like:
(df1['Metric'] == df2['Metric']) &
(df1['Market'] == df2['Market'] | df2['Market'] == '*') &
(df1['BSegment'] == df2['BSegment'] | df2['BSegment'] == '*') &
(df1['MSegment'] == df2['MSegment'] | df2['MSegment'] == '*') &
(df1['Category'] == df2['Category'] | df2['Category'] == '*') &
(df2['StartPeriod'] <= df1['Period'] <= df2['EndPeriod'] )
I've tried with np.where and np.select as:
df1['Numerator.Goal'] = (np.where((df1['Metric'] == df2['Metric']) &
(df1['Market'] == df2['Market'] | df2['Market'] == '*') &
(df1['BSegment'] == df2['BSegment'] | df2['BSegment'] == '*') &
(df1['MSegment'] == df2['MSegment'] | df2['MSegment'] == '*') &
(df1['Category'] == df2['Category'] | df2['Category'] == '*') &
(df2['StartPeriod'] <= df1['Period'] <= df2['EndPeriod'] ), df2['Goal'], np.nan))
--------
df1['Numerator.Goal'] = (np.select((df1['Metric'] == df2['Metric']) &
(df1['Market'] == df2['Market'] | df2['Market'] == '*') &
(df1['BSegment'] == df2['BSegment'] | df2['BSegment'] == '*') &
(df1['MSegment'] == df2['MSegment'] | df2['MSegment'] == '*') &
(df1['Category'] == df2['Category'] | df2['Category'] == '*') &
(df2['StartPeriod'] <= df1['Period'] <= df2['EndPeriod'] ), df2['Goal']))
but it ends up with a "ValueError: Can only compare identically-labeled Series objects"
I've think also on doing a .merge() but I havn't seen how to deal with the '*' symbol so I have not been successful at it.
By any chance, would you know how to accomplish this? Any help is appreciated!
CodePudding user response:
So let's say you have these two dataframes:
import pandas as pd
df = pd.DataFrame({'Market': ['GA', 'CE'],
'Period': [1, 2]})
df2 = pd.DataFrame({'Market': ['CE', 'MWE', 'GA', '*'],
'Period': [2, 3, 4, 1],
'Goal': [21, 22, 23, 24]})
# df
Market Period
0 GA 1
1 CE 2
# df2
Market Period Goal
0 CE 2 21
1 MWE 3 22
2 GA 4 23
3 * 1 24
What you want is to add the Goal
values to the first 'df'. Merging gives
>>> df.merge(df2)
Market Period Goal
0 CE 2 21
but you want
Market Period Goal
0 GA 1 24
1 CE 2 21
because what the *
actually means is:
# explicit meaning of the *-row
3 CE 1 24
3 MWE 1 24
3 GA 1 24
So one way to make it work is to replace the wildcard with the explicit values and merge afterwards. Here is one naive way to do it, there are probably more reasonable/pandaesque strategies with some additional merging, but for the sake of argument:
markets = ['CE', 'MWE', 'GA']
# Expand the wildcard '*' by replacing each occurrence with all markets
df2['Market'] = df2.Market.apply(lambda x: markets if x == '*' else x)
# Explode the lists into individual rows
df2 = df2.explode('Market')
The result of this is this variant of df2
:
Market Period Goal
0 CE 2 21
1 MWE 3 22
2 GA 4 23
3 CE 1 24
3 MWE 1 24
3 GA 1 24
Now, df.merge(df2)
results in
Market Period Goal
0 GA 1 24
1 CE 2 21
The same applies to other columns containing the wildcard *
. The Period
merging is really a separate question altogether. Check out this question, for example.
CodePudding user response:
indexes = ["Metric" ,'Market', 'BSegment' ,'MSegment', "Category"]
def get(df, boolean_indexes):
for index, boolean_index in zip(df.index, boolean_indexes):
yield not np.any(df.iloc[index][boolean_index].values != "*")
def find(values, period):
for value in values:
if value[-3] <= period <= value[-2]:
return value[-1]
df1.apply(lambda x: find(df2.loc[list(get(df2[indexes], df2[indexes].values != x[indexes].values))][['StartPeriod', 'EndPeriod' ,'Goal']].values, x["Period"]), axis=1)