Pandas how to create new colum in DF1 with values of column in DF2 based on conditions-CodePudding

I have 2 DFs that have different sizes, something like:

DF1:

Metric Market BSegment MSegment Category Num Den Period
 KPI1   GA     PS       Comm      BF     20  100 202201
 KPI1   CE     PR       Cons      BF     70  128 202201
 KPI2   NWE    PR       Cons      On     12  389 202203
  .
  .
  .
 KPI20  ISE    PS       Cons      On     80  288 202204
 KPI30  GA     PS       Comm      BF     32  85  202204

DF2:

Metric Market BSegment MSegment Category StartPeriod EndPeriod Goal
 KPI1   *      PS       Comm      BF        202201     202205   5
 KPI1   *      PR       Cons      BF        202201     202205   10
 KPI2   NWE    PR        *        On        202112     202203   60
  .
  .
  .
 KPI2   ISE    PS        *        On        202112     202203   40
 KPI3   GA     PS       Comm      BF        202201     202206   22
 KPI3   CE     PS       Comm      BF        202201     202206   26
 KPI3   NWE    PS       Comm      BF        202201     202206   20

I am trying to get a new column in DF1 where I bring the values of the 'Goal' column in DF2 when the other columns are matchig. The main problem is that in DF2, in order to not repeat the same line multiple times, when a goal is applied, for example, to all markets instead of a line per market it states '*'.

Also there could be metrics in DF1 that are not appearing in DF2 as there are no goals.

The logic would be something like:

(df1['Metric'] == df2['Metric']) &
(df1['Market'] == df2['Market'] | df2['Market'] == '*') & 
(df1['BSegment'] == df2['BSegment'] | df2['BSegment'] == '*') &
(df1['MSegment'] == df2['MSegment'] | df2['MSegment'] == '*') &
(df1['Category'] == df2['Category'] | df2['Category'] == '*') &
(df2['StartPeriod'] <= df1['Period'] <= df2['EndPeriod'] )

I've tried with np.where and np.select as:

df1['Numerator.Goal'] = (np.where((df1['Metric'] == df2['Metric']) &
(df1['Market'] == df2['Market'] | df2['Market'] == '*') & 
(df1['BSegment'] == df2['BSegment'] | df2['BSegment'] == '*') &
(df1['MSegment'] == df2['MSegment'] | df2['MSegment'] == '*') &
(df1['Category'] == df2['Category'] | df2['Category'] == '*') &
(df2['StartPeriod'] <= df1['Period'] <= df2['EndPeriod'] ), df2['Goal'], np.nan))
--------
df1['Numerator.Goal'] = (np.select((df1['Metric'] == df2['Metric']) &
(df1['Market'] == df2['Market'] | df2['Market'] == '*') & 
(df1['BSegment'] == df2['BSegment'] | df2['BSegment'] == '*') &
(df1['MSegment'] == df2['MSegment'] | df2['MSegment'] == '*') &
(df1['Category'] == df2['Category'] | df2['Category'] == '*') &
(df2['StartPeriod'] <= df1['Period'] <= df2['EndPeriod'] ), df2['Goal']))

but it ends up with a "ValueError: Can only compare identically-labeled Series objects"

I've think also on doing a .merge() but I havn't seen how to deal with the '*' symbol so I have not been successful at it.

By any chance, would you know how to accomplish this? Any help is appreciated!

CodePudding user response：

So let's say you have these two dataframes:

import pandas as pd

df = pd.DataFrame({'Market': ['GA', 'CE'], 
                   'Period': [1, 2]})
df2 = pd.DataFrame({'Market': ['CE', 'MWE', 'GA', '*'], 
                    'Period': [2, 3, 4, 1], 
                    'Goal': [21, 22, 23, 24]})

# df
  Market  Period
0     GA       1
1     CE       2

# df2
  Market  Period  Goal
0     CE       2    21
1    MWE       3    22
2     GA       4    23
3      *       1    24

What you want is to add the Goal values to the first 'df'. Merging gives

>>> df.merge(df2)
  Market  Period  Goal
0     CE       2    21

but you want

  Market  Period  Goal
0     GA       1    24
1     CE       2    21

because what the * actually means is:

# explicit meaning of the *-row
3     CE       1    24
3    MWE       1    24
3     GA       1    24

So one way to make it work is to replace the wildcard with the explicit values and merge afterwards. Here is one naive way to do it, there are probably more reasonable/pandaesque strategies with some additional merging, but for the sake of argument:

markets = ['CE', 'MWE', 'GA']

# Expand the wildcard '*' by replacing each occurrence with all markets
df2['Market'] = df2.Market.apply(lambda x: markets if x == '*' else x)

# Explode the lists into individual rows
df2 = df2.explode('Market')

The result of this is this variant of df2:

  Market  Period  Goal
0     CE       2    21
1    MWE       3    22
2     GA       4    23
3     CE       1    24
3    MWE       1    24
3     GA       1    24

Now, df.merge(df2) results in

  Market  Period  Goal
0     GA       1    24
1     CE       2    21

The same applies to other columns containing the wildcard *. The Period merging is really a separate question altogether. Check out this question, for example.

CodePudding user response：

indexes = ["Metric" ,'Market',  'BSegment'  ,'MSegment',    "Category"]

def get(df, boolean_indexes):
    for index, boolean_index in zip(df.index, boolean_indexes):
        yield not np.any(df.iloc[index][boolean_index].values != "*")
def find(values, period):
    for value in values:
        if value[-3] <= period <= value[-2]:
            return value[-1]
df1.apply(lambda x: find(df2.loc[list(get(df2[indexes], df2[indexes].values != x[indexes].values))][['StartPeriod', 'EndPeriod' ,'Goal']].values, x["Period"]), axis=1)