Having difficulty isolating data from other dataframes while vectorizing/looping through a dataframe.
df
:
NAME | CURRENT_BAL | AVAIL_BAL | TYPE |
---|---|---|---|
Abc | 1,000 | 2,000 | TypeA |
Def | 1,000 | 2,000 | TypeB |
settings
:
Scenario | TypeA | TypeB |
---|---|---|
Default | 1 | 2 |
df
- desired output
NAME | CURRENT_BAL | AVAIL_BAL | TYPE | 02/28/2023 | 03/31/2023 |
---|---|---|---|---|---|
Abc | 1,000 | 2,000 | TypeA | 2,000 | 0 |
Def | 1,000 | 2,000 | TypeB | 1,000 | 1,000 |
The settings
df tells the script how to spread out the Available Balance over the following periods. This is a reduced version of it and I'm looking to use specific month-end dates and longer periods with a variety of settings.
def build_table(df: pd.DataFrame, periods: int = 2, scenario: str = 'Default'):
from datetime import date
def create_entry(df: pd.DataFrame, i: int = 1, start_period: int = 1):
number_periods = settings.loc[settings.Scenario == 'Default', df.Type]
return (df.Avail_Bal / number_periods) if i <= number_periods else 0
for i in range(periods):
df[ (date.toady() pd.offsets.MonthEnd() pd.DateOffset(months=i)).date() ] = \
create_entry(df, i, scenario)
return df
As of now, running into errors with this line in trying to get a conditional value related to 1 row of the dataframe while its being vectorized:
number_periods = settings.loc[settings.Scenario == 'Default', df.Type]
ValueError: The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()
There is a lot of work to be done on this project, but for now I'm trying to understand where I can use vectorization, and when I have to iterate through the rows.
CodePudding user response:
I believe you want to use .apply()
to apply the create_entry
function to your dataframe. In addition the result of your .loc
method is a series, so you need to select the actual value (I am doing this with .max()
df = pd.DataFrame({
'NAME': ['Abc', 'Def'],
'CURRENT_BAL': [1000, 1000],
'AVAIL_BAL': [2000, 2000],
'TYPE': ['TypeA', 'TypeB']
})
settings = pd.DataFrame({'Scenario': ['Default'], 'TypeA': [1], 'TypeB': [2]})
def create_entry(row, i):
number_periods = settings.loc[settings.Scenario == 'Default', row.TYPE].max()
return (row.AVAIL_BAL / number_periods) if i <= number_periods else 0
for i in range(1, 3):
date_col = (date.today() pd.offsets.MonthEnd() pd.DateOffset(months=i)).date()
df[date_col] = df.apply(lambda x: create_entry(x, i), axis=1)
output:
NAME CURRENT_BAL AVAIL_BAL TYPE 2023-03-28 2023-04-28
0 Abc 1000 2000 TypeA 2000.0 0.0
1 Def 1000 2000 TypeB 1000.0 1000.0