Data wrangling in Python, calculate value from some conditions-CodePudding

I have a dataframe in Python below:

import pandas as pd
df = pd.DataFrame({
    'CRDACCT_DLQ_CYC_1_MNTH_AGO' : [3, 2, 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'], 
    'CRDACCT_DLQ_CYC_2_MNTH_AGO': [4, 3, 3, 3, 3, 3, 2, 0, 5, 4, 3, 2, 0, 2, 2, 2, 2, 2, 2, 0, 2, 2, 0, 2], 
    'CRDACCT_DLQ_CYC_3_MNTH_AGO': [8, 7, 6, 5, 4, 3, 2, 'F', 'F', 0, 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'F', 'C', 'C', 'F', 'F'], 
    'CRDACCT_DLQ_CYC_4_MNTH_AGO' : [0, 2, 'F', 'F', 'C', 'C', 'C', 'C', 0, 2, 0, 2, 0, 2, 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'F', 'C', 'F'], 
    'CRDACCT_DLQ_CYC_5_MNTH_AGO' : [2, 2, 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'], 
    'CRDACCT_DLQ_CYC_6_MNTH_AGO' : [2, 2, 2, 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 0, 2, 0, 2, 0], 
    'CRDACCT_DLQ_CYC_7_MNTH_AGO' : [3, 3, 2, 'C', 'C', 'C', 'F', 0, 6, 5, 4, 3, 2, 2, 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C', 'C'], 
    'CRDACCT_DLQ_CYC_8_MNTH_AGO' : [5, 4, 4, 3, 3, 2, 3, 2, 2, 2, 1, 2, 0, 2, 'C', 'C', 0, 2, 2, 2, 'C', 'C', 0, 'Z'], 
    'CRDACCT_DLQ_CYC_9_MNTH_AGO' : [2, 2, 'C', 0, 2, 0, 2, 'C', 'C', 'C', 'C', 'C', 0, 3, 2, 'C', 'F', 'C', 'F', 'F', 'F', 'F', 'F', 'F'], 
    'CRDACCT_DLQ_CYC_10_MNTH_AGO' : [5, 4, 3, 2, 3, 2, 0, 2, 0, 2, 'C', 'C', 'F', 2, 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'F', 'C'], 
    'CRDACCT_DLQ_CYC_11_MNTH_AGO' : [4, 3, 2, 'F', 2, 0, 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z', 'Z'], 
    'CRDACCT_DLQ_CYC_12_MNTH_AGO' : ['F', 8, 7, 6, 5, 4, 3, 2, 'C', 'C', 'C', 0, 2, 'C', 'C', 0, 2, 0, 3, 2, 'C', 'C', 'F', 2]
})

df.head()

I want to convert those values (string value: C, F, and Z) into some categories with this condition: if values in column CRDACCT_DLQ_CYC_1_MNTH_AGO, CRDACCT_DLQ_CYC_2_MNTH_AGO, ......., CRDACCT_DLQ_CYC_12_MNTH_AGO consist:

C = 0
F = 0
Z = 0
else value  = value 

#Convert value
df = df.replace({'C': 0, 'F': 0, 'Z': 0,' ':0}).astype(int)

Then, I want to create a new column with the name of MSD. MSD stands for Month Since Delinquent. MSD is calculated by identifying each of 12 columns CRDACCT_DLQ_CYC_1_MNTH_AGO, CRDACCT_DLQ_CYC_2_MNTH_AGO, .......up until CRDACCT_DLQ_CYC_12_MNTH_AGO with this kind of condition:

If value in CRDACCT_DLQ_CYC_1_MNTH_AGO > 1 then MSD = 1, otherwise MSD=0 or
If value in CRDACCT_DLQ_CYC_2_MNTH_AGO > 1 then MSD = 2, otherwise MSD=0 or
If value in CRDACCT_DLQ_CYC_3_MNTH_AGO > 1 then MSD = 3, otherwise MSD=0 or
If value in CRDACCT_DLQ_CYC_4_MNTH_AGO > 1 then MSD = 4, otherwise MSD=0 or
If value in CRDACCT_DLQ_CYC_5_MNTH_AGO > 1 then MSD = 5, otherwise MSD=0 or
If value in CRDACCT_DLQ_CYC_6_MNTH_AGO > 1 then MSD = 6, otherwise MSD=0 or
If value in CRDACCT_DLQ_CYC_7_MNTH_AGO > 1 then MSD = 7, otherwise MSD=0 or
If value in CRDACCT_DLQ_CYC_8_MNTH_AGO > 1 then MSD = 8, otherwise MSD=0 or
If value in CRDACCT_DLQ_CYC_9_MNTH_AGO > 1 then MSD = 9, otherwise MSD=0 or
If value in CRDACCT_DLQ_CYC_10_MNTH_AGO > 1 then MSD = 10, otherwise MSD=0 or
If value in CRDACCT_DLQ_CYC_11_MNTH_AGO > 1 then MSD = 11, otherwise MSD=0 or
If value in CRDACCT_DLQ_CYC_12_MNTH_AGO > 1 then MSD = 12, otherwise MSD=0
Note: otherwise if value 1 and 0, then MSD = 0.

For example:

index 0, MSD =1,because value 3 > 1 is in CRDACCT_DLQ_CYC_1_MNTH_AGO (we no need to check CRDACCT_DLQ_CYC_2_MNTH_AGO > 1 because we have found month since delinquent in CRDACCT_DLQ_CYC_1_MNTH_AGO) , hence MSD is in 1 MNTH AGO
index 1, MSD=1,
index 2, MSD=2,
index 3, MSD=2, because value 3 > 1 is in CRDACCT_DLQ_CYC_2_MNTH_AGO, hence MSD is in 2 MNTH AGO
index 4, MSD=2

Note: by checking each 12 columns with those conditions, If all values = 0 in each column CRDACCT_DLQ_CYC_1_MNTH_AGO, .....and CRDACCT_DLQ_CYC_12_MNTH_AGO, then MSD should be = 0.

Generally it is to check value > 1 in each 12 columns then determine the MSD value based on column name CRDACCT_DLQ_CYC_x_MNTH_AGO, x will be the value of MSD if > 1.

CodePudding user response：

It ain't pretty but this one-liner should do the trick ;)

df['MSD'] = (df > 1).astype(int).apply(lambda row: int(row.idxmax().split('_')[3]) if row.sum() >=1 else 0, axis=1)

basically - check which values are over 1, get the first column for each row which is above one (the MSD as you defined it), and don't forget to check the edge case when it is 0.

CodePudding user response：

I have tried to understand what problem/s are you having, and from what I have understood, it seems you want to achieve two results:

You want to replace the values: "C", "F", and "Z" in the DataFrame whenever they exist with an integer value of 0, otherwise leave the value alone, this can be done with the following:

df.replace(to_replace=["C", "F", "Z"], value=0, inplace=True)
# setting the parameter 'inplace' to the value True to apply the transformation to the requested DataFrame: 'df'

After this, you want to have a new column of the type integer, labeled "MSD", and for the values it should hold you want it to be the following:
for each row in the DataFrame, look for the value under each column in order, and if encountered finding any value greater than 1, stop looking the remaining columns' values, and for the column at which this was encountered, you extract the numerical value from the column's label and assign it to the column "MSD" in the same row. This can be done as follows:

def numberInColumnLabel(columnLabel):
    """
    function that extracts numerical value from given string in the format: CRDACCT_DLQ_CYC_[x]_MNTH_AGO , where [x] is the numerical value, and returns it.
    """
    phraseBeforeNumber = "CYC_"
    phraseAfterNumber = "_MNTH"
    
    numberStartingIndex = columnLabel.find(phraseBeforeNumber)   len(phraseBeforeNumber)
    numberEndingIndex = columnLabel.find(phraseAfterNumber)
    
    number = int( columnLabel[numberStartingIndex:numberEndingIndex] )
    return number

df["MSD"] = 0   # initialize column 'MSD' with a default value of 0

for rowIndex in range(0, df.shape[0]): #iterate through each row's index
    for columnLabel in df.columns: #iterate through each column label for that row
        if(int(df[columnLabel][rowIndex]) > 1):
            df.loc[rowIndex, "MSD"] = numberInColumnLabel(columnLabel)
            break