Row wise operation on Pandas dataframe-CodePudding

I want to iterate through the data frame and assign value in new column "Minor-1" where each value satisfy certain conditions.

INPUT DATA :

    import pandas as pd
    data = {'Major' : ['A112','A112','B113','B113','B113','C114','C114','C114','C114'],'Minor' : ['X','Y','X','Y','Z','X','Y','Y','Z']}
    df=pd.DataFrame(data)

DESIRED OUTPUT:

    {'Major': {0: 'A112', 1: 'A112', 2: 'B113', 3: 'B113', 4: 'B113', 5: 'C114', 6: 'C114', 7: 'C114', 8: 'C114'},'Minor': {0: 'X', 1: 'Y', 2: 'X', 3: 'Y', 4: 'Z', 5: 'X', 6: 'Y', 7: 'Y', 8: 'Z'},'Minor-1': {0: 'Start', 1: 'X', 2: 'Start', 3: 'X', 4: 'Y', 5: 'Start', 6: 'X', 7: 'X', 8: 'Y'}}

Input Data Looks like : Input , Desired output looks like : Output

Code I am using currently:

    df["Minor-1"]="Start"
    for i in range(1,len(df)):
        if df['Major'][i]==df.loc[i-1,'Major']:
            if df['Minor'][i]!=df.loc[i-1,'Minor'] :
                df.loc[i,"Minor-1"]=df['Minor'][i-1]
            else:
                df.loc[i,"Minor-1"]=df.loc[i-1,"Minor-1"]

Currently I'm doing it using for loop which I know is not time efficient at all. Which is the fastest way to perform this action? I checked iterrows and np.select but couldn't figure out how to get desired output with those methods.

CodePudding user response：

Here's another suggestion:

df["Minor-1"] = df["Minor"].shift()
df.loc[df["Minor"] == df["Minor"].shift(), "Minor-1"] = None
df.loc[df["Major"] != df["Major"].shift(), "Minor-1"] = "Start"
df["Minor-1"] = df["Minor-1"].ffill()

The shifted column Minor is used as base for the new column Minor-1. The new column is then adjusted by:

The positions where a value in Minor is the same as the prior one are set to None: preparation for the .ffill() application.
The positions in which in column Major the values switch are populated with "Start".
Finally .ffill() fills the gaps.

I've done some performance measurement with a bigger dataframe:

from timeit import timeit

def test_1(df):
    df["Minor-1"] = "Start"
    for i in range(1, len(df)):
        if df['Major'][i]==df.loc[i-1,'Major']:
            if df['Minor'][i]!=df.loc[i-1,'Minor'] :
                df.loc[i,"Minor-1"]=df['Minor'][i-1]
            else:
                df.loc[i,"Minor-1"]=df.loc[i-1,"Minor-1"]    

def test_2(df):
    df["Minor-1"] = df["Minor"].shift()
    df.loc[df["Minor"] == df["Minor"].shift(), "Minor-1"] = None
    df.loc[df["Major"] != df["Major"].shift(), "Minor-1"] = "Start"
    df["Minor-1"] = df["Minor-1"].ffill()

data = {
    'Major' : ['A112', 'A112', 'B113', 'B113','B113', 'C114', 'C114', 'C114', 'C114'],
    'Minor' : ['X', 'Y', 'X', 'Y', 'Z', 'X', 'Y', 'Y', 'Z']
}
df = pd.DataFrame(df)
df = pd.concat(df for _ in range(100)).reset_index(drop=True)

t1 = timeit("test_1(df)", globals=globals(), number=100)
t2 = timeit("test_2(df)", globals=globals(), number=100)

print(t1, t2)

Result: t1 = 37.15364322599635 and t2 = 0.5278295389944105 which seems quite an improvement to me.