I want to iterate through the data frame and assign value in new column "Minor-1" where each value satisfy certain conditions.
INPUT DATA :
import pandas as pd
data = {'Major' : ['A112','A112','B113','B113','B113','C114','C114','C114','C114'],'Minor' : ['X','Y','X','Y','Z','X','Y','Y','Z']}
df=pd.DataFrame(data)
DESIRED OUTPUT:
{'Major': {0: 'A112', 1: 'A112', 2: 'B113', 3: 'B113', 4: 'B113', 5: 'C114', 6: 'C114', 7: 'C114', 8: 'C114'},'Minor': {0: 'X', 1: 'Y', 2: 'X', 3: 'Y', 4: 'Z', 5: 'X', 6: 'Y', 7: 'Y', 8: 'Z'},'Minor-1': {0: 'Start', 1: 'X', 2: 'Start', 3: 'X', 4: 'Y', 5: 'Start', 6: 'X', 7: 'X', 8: 'Y'}}
Input Data Looks like : Input , Desired output looks like : Output
Code I am using currently:
df["Minor-1"]="Start"
for i in range(1,len(df)):
if df['Major'][i]==df.loc[i-1,'Major']:
if df['Minor'][i]!=df.loc[i-1,'Minor'] :
df.loc[i,"Minor-1"]=df['Minor'][i-1]
else:
df.loc[i,"Minor-1"]=df.loc[i-1,"Minor-1"]
Currently I'm doing it using for loop which I know is not time efficient at all. Which is the fastest way to perform this action? I checked iterrows and np.select but couldn't figure out how to get desired output with those methods.
CodePudding user response:
Here's another suggestion:
df["Minor-1"] = df["Minor"].shift()
df.loc[df["Minor"] == df["Minor"].shift(), "Minor-1"] = None
df.loc[df["Major"] != df["Major"].shift(), "Minor-1"] = "Start"
df["Minor-1"] = df["Minor-1"].ffill()
The shifted column Minor
is used as base for the new column Minor-1
. The new column is then adjusted by:
- The positions where a value in
Minor
is the same as the prior one are set toNone
: preparation for the.ffill()
application. - The positions in which in column
Major
the values switch are populated with"Start"
. - Finally
.ffill()
fills the gaps.
I've done some performance measurement with a bigger dataframe:
from timeit import timeit
def test_1(df):
df["Minor-1"] = "Start"
for i in range(1, len(df)):
if df['Major'][i]==df.loc[i-1,'Major']:
if df['Minor'][i]!=df.loc[i-1,'Minor'] :
df.loc[i,"Minor-1"]=df['Minor'][i-1]
else:
df.loc[i,"Minor-1"]=df.loc[i-1,"Minor-1"]
def test_2(df):
df["Minor-1"] = df["Minor"].shift()
df.loc[df["Minor"] == df["Minor"].shift(), "Minor-1"] = None
df.loc[df["Major"] != df["Major"].shift(), "Minor-1"] = "Start"
df["Minor-1"] = df["Minor-1"].ffill()
data = {
'Major' : ['A112', 'A112', 'B113', 'B113','B113', 'C114', 'C114', 'C114', 'C114'],
'Minor' : ['X', 'Y', 'X', 'Y', 'Z', 'X', 'Y', 'Y', 'Z']
}
df = pd.DataFrame(df)
df = pd.concat(df for _ in range(100)).reset_index(drop=True)
t1 = timeit("test_1(df)", globals=globals(), number=100)
t2 = timeit("test_2(df)", globals=globals(), number=100)
print(t1, t2)
Result: t1 = 37.15364322599635
and t2 = 0.5278295389944105
which seems quite an improvement to me.