I have a dataframe column that I am trying to iterate through using a for loop.
In my loop I use the dataframe index to drive the iteration, but the code runs slowly and raises a KeyError when it reaches the last index. I am trying to loop through the column final_df["MAN"] and compare the value at the current index with the previous one: if they are equal, put 0 in the new column final_df["MAN_ID"], and 1 if they are not. Any optimization of my code would be appreciated.
%%time
# Sort by MAN column
final_df.sort_values(by=['MAN'])
# Loop through the column
final_df["MAN_ID"] = ""
try:
    for i in final_df.index:
        if final_df["MAN"][i + 1] == final_df["MAN"][i]:
            final_df["MAN_ID"][i] = 0
        elif final_df["MAN"][i + 1] != final_df["MAN"][i]:
            final_df["MAN_ID"][i] = 1
except:
    print("No value to loop")
CodePudding user response:
Do not iterate; this is slow.
Use vectorized operations instead: shift to shift the values by one row, eq to perform the comparison, and numpy.where to assign 0/1 depending on the equality.
import pandas as pd
import numpy as np
# dummy example
df = pd.DataFrame({'MAN': list('AABBABBBAA')})
df['MAN_ID'] = np.where(df['MAN'].eq(df['MAN'].shift(-1)), 0, 1)
output:
  MAN  MAN_ID
0   A       0
1   A       1
2   B       0
3   B       1
4   A       1
5   B       0
6   B       0
7   B       1
8   A       0
9   A       1
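To see what each step contributes, the intermediate values can be inspected on the same dummy data. This is only a sketch: the variable names nxt, same, man_id and steps, and the column labels next and same_as_next, are made up for illustration. The last row has no following row, so shift(-1) yields a missing value there, the comparison is False, and the row gets 1 instead of raising a KeyError.

```python
import pandas as pd
import numpy as np

# same dummy data as above
df = pd.DataFrame({'MAN': list('AABBABBBAA')})

# value of the following row; the last row has no successor and gets NaN
nxt = df['MAN'].shift(-1)

# element-wise equality; comparing with the missing value yields False
same = df['MAN'].eq(nxt)

# 0 where the value equals the next row, 1 otherwise
man_id = np.where(same, 0, 1)

# side-by-side view of every step (illustrative column names)
steps = pd.DataFrame({
    'MAN': df['MAN'],
    'next': nxt,
    'same_as_next': same,
    'MAN_ID': man_id,
})
print(steps)
```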
Alternatively, you can use:
(df['MAN'].ne(df['MAN'].shift(-1))).astype(int)
This gives True if the values are different and False if they are identical (using the ne method); converting to int then turns True into 1 and False into 0.
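Applied to the workflow in the question, a sketch could look like the following. The contents of final_df are invented here, and the column MAN_ID_PREV is a hypothetical name. Note that sort_values returns a new DataFrame, so the result has to be assigned back (the question's code discards it). If the intent is to compare with the previous row, as the question's text describes, shift() can be used in place of shift(-1).

```python
import pandas as pd

# hypothetical stand-in for the question's final_df
final_df = pd.DataFrame({'MAN': ['B', 'A', 'C', 'A', 'B', 'B']})

# sort_values returns a new DataFrame; assign it back and rebuild the index
final_df = final_df.sort_values(by='MAN').reset_index(drop=True)

# 1 where the next row differs (or does not exist), 0 where it is equal
final_df['MAN_ID'] = final_df['MAN'].ne(final_df['MAN'].shift(-1)).astype(int)

# variant: compare with the previous row instead of the next one
final_df['MAN_ID_PREV'] = final_df['MAN'].ne(final_df['MAN'].shift()).astype(int)

print(final_df)
```

In the variant, the first row of each run of equal values is flagged with 1, whereas shift(-1) flags the last row of each run.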