I have a dataframe having 2 columns A and B say, containing strings and integers respectively. For example, consider the following data.
df = pd.DataFrame({'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"], 'B':[3, 2, 6]})
Now, I have to create another column C, where for each index i, df['C'][i]
will contain the string s, where s is the string staring from the df['B'][i]
-th character of the string df['A'][i]
. For the above example the output will be:
A B C
xxxdddrrrfvhdddfff 3 xdddrrrfvhdddfff
trdyuuweewy 2 rdyuuweewy
oooeereghtyuj 6 reghtyuj
This can be done using lambdas or for loops very easily.
My attempt:
df['C']=df.apply(lambda x: x.A[x['B']:], axis=1)
But my dataset is huge in size (contains around 5 million rows) - so using loops or lambdas are not efficient at all. How can I do this efficiently without using lambdas or loops? Any suggestion is highly appreciated. Thank you.
CodePudding user response:
You can avoid using pandas apply and make it more efficient using native python. Kindly try the following:
df['C'] = [x[y-1:] for x,y in zip(df['A'],df['B'])]
I tested using 30000 rows and 1000 iterations:
df = pd.DataFrame({'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"]*1000, 'B':[3, 2, 6]*1000})
times_zip = []
times_apply = []
for i in range(1000):
start = time.time()
df['C'] = [x[y-1:] for x,y in zip(df['A'],df['B'])]
end = time.time()
times_zip.append(end-start)
for i in range(1000):
start = time.time()
df['C']=df.apply(lambda x: x.A[x['B']:], axis=1)
end = time.time()
times_apply.append(end-start)
The average time per execution using apply is:
0.035329506397247315
Whereas the average time using zip was:
0.0006626224517822265