Subsetting strings present in a column of a dataframe, depending on value of another column

I have a dataframe having 2 columns A and B say, containing strings and integers respectively. For example, consider the following data.

df = pd.DataFrame({'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"], 'B':[3, 2, 6]})

Now, I have to create another column C, where for each index i, df['C'][i] will contain the string s, where s is the string staring from the df['B'][i]-th character of the string df['A'][i]. For the above example the output will be:

            A         B                C
xxxdddrrrfvhdddfff    3    xdddrrrfvhdddfff
trdyuuweewy           2    rdyuuweewy 
oooeereghtyuj         6    reghtyuj

This can be done using lambdas or for loops very easily.

My attempt:

df['C']=df.apply(lambda x: x.A[x['B']:], axis=1)

But my dataset is huge in size (contains around 5 million rows) - so using loops or lambdas are not efficient at all. How can I do this efficiently without using lambdas or loops? Any suggestion is highly appreciated. Thank you.

CodePudding user response：

You can avoid using pandas apply and make it more efficient using native python. Kindly try the following:

df['C'] = [x[y-1:] for x,y in zip(df['A'],df['B'])]

I tested using 30000 rows and 1000 iterations:

df = pd.DataFrame({'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"]*1000, 'B':[3, 2, 6]*1000})
times_zip = []
times_apply = []

for i in range(1000):
    start = time.time()
    df['C'] = [x[y-1:] for x,y in zip(df['A'],df['B'])]
    end = time.time()
    times_zip.append(end-start)
    
for i in range(1000):
    start = time.time()
    df['C']=df.apply(lambda x: x.A[x['B']:], axis=1)
    end = time.time()
    times_apply.append(end-start)

The average time per execution using apply is:

0.035329506397247315

Whereas the average time using zip was:

0.0006626224517822265