Home > Back-end >  Subsetting strings present in a column of a dataframe, depending on value of another column - Pandas
Subsetting strings present in a column of a dataframe, depending on value of another column - Pandas

Time:08-03

I have a dataframe having 2 columns A and B say, containing strings and integers respectively. For example, consider the following data.

df = pd.DataFrame({'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"], 'B':[3, 2, 6]})

Now, I have to create another column C, where for each index i, df['C'][i] will contain the string s, where s is the string staring from the df['B'][i]-th character of the string df['A'][i]. For the above example the output will be:

            A         B                C
xxxdddrrrfvhdddfff    3    xdddrrrfvhdddfff
trdyuuweewy           2    rdyuuweewy 
oooeereghtyuj         6    reghtyuj
  

This can be done using lambdas or for loops very easily.

My attempt:

df['C']=df.apply(lambda x: x.A[x['B']:], axis=1)

But my dataset is huge in size (contains around 5 million rows) - so using loops or lambdas are not efficient at all. How can I do this efficiently without using lambdas or loops? Any suggestion is highly appreciated. Thank you.

CodePudding user response:

You can avoid using pandas apply and make it more efficient using native python. Kindly try the following:

df['C'] = [x[y-1:] for x,y in zip(df['A'],df['B'])]

I tested using 30000 rows and 1000 iterations:

df = pd.DataFrame({'A': ["xxxdddrrrfvhdddfff", "trdyuuweewy", "oooeereghtyuj"]*1000, 'B':[3, 2, 6]*1000})
times_zip = []
times_apply = []

for i in range(1000):
    start = time.time()
    df['C'] = [x[y-1:] for x,y in zip(df['A'],df['B'])]
    end = time.time()
    times_zip.append(end-start)
    
for i in range(1000):
    start = time.time()
    df['C']=df.apply(lambda x: x.A[x['B']:], axis=1)
    end = time.time()
    times_apply.append(end-start)

The average time per execution using apply is:

0.035329506397247315

Whereas the average time using zip was:

0.0006626224517822265
  • Related