How to optimize dataframe iteration in pandas?


I need to iterate over a dataframe and, for each row, create an ID based on two existing columns: name and sex. Finally I add this new column to the df.

df = pd.read_csv(file, sep='\t', dtype=str, na_values="", low_memory=False)

row_ids = []
for index, row in df.iterrows():
    if (index % 1000) == 0:
        print("Row node index: {}".format(index))

    calculated_id = get_id(row['name'], row['sex'])
    row_ids.append(calculated_id)

df['id'] = row_ids

Is there a way to make it much faster without going row by row?


CodePudding user response:

Use apply instead:

def func(x):
    # with axis=1, x.name is the row's index label
    if (x.name % 1000) == 0:
        print("Row node index: {}".format(x.name))

    calculated_id = get_id(x['name'], x['sex'])
    return calculated_id

df['id'] = df.apply(func, axis=1)
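As a concrete sketch (the real get_id isn't shown in the question, so the one below is a hypothetical stand-in that just joins the two fields):

```python
import pandas as pd

# Hypothetical stand-in for the real get_id, which isn't shown in the question.
def get_id(name, sex):
    return "{}-{}".format(name, sex)

df = pd.DataFrame({'name': ['ann', 'bob'], 'sex': ['F', 'M']})

def func(x):
    return get_id(x['name'], x['sex'])

# axis=1 passes each row to func as a Series
df['id'] = df.apply(func, axis=1)
print(df['id'].tolist())  # → ['ann-F', 'bob-M']
```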

CodePudding user response:

If you are working on a large dataset, np.vectorize() bypasses some of the apply() overhead, which should be a bit faster.

import numpy as np

v = np.vectorize(lambda name, sex: get_id(name, sex))
df['id'] = v(df['name'], df['sex'])

Edit:

To get even more of a speed-up, pass the function get_id directly instead of wrapping it in a lambda, and pass the underlying NumPy arrays (df[col].values) instead of the Series themselves.

v = np.vectorize(get_id)
df['id'] = v(df['name'].values, df['sex'].values)
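For example, with a hypothetical get_id standing in for the real one (assumed here to be a simple concatenation):

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the real get_id.
def get_id(name, sex):
    return name + "-" + sex

df = pd.DataFrame({'name': ['ann', 'bob'], 'sex': ['F', 'M']})

# np.vectorize calls get_id element-wise over the two arrays.
v = np.vectorize(get_id)
df['id'] = v(df['name'].values, df['sex'].values)
print(df['id'].tolist())  # → ['ann-F', 'bob-M']
```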

Instead of printing periodic progress updates, try using tqdm to show a progress bar.

import numpy as np 
from tqdm import tqdm

@np.vectorize
def get_id(name, sex):
    ...
    pbar.update(1)
    ...
    return 


with tqdm(total=len(df)) as pbar:
    df['id'] = get_id(df['name'].values, df['sex'].values)
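Putting the pieces together with the same hypothetical concatenation-based get_id (note that np.vectorize probes the first element once to infer the output dtype, so the bar can advance one extra step):

```python
import numpy as np
import pandas as pd
from tqdm import tqdm

df = pd.DataFrame({'name': ['ann', 'bob'], 'sex': ['F', 'M']})

@np.vectorize
def get_id(name, sex):
    pbar.update(1)           # advance the shared progress bar
    return name + "-" + sex  # hypothetical stand-in for the real ID logic

# pbar is created here and read as a global inside get_id
with tqdm(total=len(df)) as pbar:
    df['id'] = get_id(df['name'].values, df['sex'].values)

print(df['id'].tolist())  # → ['ann-F', 'bob-M']
```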