I need to iterate over a DataFrame and, for each row, create an ID based on two existing columns: name and sex. I then add the IDs to the df as a new column.
import pandas as pd
df = pd.read_csv(file, sep='\t', dtype=str, na_values="", low_memory=False)
row_ids = []
for index, row in df.iterrows():
    # Report progress every 1000 rows
    if (index % 1000) == 0:
        print("Row node index: {}".format(str(index)))
    calculated_id = get_id(row['name'], row['sex'])
    row_ids.append(calculated_id)
df['id'] = row_ids
Is there a way to make it much faster without going row by row?
Add more info based on suggested solutions:
CodePudding user response:
Use apply instead:
def func(x):
    # With axis=1, x is one row as a Series and x.name is its index label
    if (x.name % 1000) == 0:
        print("Row node index: {}".format(str(x.name)))
    calculated_id = get_id(x['name'], x['sex'])
    return calculated_id
df['id'] = df.apply(func, axis=1)
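Note that apply with axis=1 still calls the function once per row and builds a Series for each one. A minimal sketch of a lighter alternative is to zip the two columns in a list comprehension. The get_id below is a hypothetical stand-in, since the real function is not shown in the question, and the small DataFrame is made up for illustration.

```python
import pandas as pd

def get_id(name, sex):
    # Hypothetical stand-in for the real get_id, which the question does not show.
    return f"{name}_{sex}"

# Made-up sample data in place of the CSV from the question.
df = pd.DataFrame({"name": ["alice", "bob", "carol"], "sex": ["F", "M", "F"]})

# Iterate over the two columns directly; no per-row Series is created.
df["id"] = [get_id(name, sex) for name, sex in zip(df["name"], df["sex"])]
```

This is still a Python-level loop, so how much it helps depends on how expensive get_id itself is.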
CodePudding user response:
If you are working on a large dataset, np.vectorize() should help bypass the apply() overhead, which should make it a bit faster.
import numpy as np
# The vectorized function receives one name and one sex per call
v = np.vectorize(lambda name, sex: get_id(name, sex))
df['id'] = v(df['name'], df['sex'])
Edit:
To get even more of a speedup, you could pass the function get_id directly instead of wrapping it in a lambda, and pass df.*.values instead of df.*.
v = np.vectorize(get_id)
df['id'] = v(df['name'].values, df['sex'].values)
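For reference, here is a self-contained sketch of the same call. The get_id and the sample data are hypothetical stand-ins, not the asker's real function or file. Keep in mind that the NumPy documentation describes np.vectorize as a convenience whose implementation is essentially a for loop, so the gain comes mostly from avoiding the per-row overhead of apply. If get_id really were simple string concatenation (an assumption), plain column arithmetic would avoid the explicit loop altogether.

```python
import numpy as np
import pandas as pd

def get_id(name, sex):
    # Hypothetical stand-in for the real get_id.
    return f"{name}_{sex}"

# Made-up sample data in place of the CSV from the question.
df = pd.DataFrame({"name": ["alice", "bob", "carol"], "sex": ["F", "M", "F"]})

# otypes=[object] keeps the results as Python strings.
v = np.vectorize(get_id, otypes=[object])
df["id"] = v(df["name"].values, df["sex"].values)

# Only valid under the assumption that get_id is plain concatenation:
df["id_concat"] = df["name"] + "_" + df["sex"]
```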
Instead of printing updates about the progress, try using tqdm to show it with a progress bar.
import numpy as np
from tqdm import tqdm
@np.vectorize
def get_id(name, sex):
    global pbar
    ...
    pbar.update(1)  # advance the progress bar once per call
    ...
    return
with tqdm(total=len(df)) as pbar:
    df['id'] = get_id(df['name'].values, df['sex'].values)
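One thing to watch with the progress bar above: when otypes is not given, np.vectorize calls the function one extra time on the first element to infer the output type, so the bar can end up one step past len(df). The sketch below shows this with a plain counter in place of pbar.update(1). The get_id and the sample arrays are hypothetical.

```python
import numpy as np

calls = {"n": 0}

def get_id(name, sex):
    # Hypothetical stand-in; the counter plays the role of pbar.update(1).
    calls["n"] += 1
    return f"{name}_{sex}"

names = np.array(["alice", "bob", "carol"], dtype=object)
sexes = np.array(["F", "M", "F"], dtype=object)

# Without otypes: one extra call on the first element to infer the output type.
np.vectorize(get_id)(names, sexes)
calls_without_otypes = calls["n"]

calls["n"] = 0
# With otypes given, the function runs exactly once per element.
np.vectorize(get_id, otypes=[object])(names, sexes)
calls_with_otypes = calls["n"]
```

Passing otypes (or cache=True) to np.vectorize keeps the number of updates equal to the number of rows.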