I am currently working with a very large dataset (more than 70 million rows, 10 columns). The processing involves gap fills, forward fills, reindexing, etc. But the step that takes the most time (over 50% of the run time) is the simple one of replacing a column with the values of two columns combined as strings. Example code would be:
df["id_date"] = df["id"].astype(str) + "_" + df["date"].astype(str)
Is there a way to improve the speed of this step? I am surprised it takes so much longer than what I thought would be more complex steps.
CodePudding user response:
Take a look at Series.str.cat:
df["id_date"] = df["id"].astype(str).str.cat(df["date"].astype(str), sep="_")
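Here is a minimal sketch on a toy frame to check that str.cat gives the same result as the + version. The sample data is made up, and I am assuming id is an integer column and date holds strings. The .str accessor only works on string data, so both columns are converted first. Whether str.cat is actually faster on your 70 million rows is something you would need to time yourself.

```python
import pandas as pd

# Hypothetical sample data standing in for the real dataset.
df = pd.DataFrame({
    "id": [1, 2, 3],
    "date": ["2021-01-01", "2021-01-02", "2021-01-03"],
})

# Convert once and reuse; .str.cat requires string inputs on both sides.
id_str = df["id"].astype(str)
date_str = df["date"].astype(str)

# Original approach: element-wise concatenation with +.
plus_result = id_str + "_" + date_str

# Alternative: Series.str.cat with a separator.
cat_result = id_str.str.cat(date_str, sep="_")

# Both approaches should produce identical values.
assert plus_result.equals(cat_result)

df["id_date"] = cat_result
```

If date is a datetime column in your data, the astype(str) conversion itself may be a large part of the cost, so it is worth timing that part separately.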
That being said, as with any redundant information, you are likely better off just not having this column, or at least only creating the data on demand instead of up front.
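As a rough illustration of skipping the combined column, here is a sketch that assumes the column is only needed as a grouping key or for a small slice of the data. The value column and the make_id_date helper are made up for the example and are not from your code.

```python
import pandas as pd

# Hypothetical sample data; "value" is an invented column for illustration.
df = pd.DataFrame({
    "id": [1, 1, 2, 2],
    "date": ["2021-01-01", "2021-01-01", "2021-01-01", "2021-01-02"],
    "value": [10, 20, 30, 40],
})

# Option 1: group on the two columns directly, no combined string needed.
totals = df.groupby(["id", "date"])["value"].sum()


def make_id_date(frame):
    """Hypothetical helper: build the combined key only for the rows passed in."""
    return frame["id"].astype(str) + "_" + frame["date"].astype(str)


# Option 2: build the key on demand for just the rows that need it.
subset = df[df["id"] == 2]
subset_keys = make_id_date(subset)
```

Most pandas operations that take a key (groupby, merge, set_index, drop_duplicates) accept a list of columns, so the string version is often only needed for output.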