Iterating over results of .itertuples() - why is this slow?

Time:05-23

df -> ["user_id", "num_posts", "posts" ...]

My df is made of rows containing data for Reddit user accounts, where "posts" in each row contains a series of separate posts by that user.

The number of posts ranges up to 6000 for certain users.

import pandas as pd

data = pd.DataFrame(columns=["user_id", "posts"])
for row in df.itertuples():
    # row[0] is the index, row[1] is "user_id", row[3] is "posts"
    for post in row[3]:
        new_row = [row[1], post]
        data.loc[len(data)] = new_row

It seems the inner for loop, which iterates over the results from itertuples, makes this terribly slow!

Even if I cap the number of posts grabbed for a single user at 100, the code doesn't return for hours, even when running on a high-powered remote server!

Any thoughts on how to improve the runtime?

CodePudding user response:

I've tested your code versus the concat method with a list comprehension, and the list comprehension version ran 12 times faster:

data = pd.concat([pd.DataFrame([[row[1], post] for post in row[3]], columns=["user_id", "posts"])
                  for row in df.itertuples()], ignore_index=True)
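
Two other variants may be worth timing as well. These are not the code benchmarked above, just a minimal sketch. It assumes each "posts" cell holds a plain Python list. The sample df and the MAX_POSTS cap are made up for illustration. The first variant collects plain tuples and builds the DataFrame once. The second uses DataFrame.explode, which turns each list element into its own row.

```python
import pandas as pd

# Hypothetical sample data standing in for the real df (values are made up).
df = pd.DataFrame({
    "user_id": ["u1", "u2", "u3"],
    "num_posts": [2, 1, 0],
    "posts": [["first post", "second post"], ["only post"], []],
})

MAX_POSTS = 100  # assumed per-user cap, as mentioned in the question

# Variant 1: gather (user_id, post) tuples in a list, build the frame once.
records = [
    (row.user_id, post)
    for row in df.itertuples(index=False)
    for post in row.posts[:MAX_POSTS]
]
data_from_list = pd.DataFrame(records, columns=["user_id", "posts"])

# Variant 2: cap each list, then let pandas expand the lists into rows.
capped = df[["user_id"]].assign(posts=df["posts"].map(lambda p: p[:MAX_POSTS]))
data_from_explode = (
    capped.explode("posts")
    .dropna(subset=["posts"])  # an empty list explodes to a single NaN row
    .reset_index(drop=True)
)
```

Both variants avoid growing the result one row at a time with data.loc[len(data)], which is likely where most of the time goes in the original loop. Note that the dropna step would also drop posts that are genuinely missing (None or NaN), so skip it if those need to be kept.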