df -> ["user_id", "num_posts", "posts" ...]
My df is made of rows containing data for Reddit user accounts; in each row, "posts" contains a series of separate posts by that user.
The number of posts ranges up to 6000 for certain users.
data = pd.DataFrame(columns=["user_id", "posts"])
for row in df.itertuples():
    # row[0] is the index, row[1] is user_id, row[3] is the posts collection
    for post in row[3]:
        new_row = [row[1], post]
        data.loc[len(data)] = new_row
It seems the inner for-loop, which iterates over the results from itertuples, makes this terribly slow!
Even if I cap the number of posts grabbed for a single user at 100, the code doesn't return for hours, even when running on a high-powered remote server!
Any thoughts on how to improve the runtime?
CodePudding user response:
I tested your code against a 'concat' method with a list comprehension, and the list comprehension version was 12 times faster:
data = pd.concat([pd.DataFrame([[row[1], post] for post in row[3]], columns=["user_id", "posts"])
                  for row in df.itertuples()], ignore_index=True)
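If each cell in "posts" is list-like (a list, tuple, or Series), which is an assumption about your data, you could also try DataFrame.explode and skip the Python-level loop entirely. Growing a DataFrame one row at a time with data.loc[len(data)] = ... is likely the main cost in the original code, more than itertuples itself. Below is a minimal sketch; the sample frame and the flatten_posts helper (including its max_posts parameter) are made up for illustration and are not from the question:

```python
import pandas as pd

# Hypothetical sample data mirroring the layout described in the question.
df = pd.DataFrame({
    "user_id": ["a", "b"],
    "num_posts": [2, 3],
    "posts": [["p1", "p2"], ["p3", "p4", "p5"]],
})


def flatten_posts(frame, max_posts=None):
    """Return one row per (user_id, post) pair.

    Assumes every cell of frame["posts"] is list-like.
    max_posts optionally caps how many posts are kept per user.
    """
    posts = frame["posts"]
    if max_posts is not None:
        # Truncate each user's collection before expanding it.
        posts = posts.map(lambda p: list(p)[:max_posts])
    out = frame[["user_id"]].assign(posts=posts)
    # explode repeats user_id once for every element of the posts cell.
    return out.explode("posts").reset_index(drop=True)


data = flatten_posts(df)
capped = flatten_posts(df, max_posts=1)
```

Note that explode keeps a user with an empty posts list as a single row with NaN in "posts", so you may want to drop those rows afterwards with dropna(subset=["posts"]).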