I have a dataframe which contains post and comments. Every comment has an id and a parent id (identifies the comment or post which the comment was a response to). Posts only have an id, since they don't answer to anything.
submission | id | parent id |
---|---|---|
post1 | 1 | |
comment1 | 2 | 1 |
comment2 | 3 | 1 |
comment3 | 4 | 2 |
comment4 | 5 | 4 |
post2 | 6 | |
comment5 | 7 | 6 |
I would like to retrieve the id of the original post for every comment and obtain something like that:
submission | id | parent id | ancestor id |
---|---|---|---|
post1 | 1 | ||
comment1 | 2 | 1 | 1 |
comment2 | 3 | 1 | 1 |
comment3 | 4 | 2 | 1 |
comment4 | 5 | 4 | 1 |
post2 | 6 | ||
comment5 | 7 | 6 | 2 |
to do so I tried to loop from the end of the dataframe to the beginning, iteratively tracing back the parent_id of the parent_id until I found an empty parent_id cell. On the test dataframe it works, but on the main one is too slow. Is there a way to make it more efficient?
Here my original code:
#creating a column for the id of the original post
df["ancestor"] = df.id
#obtaining the id of the original post for every comment
for i in reversed(range(len(df.id))): #looping trough the comments
id = df["parent_id"][i] #variable to initialize the future loop
parent = id
while parent != "": #only looping trough comments
df.ancestor[i] = id
parent = df.parent_id[df.id == id].values[0]
id = parent
CodePudding user response:
I'm not sure if this speeds things up, but using the NetworkX library might be worth a try:
import networkx as nx
G = nx.from_pandas_edgelist(
df[df["parent_id"].ne("")], source="parent_id", target="id"
)
roots = set(df.loc[df["parent_id"].eq(""), "id"])
mapping = {}
for comp in nx.connected_components(G):
root = (roots & comp).pop()
roots.discard(root)
mapping.update(dict.fromkeys(comp - {root}, root))
df["ancestor_id"] = df["id"].map(mapping)
- First read the
parent_id
-id
combos as edges in a graphG
. - Identify the root nodes
roots
of the trees (G
should be a forrest, from what I understand). - Then build a
mapping
nodes -> root (via the components ofG
), and apply it to the columnid
.
Result for the example:
submission id parent_id ancestor_id
0 post1 1 NaN
1 comment1 2 1 1
2 comment2 3 1 1
3 comment3 4 2 1
4 comment4 5 4 1
5 post2 6 NaN
6 comment5 7 6 6
CodePudding user response:
You can create a recursive function to target the root id:
def ancestor(row):
if df.loc[df['id'] == row['id']]['parent id'].values:
new_dict = {k: [x for x in (vv for _, vv in v.items())][0]
for k, v in df.loc[df['id'] == row['parent id']].to_dict().items()}
return ancestor(pd.Series(new_dict))
return row['id']
df['ancestor'] = df.apply(lambda x: ancestor(x) if x['parent id'] else "", axis=1)
OUTPUT
submission id parent id ancestor
0 post1 1
1 comment1 2 1.0 1
2 comment2 3 1.0 1
3 comment3 4 2.0 1
4 comment4 5 4.0 1
5 post2 6
6 comment5 7 6.0 6