iteratively retrieving information contained in different rows in pandas dataframe-CodePudding

I have a dataframe which contains post and comments. Every comment has an id and a parent id (identifies the comment or post which the comment was a response to). Posts only have an id, since they don't answer to anything.

submission	id	parent id
post1	1
comment1	2	1
comment2	3	1
comment3	4	2
comment4	5	4
post2	6
comment5	7	6

I would like to retrieve the id of the original post for every comment and obtain something like that:

submission	id	parent id	ancestor id
post1	1
comment1	2	1	1
comment2	3	1	1
comment3	4	2	1
comment4	5	4	1
post2	6
comment5	7	6	2

to do so I tried to loop from the end of the dataframe to the beginning, iteratively tracing back the parent_id of the parent_id until I found an empty parent_id cell. On the test dataframe it works, but on the main one is too slow. Is there a way to make it more efficient?

Here my original code:

#creating a column for the id of the original post
df["ancestor"] = df.id
#obtaining the id of the original post for every comment
for i in reversed(range(len(df.id))): #looping trough the comments
  id = df["parent_id"][i] #variable to initialize the future loop
  parent = id
  while parent != "": #only looping trough comments
    df.ancestor[i] = id
    parent = df.parent_id[df.id == id].values[0]
    id = parent

CodePudding user response：

I'm not sure if this speeds things up, but using the NetworkX library might be worth a try:

import networkx as nx

G = nx.from_pandas_edgelist(
    df[df["parent_id"].ne("")], source="parent_id", target="id"
)

roots = set(df.loc[df["parent_id"].eq(""), "id"])
mapping = {}
for comp in nx.connected_components(G):
    root = (roots & comp).pop()
    roots.discard(root)
    mapping.update(dict.fromkeys(comp - {root}, root))

df["ancestor_id"] = df["id"].map(mapping)

First read the parent_id-id combos as edges in a graph G.
Identify the root nodes roots of the trees (G should be a forrest, from what I understand).
Then build a mapping nodes -> root (via the components of G), and apply it to the column id.

Result for the example:

  submission id parent_id ancestor_id
0      post1  1                   NaN
1   comment1  2         1           1
2   comment2  3         1           1
3   comment3  4         2           1
4   comment4  5         4           1
5      post2  6                   NaN
6   comment5  7         6           6

CodePudding user response：

You can create a recursive function to target the root id:

def ancestor(row):
    if df.loc[df['id'] == row['id']]['parent id'].values:
        new_dict = {k: [x for x in (vv for _, vv in v.items())][0]
                    for k, v in df.loc[df['id'] == row['parent id']].to_dict().items()}
        return ancestor(pd.Series(new_dict))
    return row['id']


df['ancestor'] = df.apply(lambda x: ancestor(x) if x['parent id'] else "", axis=1)

OUTPUT

  submission  id parent id ancestor
0      post1   1
1   comment1   2       1.0        1
2   comment2   3       1.0        1
3   comment3   4       2.0        1
4   comment4   5       4.0        1
5      post2   6
6   comment5   7       6.0        6