Home > Mobile >  iteratively retrieving information contained in different rows in pandas dataframe
iteratively retrieving information contained in different rows in pandas dataframe

Time:08-15

I have a dataframe which contains post and comments. Every comment has an id and a parent id (identifies the comment or post which the comment was a response to). Posts only have an id, since they don't answer to anything.

submission id parent id
post1 1
comment1 2 1
comment2 3 1
comment3 4 2
comment4 5 4
post2 6
comment5 7 6

I would like to retrieve the id of the original post for every comment and obtain something like that:

submission id parent id ancestor id
post1 1
comment1 2 1 1
comment2 3 1 1
comment3 4 2 1
comment4 5 4 1
post2 6
comment5 7 6 2

to do so I tried to loop from the end of the dataframe to the beginning, iteratively tracing back the parent_id of the parent_id until I found an empty parent_id cell. On the test dataframe it works, but on the main one is too slow. Is there a way to make it more efficient?

Here my original code:

#creating a column for the id of the original post
df["ancestor"] = df.id
#obtaining the id of the original post for every comment
for i in reversed(range(len(df.id))): #looping trough the comments
  id = df["parent_id"][i] #variable to initialize the future loop
  parent = id
  while parent != "": #only looping trough comments
    df.ancestor[i] = id
    parent = df.parent_id[df.id == id].values[0]
    id = parent

CodePudding user response:

I'm not sure if this speeds things up, but using the NetworkX library might be worth a try:

import networkx as nx

G = nx.from_pandas_edgelist(
    df[df["parent_id"].ne("")], source="parent_id", target="id"
)

roots = set(df.loc[df["parent_id"].eq(""), "id"])
mapping = {}
for comp in nx.connected_components(G):
    root = (roots & comp).pop()
    roots.discard(root)
    mapping.update(dict.fromkeys(comp - {root}, root))

df["ancestor_id"] = df["id"].map(mapping)
  • First read the parent_id-id combos as edges in a graph G.
  • Identify the root nodes roots of the trees (G should be a forrest, from what I understand).
  • Then build a mapping nodes -> root (via the components of G), and apply it to the column id.

Result for the example:

  submission id parent_id ancestor_id
0      post1  1                   NaN
1   comment1  2         1           1
2   comment2  3         1           1
3   comment3  4         2           1
4   comment4  5         4           1
5      post2  6                   NaN
6   comment5  7         6           6

CodePudding user response:

You can create a recursive function to target the root id:

def ancestor(row):
    if df.loc[df['id'] == row['id']]['parent id'].values:
        new_dict = {k: [x for x in (vv for _, vv in v.items())][0]
                    for k, v in df.loc[df['id'] == row['parent id']].to_dict().items()}
        return ancestor(pd.Series(new_dict))
    return row['id']


df['ancestor'] = df.apply(lambda x: ancestor(x) if x['parent id'] else "", axis=1)

OUTPUT

  submission  id parent id ancestor
0      post1   1
1   comment1   2       1.0        1
2   comment2   3       1.0        1
3   comment3   4       2.0        1
4   comment4   5       4.0        1
5      post2   6
6   comment5   7       6.0        6
  • Related