How to get threads/conversations from reply ids (Python)?

I'm a relative newbie with Python and I'm trying to reconstruct conversations/threads from a dataframe that contains a list of reply IDs for each post.

I currently have a pandas dataframe of tweets / reddit posts which have roughly the following format:

id	text	parent_id	replies
id1	blah blah	_ post _	id2, id3, id4, id5, id6, id7
id2	blah blah	id1	id4, id5, id6, id7
id3	blah blah	id1
id4	blah blah	id2	id6, id7
id5	blah blah	id2
id6	blah blah	id4	id7
id7	blah blah	id6

My goal is to separate the data into threads/conversations based on the ids. This would mean, from the above example, getting the following sequences as the output:

[id1, id2, id4, id6],
[id1, id2, id4, id7],
[id1, id2, id5], and
[id1, id3].

Having these lists would then enable me to look at threads in their entirety. Currently my code is very convoluted and looks something like this:

out_list = []
for i, row in df.iterrows():
    id_ = row["id"]
    # create our output file 
    sequence = [id_]
    replies = list(row['replies'])
    # creates a new dataframe from the replies to the topline comment in question
    reply_df= df.loc[df['id'].isin(replies)]
    reply_df = reply_df[reply_df.Parent_id2 == id_]
    #check if ends at topline
    if reply_df.empty == False:
        
        def turn_recursion(df, reply_df):
            for j, row_ in reply_df.iterrows():
                replies_2 = reply_df.loc[j, 'replies']
                id_2 = row_["id"]

                reply_df2 =  df.loc[df['id'].isin(replies_2)]
                reply_df2 = reply_df2[reply_df2.Parent_id2 == id_2]

                nonlocal sequence
                nonlocal out_list
                            
                if reply_df2.empty == False:
                    sequence.append(id_2)
                    return(turn_recursion(df, reply_df2))
                
                else:
                    sequence.append(id_2)
                    out_list.append(sequence)
        
        turn_recursion(test2, reply_df)
    else:
        out_list.append(sequence)

This is currently giving me semi-accurate results but instead of getting: [[id1, id2, id4, id6],[id1, id2, id4, id7]], I get: [id1, id2, id4, id6, id7].

I realise I'm probably being a bit dim and that there is a simple solution, but for the life of me, I can't seem to figure out a way of doing this so that it works properly and for any thread length.

Thank you in advance for any suggestions. :)

CodePudding user response:

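You can use networkx for this: treat every (parent_id, id) pair as a directed edge, then enumerate the paths from each post down to each leaf (a post with no replies):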

import pandas as pd
import networkx as nx
from collections import defaultdict

data = defaultdict(list)

# Build graph from pandas
G = nx.from_pandas_edgelist(df, source='parent_id', target='id', 
                            create_using=nx.DiGraph)

# Find leaves (id3, id5, id7)
leaves = [node for node, degree in G.out_degree() if degree == 0]

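# Enumerate all paths from each post down to each leaf
for node in df['id']:
    if node in leaves:
        # Skip leaves: they have no replies, and some networkx versions
        # would otherwise yield the trivial one-node path [leaf]
        continue
    for leaf in leaves:
        for path in nx.all_simple_paths(G, node, leaf):
            data[node].append(path)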

Output:

>>> data
defaultdict(list,
            {'id1': [['id1', 'id3'],
              ['id1', 'id2', 'id5'],
              ['id1', 'id2', 'id4', 'id6', 'id7']],
             'id2': [['id2', 'id5'], ['id2', 'id4', 'id6', 'id7']],
             'id4': [['id4', 'id6', 'id7']],
             'id6': [['id6', 'id7']]})
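If only the complete threads are needed, meaning the paths from a top-level post down to each leaf, a plain depth-first walk is enough and needs no extra library. The following is a minimal sketch and not part of the networkx approach above: `build_threads`, `walk` and `rows` are hypothetical names, and the (id, parent_id) pairs are copied from the sample table in the question, with `None` standing in for the top-level marker. Each branch gets its own copy of the path, so sibling replies never end up sharing one list.

```python
from collections import defaultdict

# (id, parent_id) pairs copied from the sample table in the question.
# None marks the top-level post (an assumption standing in for "_ post _").
rows = [
    ("id1", None),
    ("id2", "id1"),
    ("id3", "id1"),
    ("id4", "id2"),
    ("id5", "id2"),
    ("id6", "id4"),
    ("id7", "id6"),
]


def build_threads(rows):
    """Return every root-to-leaf path as a list of ids."""
    children = defaultdict(list)
    roots = []
    for id_, parent in rows:
        if parent is None:
            roots.append(id_)
        else:
            children[parent].append(id_)

    threads = []

    def walk(node, path):
        path = path + [node]  # new list per branch, nothing is shared
        kids = children.get(node, [])
        if not kids:
            threads.append(path)  # leaf reached: this thread is complete
            return
        for child in kids:
            walk(child, path)

    for root in roots:
        walk(root, [])
    return threads


threads = build_threads(rows)
print(threads)
```

With a dataframe, the pairs could be taken from `zip(df['id'], df['parent_id'])` after replacing the top-level marker with `None`. Because the walk is recursive, extremely deep threads could hit Python's recursion limit; an explicit stack would avoid that.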

If you want to merge the dictionary into your dataframe:

df['replies'] = df['id'].map(data)
print(df)

# Output:
    id       text parent_id                                            replies
0  id1  blah blah  _ post _  [[id1, id3], [id1, id2, id5], [id1, id2, id4, ...
1  id2  blah blah       id1                 [[id2, id5], [id2, id4, id6, id7]]
2  id3  blah blah       id1                                                 []
3  id4  blah blah       id2                                  [[id4, id6, id7]]
4  id5  blah blah       id2                                                 []
5  id6  blah blah       id4                                       [[id6, id7]]
6  id7  blah blah       id6                                                 []

Now you can explode your dataframe so that each path gets its own row:

df = df.explode('replies')
print(df)

# Output:
    id       text parent_id                    replies
0  id1  blah blah  _ post _                 [id1, id3]
0  id1  blah blah  _ post _            [id1, id2, id5]
0  id1  blah blah  _ post _  [id1, id2, id4, id6, id7]
1  id2  blah blah       id1                 [id2, id5]
1  id2  blah blah       id1       [id2, id4, id6, id7]
2  id3  blah blah       id1                        NaN
3  id4  blah blah       id2            [id4, id6, id7]
4  id5  blah blah       id2                        NaN
5  id6  blah blah       id4                 [id6, id7]
6  id7  blah blah       id6                        NaN
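To read a thread in its entirety, each path of ids can be mapped back to the text of its posts. This is a minimal sketch under stated assumptions: `thread_texts` is a hypothetical helper, `text_by_id` is an id-to-text lookup (with a dataframe it could be built as `dict(zip(df['id'], df['text']))`), and the texts are the placeholders from the question.

```python
# Hypothetical id -> text lookup using the placeholder text from the question.
text_by_id = {f"id{n}": "blah blah" for n in range(1, 8)}


def thread_texts(path, text_by_id):
    """Pair every id in a thread with its text, in conversation order."""
    return [(post_id, text_by_id[post_id]) for post_id in path]


conversation = thread_texts(["id1", "id2", "id5"], text_by_id)
for post_id, text in conversation:
    print(f"{post_id}: {text}")
```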