I'm relatively new to Python and I'm trying to reconstruct conversations/threads from a dataframe that contains lists of IDs.
I currently have a pandas dataframe of tweets / Reddit posts with roughly the following format:
id | text | parent_id | replies |
---|---|---|---|
id1 | blah blah | _ post _ | id2, id3, id4, id5, id6, id7 |
id2 | blah blah | id1 | id4, id5, id6, id7 |
id3 | blah blah | id1 | |
id4 | blah blah | id2 | id6, id7 |
id5 | blah blah | id2 | |
id6 | blah blah | id4 | id7 |
id7 | blah blah | id6 | |
My goal is to separate the data into threads/conversations based on the ids. This would mean, from the above example, getting the following sequences as the output:
[id1, id2, id4, id6],
[id1, id2, id4, id7],
[id1, id2, id5], &
[id1, id3].
Having these lists would then enable me to look at threads in their entirety. Currently my code is very convoluted and looks something like this:
out_list = []
for i, row in df.iterrows():
    id_ = row["id"]
    # create our output file
    sequence = [id_]
    replies = list(row['replies'])
    # creates a new dataframe from the replies to the topline comment in question
    reply_df = df.loc[df['id'].isin(replies)]
    reply_df = reply_df[reply_df.Parent_id2 == id_]
    # check if ends at topline
    if reply_df.empty == False:
        def turn_recursion(df, reply_df):
            for j, row_ in reply_df.iterrows():
                replies_2 = reply_df.loc[j, 'replies']
                id_2 = row_["id"]
                reply_df2 = df.loc[df['id'].isin(replies_2)]
                reply_df2 = reply_df2[reply_df2.Parent_id2 == id_2]
                nonlocal sequence
                nonlocal out_list
                if reply_df2.empty == False:
                    sequence.append(id_2)
                    return(turn_recursion(df, reply_df2))
                else:
                    sequence.append(id_2)
                    out_list.append(sequence)
        turn_recursion(test2, reply_df)
    else:
        out_list.append(sequence)
This is currently giving me semi-accurate results but instead of getting: [[id1, id2, id4, id6],[id1, id2, id4, id7]], I get: [id1, id2, id4, id6, id7].
I realise I'm probably being a bit dim and that there is a simple solution, but for the life of me, I can't seem to figure out a way of doing this so that it works properly and for any thread length.
Thank you in advance for any suggestions. :)
CodePudding user response:
Use networkx to achieve what you want:
import pandas as pd
import networkx as nx
from collections import defaultdict
data = defaultdict(list)
# Build graph from pandas
G = nx.from_pandas_edgelist(df, source='parent_id', target='id',
                            create_using=nx.DiGraph)
# Find leaves (id3, id5, id7)
leaves = [node for node, degree in G.out_degree() if degree == 0]
# Enumerate all possible paths
for node in df['id']:
    for leaf in leaves:
        for path in nx.all_simple_paths(G, node, leaf):
            data[node].append(path)
Output:
>>> data
defaultdict(list,
            {'id1': [['id1', 'id3'],
                     ['id1', 'id2', 'id5'],
                     ['id1', 'id2', 'id4', 'id6', 'id7']],
             'id2': [['id2', 'id5'], ['id2', 'id4', 'id6', 'id7']],
             'id4': [['id4', 'id6', 'id7']],
             'id6': [['id6', 'id7']]})
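If you only want the complete threads that start at a top-level post (which is what the question asks for), one option is to look up just the root ids in `data`. This is a minimal sketch, assuming a root is any id whose `parent_id` is not itself an id in the table. The `data` and `parent_of` dictionaries below are literal copies of the output above and of the question's table, and the names `parent_of`, `roots`, and `threads` are mine:

```python
# Literal copy of the `data` mapping printed above (a plain dict for brevity).
data = {
    "id1": [["id1", "id3"],
            ["id1", "id2", "id5"],
            ["id1", "id2", "id4", "id6", "id7"]],
    "id2": [["id2", "id5"], ["id2", "id4", "id6", "id7"]],
    "id4": [["id4", "id6", "id7"]],
    "id6": [["id6", "id7"]],
}

# id -> parent_id, copied from the table in the question.
parent_of = {
    "id1": "_ post _",
    "id2": "id1",
    "id3": "id1",
    "id4": "id2",
    "id5": "id2",
    "id6": "id4",
    "id7": "id6",
}

# Assumption: a top-level post is one whose parent is not a known id.
roots = [node for node, parent in parent_of.items() if parent not in parent_of]

# Keep only the paths that start at a root.
threads = [path for root in roots for path in data.get(root, [])]
print(threads)
```

With a real dataframe, `parent_of` could be built with something like `dict(zip(df['id'], df['parent_id']))`. Note that a top-level post with no replies may contribute no path here, depending on whether `data` holds an entry for it.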
If you want to merge the dictionary to your dataframe:
df['replies'] = df['id'].map(data)
print(df)
# Output:
    id       text parent_id                                            replies
0  id1  blah blah  _ post _  [[id1, id3], [id1, id2, id5], [id1, id2, id4, ...
1  id2  blah blah       id1                 [[id2, id5], [id2, id4, id6, id7]]
2  id3  blah blah       id1                                                 []
3  id4  blah blah       id2                                  [[id4, id6, id7]]
4  id5  blah blah       id2                                                 []
5  id6  blah blah       id4                                       [[id6, id7]]
6  id7  blah blah       id6                                                 []
Now you can explode your dataframe:
df = df.explode('replies')
print(df)
# Output:
    id       text parent_id                    replies
0  id1  blah blah  _ post _                 [id1, id3]
0  id1  blah blah  _ post _            [id1, id2, id5]
0  id1  blah blah  _ post _  [id1, id2, id4, id6, id7]
1  id2  blah blah       id1                 [id2, id5]
1  id2  blah blah       id1       [id2, id4, id6, id7]
2  id3  blah blah       id1                        NaN
3  id4  blah blah       id2            [id4, id6, id7]
4  id5  blah blah       id2                        NaN
5  id6  blah blah       id4                 [id6, id7]
6  id7  blah blah       id6                        NaN
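If you would rather not depend on networkx, the same root-to-leaf threads can be sketched with a plain depth-first walk over a parent-to-children mapping. As far as I can tell, the code in the question merges branches because every branch appends to the one shared `sequence` list, and the `return` inside the loop stops after the first reply that has children. The sketch below passes a fresh copy of the path down each branch instead. The function name `root_to_leaf_paths` and the `rows` list are my own, and a root is assumed to be any row whose `parent_id` is not an id in the data:

```python
from collections import defaultdict


def root_to_leaf_paths(rows):
    """Return every root-to-leaf path for an iterable of (id, parent_id) pairs."""
    rows = list(rows)
    ids = {node_id for node_id, _ in rows}

    children = defaultdict(list)
    roots = []
    for node_id, parent_id in rows:
        if parent_id in ids:
            children[parent_id].append(node_id)
        else:
            # Assumption: an unknown parent means this is a top-level post.
            roots.append(node_id)

    paths = []

    def walk(node_id, path):
        # Build a new list for each branch so siblings never share state.
        path = path + [node_id]
        replies = children.get(node_id, [])
        if not replies:
            paths.append(path)  # reached a leaf: the thread is complete
            return
        for reply_id in replies:
            walk(reply_id, path)

    for root in roots:
        walk(root, [])
    return paths


# (id, parent_id) pairs copied from the table in the question.
rows = [
    ("id1", "_ post _"),
    ("id2", "id1"),
    ("id3", "id1"),
    ("id4", "id2"),
    ("id5", "id2"),
    ("id6", "id4"),
    ("id7", "id6"),
]

threads = root_to_leaf_paths(rows)
print(threads)
# [['id1', 'id2', 'id4', 'id6', 'id7'], ['id1', 'id2', 'id5'], ['id1', 'id3']]
```

With a dataframe, the pairs could come from `zip(df['id'], df['parent_id'])`. In the table as posted, id7 is a reply to id6, so the longest thread is id1, id2, id4, id6, id7; if id6 and id7 were both replies to id4, the same function would return them as two separate threads. Very deep threads could hit Python's recursion limit, in which case an explicit stack would be the safer choice.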