I have the following telegram export JSON dataset:
import pandas as pd
df = pd.read_json("data/result.json")
>>>df.colums
Index(['name', 'type', 'id', 'messages'], dtype='object')
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
# Sample output
sample_df = pd.DataFrame({"messages": [
{"id": 11, "from": "user3984", "text": "Do you like soccer?"},
{"id": 312, "from": "user837", "text": ['Not sure', {'type': 'hashtag', 'text': '#confused'}]},
{"id": 4324, "from": "user3984", "text": ['O ', {'type': 'mention', 'text': '@user87324'}, ' really?']}
]})
Within df
, there's a "messages" column, which has the following output:
>>> df["messages"]
0 {'id': -999713937, 'type': 'service', 'date': ...
1 {'id': -999713936, 'type': 'service', 'date': ...
2 {'id': -999713935, 'type': 'message', 'date': ...
3 {'id': -999713934, 'type': 'message', 'date': ...
4 {'id': -999713933, 'type': 'message', 'date': ...
...
22377 {'id': 22102, 'type': 'message', 'date': '2022...
22378 {'id': 22103, 'type': 'message', 'date': '2022...
22379 {'id': 22104, 'type': 'message', 'date': '2022...
22380 {'id': 22105, 'type': 'message', 'date': '2022...
22381 {'id': 22106, 'type': 'message', 'date': '2022...
Name: messages, Length: 22382, dtype: object
Within messages, there's a particular key named "text", and that's the place I want to focus. Turns out when you explore the data, text column can have:
A single text:
>>> df["messages"][5]["text"]
'JAJAJAJAJAJAJA'
>>> df["messages"][22262]["text"]
'No creo'
But sometimes it's nested. Like the following:
>>> df["messages"][22373]["text"]
['O ', {'type': 'mention', 'text': '@user87324'}, ' really?']
>>> df["messages"][22189]["text"]
['The average married couple has sex roughly once a week. ', {'type': 'mention', 'text': '@googlefactss'}, ' ', {'type': 'hashtag', 'text': '#funfact'}]
>>> df["messages"][22345]["text"]
[{'type': 'mention', 'text': '@user817430'}]
In case for nested data, if I want to grab the main text, I can do the following:
>>> df["messages"][22373]["text"][0]
'O '
>>> df["messages"][22189]["text"][0]
'The average married couple has sex roughly once a week. '
>>>
From here, everything seems ok. However, the problem arrives when I do the for loop. If I try the following:
for item in df["messages"]:
tg_id = item.get("id", "None")
tg_type = item.get("type", "None")
tg_date = item.get("date", "None")
tg_from = item.get("from", "None")
tg_text = item.get("text", "None")
print(tg_id, tg_from, tg_text)
A sample output is:
21263 user3984 jajajajaja
21264 user837 ['Not sure', {'type': 'hashtag', 'text': '#confused'}]
21265 user3984 What time is it?✋
MY ASK: How to flatten the rows? I need the following (and store that in a data frame):
21263 user3984 jajajajaja
21264 user837 Not sure
21265 user837 type: hashtag
21266 user837 text: #confused
21267 user3984 What time is it?✋
I tried to detect "text" type like this:
for item in df["messages"]:
tg_id = item.get("id", "None")
tg_type = item.get("type", "None")
tg_date = item.get("date", "None")
tg_from = item.get("from", "None")
tg_text = item.get("text", "None")
if type(tg_text) == list:
tg_text = tg_text[0]
print(tg_id, tg_from, tg_text)
With this I only grab the first text, but I'm expecting to grab the other fields as well or to 'flatten' the data.
I also tried:
for item in df["messages"]:
tg_id = item.get("id", "None")
tg_type = item.get("type", "None")
tg_date = item.get("date", "None")
tg_from = item.get("from", "None")
tg_text = item.get("text", "None")
if type(tg_text) == list:
tg_text = tg_text[0]
tg_second = tg_text[1]["text"]
print(tg_id, tg_from, tg_text, tg_second)
But no luck because indices are variable, length from messages are variable too.
In addition, even if the output weren't close of my desired solution, I also tried:
for item in df["messages"]:
tg_text = item.get("text", "None")
if type(tg_text) == list:
for i in tg_text:
print(item, i)
mydict = {}
for k, v in df.items():
print(k, v)
mydict[k] = v
# Used df["text"].explode()
# Used json_normalize but no luck
Any thoughts?
CodePudding user response:
Assuming a dataframe like the following:
df = pd.DataFrame({"messages": [
{"id": 21263, "from": "user3984", "text": "jajajajaja"},
{"id": 21264, "from": "user837", "text": ['Not sure', {'type': 'hashtag', 'text': '#confused'}]},
{"id": 21265, "from": "user3984", "text": ['O ', {'type': 'mention', 'text': '@user87324'}, ' really?']}
]})
First, expand the messages
dictionaries into separate id
, from
and text
columns.
expanded = pd.concat([df.drop("messages", axis=1), pd.json_normalize(df["messages"])], axis=1)
Then explode the dataframe to have a row for each entry in text
:
exploded = expanded.explode("text")
Then expand the dictionaries that are in some of the entries, converting them to lists of text:
def convert_dict(entry):
if type(entry) is dict:
return [f"{k}: {v}" for k, v in entry.items()]
else:
return entry
exploded["text"] = exploded["text"].apply(convert_dict)
Finally, explode again to separate the converted dicts to separate rows.
final = exploded.explode("text")
The resulting output should look like this
id from text
0 21263 user3984 jajajajaja
1 21264 user837 Not sure
1 21264 user837 type: hashtag
1 21264 user837 text: #confused
2 21265 user3984 O
2 21265 user3984 type: mention
2 21265 user3984 text: @user87324
2 21265 user3984 really?
CodePudding user response:
Just to share some ideas to flatten your list,
def flatlist(srclist):
flatlist=[]
if srclist: #check if srclist is not None
for item in srclist:
if(type(item) == str): #check if item is type of string
flatlist.append(item)
if(type(item) == dict): #check if item is type of dict
for x in item:
flatlist.append(x ' ' item[x]) #combine key and value
return flatlist
for item in df["messages"]:
tg_text = item.get("text", "None")
flat_list = flatlist(tg_text) # get the flattened list
for tg in flat_list: # loop through the list and get the data you want
tg_id = item.get("id", "None")
tg_from = item.get("from", "None")
print(tg_id, tg_from, tg)