Grab specific strings within a for loop with variable nested length-CodePudding

I have the following telegram export JSON dataset:

import pandas as pd

df = pd.read_json("data/result.json")

>>>df.colums
Index(['name', 'type', 'id', 'messages'], dtype='object')

>>> type(df)
<class 'pandas.core.frame.DataFrame'>

# Sample output
sample_df = pd.DataFrame({"messages": [
    {"id": 11, "from": "user3984", "text": "Do you like soccer?"},
    {"id": 312, "from": "user837", "text": ['Not sure', {'type': 'hashtag', 'text': '#confused'}]}, 
    {"id": 4324, "from": "user3984", "text": ['O ', {'type': 'mention', 'text': '@user87324'}, ' really?']}
]})

Within df, there's a "messages" column, which has the following output:

>>> df["messages"]
0        {'id': -999713937, 'type': 'service', 'date': ...
1        {'id': -999713936, 'type': 'service', 'date': ...
2        {'id': -999713935, 'type': 'message', 'date': ...
3        {'id': -999713934, 'type': 'message', 'date': ...
4        {'id': -999713933, 'type': 'message', 'date': ...
                               ...                        
22377    {'id': 22102, 'type': 'message', 'date': '2022...
22378    {'id': 22103, 'type': 'message', 'date': '2022...
22379    {'id': 22104, 'type': 'message', 'date': '2022...
22380    {'id': 22105, 'type': 'message', 'date': '2022...
22381    {'id': 22106, 'type': 'message', 'date': '2022...
Name: messages, Length: 22382, dtype: object

Within messages, there's a particular key named "text", and that's the place I want to focus. Turns out when you explore the data, text column can have:

A single text:

>>> df["messages"][5]["text"]
'JAJAJAJAJAJAJA'
>>> df["messages"][22262]["text"]
'No creo'

But sometimes it's nested. Like the following:

>>> df["messages"][22373]["text"]
['O ', {'type': 'mention', 'text': '@user87324'}, ' really?']
>>> df["messages"][22189]["text"]
['The average married couple has sex roughly once a week. ', {'type': 'mention', 'text': '@googlefactss'}, ' ', {'type': 'hashtag', 'text': '#funfact'}]
>>> df["messages"][22345]["text"]
[{'type': 'mention', 'text': '@user817430'}]

In case for nested data, if I want to grab the main text, I can do the following:

>>> df["messages"][22373]["text"][0]
'O '
>>> df["messages"][22189]["text"][0]
'The average married couple has sex roughly once a week. '
>>>

From here, everything seems ok. However, the problem arrives when I do the for loop. If I try the following:

for item in df["messages"]:
    tg_id = item.get("id", "None")
    tg_type = item.get("type", "None")
    tg_date = item.get("date", "None")
    tg_from = item.get("from", "None")
    tg_text = item.get("text", "None")
        
    print(tg_id, tg_from, tg_text)

A sample output is:

21263 user3984 jajajajaja
21264 user837 ['Not sure', {'type': 'hashtag', 'text': '#confused'}]
21265 user3984 What time is it?✋

MY ASK: How to flatten the rows? I need the following (and store that in a data frame):

21263 user3984 jajajajaja
21264 user837 Not sure
21265 user837 type: hashtag
21266 user837 text: #confused
21267 user3984 What time is it?✋

I tried to detect "text" type like this:

for item in df["messages"]:
    tg_id = item.get("id", "None")
    tg_type = item.get("type", "None")
    tg_date = item.get("date", "None")
    tg_from = item.get("from", "None")
    tg_text = item.get("text", "None")

    if type(tg_text) == list:
        tg_text = tg_text[0]
    
    print(tg_id, tg_from, tg_text)

With this I only grab the first text, but I'm expecting to grab the other fields as well or to 'flatten' the data.

I also tried:

for item in df["messages"]:
    tg_id = item.get("id", "None")
    tg_type = item.get("type", "None")
    tg_date = item.get("date", "None")
    tg_from = item.get("from", "None")
    tg_text = item.get("text", "None")

    if type(tg_text) == list:
        tg_text = tg_text[0]
        tg_second = tg_text[1]["text"]
    
    print(tg_id, tg_from, tg_text, tg_second)

But no luck because indices are variable, length from messages are variable too.

In addition, even if the output weren't close of my desired solution, I also tried:

for item in df["messages"]:
    tg_text = item.get("text", "None")

    if type(tg_text) == list:
        for i in tg_text:
            print(item, i)

mydict = {}
for k, v in df.items():
    print(k, v)
    mydict[k] = v

# Used df["text"].explode()
# Used json_normalize but no luck

Any thoughts?

CodePudding user response：

Assuming a dataframe like the following:

df = pd.DataFrame({"messages": [
    {"id": 21263, "from": "user3984", "text": "jajajajaja"},
    {"id": 21264, "from": "user837", "text": ['Not sure', {'type': 'hashtag', 'text': '#confused'}]}, 
    {"id": 21265, "from": "user3984", "text": ['O ', {'type': 'mention', 'text': '@user87324'}, ' really?']}
]})

First, expand the messages dictionaries into separate id, from and text columns.

 expanded = pd.concat([df.drop("messages", axis=1), pd.json_normalize(df["messages"])], axis=1)

Then explode the dataframe to have a row for each entry in text:

exploded = expanded.explode("text")

Then expand the dictionaries that are in some of the entries, converting them to lists of text:

def convert_dict(entry):
    if type(entry) is dict:
        return [f"{k}: {v}" for k, v in entry.items()]
    else:
        return entry

exploded["text"] = exploded["text"].apply(convert_dict)

Finally, explode again to separate the converted dicts to separate rows.

final = exploded.explode("text")

The resulting output should look like this

      id      from              text
0  21263  user3984        jajajajaja
1  21264   user837          Not sure
1  21264   user837     type: hashtag
1  21264   user837   text: #confused
2  21265  user3984                O 
2  21265  user3984     type: mention
2  21265  user3984  text: @user87324
2  21265  user3984           really?

CodePudding user response：

Just to share some ideas to flatten your list,

def flatlist(srclist):
    flatlist=[]
    if srclist: #check if srclist is not None
        for item in srclist:
            if(type(item) == str): #check if item is type of string
                flatlist.append(item)
            if(type(item) == dict): #check if item is type of dict
                for x in item:
                    flatlist.append(x   ' '   item[x]) #combine key and value
    return flatlist

for item in df["messages"]:
    tg_text = item.get("text", "None")
    flat_list = flatlist(tg_text) # get the flattened list
    for tg in flat_list: # loop through the list and get the data you want
        tg_id = item.get("id", "None")
        tg_from = item.get("from", "None")
    
        print(tg_id, tg_from, tg)