Normalizing Twitter JSON with Python-CodePudding

Relevant packages:

import pandas as pd
import json

I am struggling to normalize some JSON output from the Twitter API. Everything is being done with Python. I am using the pd.json_normalize function to try to normalize the JSON output below, and I have attempted many methods. The problem I seem to be running into is that the referenced tweets portion of the response does not show up in every response (presumably because there are no referenced Tweets for that response). I have not seen this situation covered on any forums or articles online, so hoping that we can get a good solution here for myself and anyone else running into this issue to reference.

The code I have been using to try to normalize looks like

df_test=pd.json_normalize(json_response, 
                          record_path='data',
                          errors='ignore'
                          )

and I have tinkered around with the parameters a lot but cannot get it to work. Hoping that a wiser coder can help me solve this issue.

Here is the JSON:

{
    "data": [
        {
            "author_id": "2244994945",
            "created_at": "2022-08-12T17:00:25.000Z",
            "id": "1558136333867970560",
            "public_metrics": {
                "like_count": 11,
                "quote_count": 0,
                "reply_count": 0,
                "retweet_count": 7
            },
            "referenced_tweets": [
                {
                    "id": "1558084759812710401",
                    "type": "quoted"
                }
            ],
            "text": "\ud83d\udce3There\u2019s still time to register for #Chirp!\u23f"
        },
        {
            "author_id": "2244994945",
            "created_at": "2022-08-11T22:04:00.000Z",
            "id": "1557850345392377856",
            "public_metrics": {
                "like_count": 18,
                "quote_count": 0,
                "reply_count": 2,
                "retweet_count": 5
            },
            "text": "Discover how @ListenFirst, a social analytics platform and Twitter @OfficialPartner, improved their efficiency and accuracy by leveraging the v2 API. \ud83d\udca1\ud83d\udcbb\n\nRead it here \u2b07\ufe0f \"
        },
        {
            "author_id": "2244994945",
            "created_at": "2022-08-11T21:35:15.000Z",
            "id": "1557843109681123328",
            "public_metrics": {
                "like_count": 14,
                "quote_count": 0,
                "reply_count": 1,
                "retweet_count": 3
            },
            "referenced_tweets": [
                {
                    "id": "1553987008355254272",
                    "type": "quoted"
                }
            ],
            "text": "\ud83d\ude80\ud83e\udd29\u2728"
        },
        {
            "author_id": "2244994945",
            "created_at": "2022-08-11T21:25:01.000Z",
            "id": "1557840534252662784",
            "public_metrics": {
                "like_count": 1,
                "quote_count": 0,
                "reply_count": 0,
                "retweet_count": 0
            },
            "referenced_tweets": [
                {
                    "id": "1557737011041038337",
                    "type": "replied_to"
                }
            ],
            "text": "@talkshrey @talkshrey this is awesome, we can\u2019t wait to see what you build! \ud83d\ude80"
        },
        {
            "author_id": "2244994945",
            "created_at": "2022-08-11T18:00:09.000Z",
            "id": "1557788977842765827",
            "public_metrics": {
                "like_count": 18,
                "quote_count": 0,
                "reply_count": 2,
                "retweet_count": 8
            },
            "text": "\ud83d\udce3 There is only ONE DAY LEFT to register to attend the #Chirp Developer Conference in person! \n\nGet your applications in by 11:59pm PT on Friday, August 12, so you don\u2019t miss out on your chance to attend. \ud83d\udcc5\ud83d\udc40\n\nRegister here \u2935\ufe0f "
        },
        {
            "author_id": "1555310752176668675",
            "created_at": "2022-08-11T16:30:53.000Z",
            "id": "1557766512915742722",
            "public_metrics": {
                "like_count": 0,
                "quote_count": 0,
                "reply_count": 0,
                "retweet_count": 0
            },
            "text": "yo! what's wrong with the bird app today??\ni can't follow, neither be followed!!?\n#twitter #TwitterDown #twitterdev"
        },
        {
            "author_id": "1550706584204021762",
            "created_at": "2022-08-11T05:59:09.000Z",
            "id": "1557607534596431873",
            "public_metrics": {
                "like_count": 1,
                "quote_count": 0,
                "reply_count": 0,
                "retweet_count": 0
            },
            "referenced_tweets": [
                {
                    "id": "1539383247355793408",
                    "type": "replied_to"
                }
            ],
            "text": "@fake_Awareness @Eminem I was thinking about enter #TwitterDev #Chirp and writing a application for #HackAThon that would find every scammers account by AI algorithms."
        },
        {
            "author_id": "1512404349598957573",
            "created_at": "2022-08-10T18:12:36.000Z",
            "id": "1557429723357401088",
            "public_metrics": {
                "like_count": 0,
                "quote_count": 0,
                "reply_count": 0,
                "retweet_count": 0
            },
            "text": "#TwitterDev"
        },
        {
            "author_id": "1389763639159230464",
            "created_at": "2022-08-10T15:22:48.000Z",
            "id": "1557386991607627776",
            "public_metrics": {
                "like_count": 0,
                "quote_count": 0,
                "reply_count": 0,
                "retweet_count": 0
            },
            "text": "#twitterdeveloper #twitterdev #twitterapi\nwhy can i only make sense of it doesn't drast\u2026."
        },
        {
            "author_id": "2244994945",
            "created_at": "2022-08-09T17:58:58.000Z",
            "id": "1557063904655654912",
            "public_metrics": {
                "like_count": 11,
                "quote_count": 0,
                "reply_count": 1,
                "retweet_count": 4
            },
            "referenced_tweets": [
                {
                    "id": "1557063868551114752",
                    "type": "replied_to"
                }
            ],
            "text": "\u23f3 Time is running out to enter the #ChirpDevChallenge!\n\nSubmit your app by August 19th with an app in one of our three categories and help people unlock brand new experiences on Twitter. \ud83d\udcbb\u2699\ufe0f\n\nJoin the challenge \u2b07"
        }
    ],
    "meta": {
        "newest_id": "1558136333867970560",
        "oldest_id": "1557063904655654912",
        "result_count": 10
    }
}

CodePudding user response：

If dct is your dictionary from the question you can do:

df = pd.DataFrame(dct["data"]).explode("referenced_tweets")
df["referenced_id"] = np.where(
    df["referenced_tweets"].notna(), df["referenced_tweets"].str["id"], ""
)

df["referenced_type"] = np.where(
    df["referenced_tweets"].notna(), df["referenced_tweets"].str["type"], ""
)

df.pop("referenced_tweets")

df = pd.concat([df, df.pop("public_metrics").apply(pd.Series)], axis=1)

print(df)

Prints:

             author_id                created_at                   id                                                                                                                                                                                                                                    text        referenced_id referenced_type  like_count  quote_count  reply_count  retweet_count
0           2244994945  2022-08-12T17:00:25.000Z  1558136333867970560