Home > front end >  Dynamic Creation Of Pandas Column for Complex Structure with Array of Jsons
Dynamic Creation Of Pandas Column for Complex Structure with Array of Jsons

Time:08-04

TL:DR; Given an array of arrays of JSON inside a dict, I want to split each array of JSON Object into 2 Columns while re-using the already created columns for the previous array!

Hello,

I have an array of dicts in python which are a combination of YT Analytics Response. The array (after my processing) looks like that:

[{
 "Video ID": "XJasdj1231",
 "Video Title": "Hey, It is ME!",
 "Video Views": 21,110,
 "Video Tags": [{"abc":0},{"sad":0},{"happy":0}],
 "Video Search Tags": [{"jih": 9}, {"XA":3}]
 "Non Search Tags": [{"Messi":12}, {"Cr7":3}]
}, {
 "Video ID": "SADSAD",
 "Video Title": "Nope",
 "Video Views": 120,000,
 "Video Tags": [{"CSA":0},{"ASDF":0},{"GAS":0}, {"Liquid":0}, {"Informa":0}],
 "Video Search Tags": [{"jih": 9}, {"XA":3}, {"Xia":3}, {"OSAD":3}]
 "Non Search Tags": [{"Messi":12}, {"Cr7":3}, {"Mo":3}, {"Mera":3}, {"Micko":3}]
}]

So, each JSON object represents a unique video. As show, the first 3 attributes (Video ID, Video Title and views) are relatively easy to convert into PD Dataframe using the pd.DataFrame.from_dict(ARRAY) function.

PS: Used quick notations just to make it shorter

This function if used on the whole Array it will produce output like that:

ID           Title              Views    Tags          Search Tags    Non-search tags
 
XJasdj1231  Hey, It is ME!      XXXX    [[{"abc":0},     SAME LIKE VIDEO TAGS
                                       {"sad":0},
                                       {"happy":0}],

SADSAD        Nope              XXXX    SAME LIKE VIDEO TAGS    SAME LIKE VIDEO TAGS

As you see the whole array of JSON is pasted in the cell. What I would like to achieve is a dynamic creation of columns.

For each tag in "Video Tags": [{"abc":0},{"sad":0},{"happy":0}] I want to create the columns like that

Tag 1 Views 1 Tag 2 Views 2 Tag 3 Views 3 ......Tag X Views X
abc    0       sad    0      Happy  0

Each JSON object of the array is split into Tag {NUM} Views {SAME NUM}

The same for Video Search Tags & Non Search Tags With exception that I will make the column names

Search 1 Views 1 .... Search X Views X

Key thing here is that, not all videos have the same number of tags, some videos will have 3 or 30!

So, for example, if the first video had 30 tags and I created 30 columns for it with namings Tag 30 Views 30. And the second video had 10 tags only, I want it to be assigned in the first 10 Tags and 10 Views columns that were already made.

If the third video had 40 tags, I want it to re-use the first 30 and create an additional 10!

I thought about using PD.data_frame.serie but I stumbled upon the re-use columns feature.

Any ideas how can I achieve something like that?

Sorry for the long post!

Best Regards

CodePudding user response:

Sample Data:

df=pd.DataFrame([{
     "Video ID": "XJasdj1231",
     "Video Title": "Hey, It is ME!",
     "Video Views": 21110,
     "Video Tags": [{"abc":0},{"sad":0},{"happy":0}],
     "Video Search Tags": [{"jih": 9}, {"XA":3}],
     "Non Search Tags": [{"Messi":12}, {"Cr7":3}]
    }, {
     "Video ID": "SADSAD",
     "Video Title": "Nope",
     "Video Views": 120000,
     "Video Tags": [{"CSA":0},{"ASDF":0},{"GAS":0}, {"Liquid":0}, {"Informa":0}],
     "Video Search Tags": [{"jih": 9}, {"XA":3}, {"Xia":3}, {"OSAD":3}],
     "Non Search Tags": [{"Messi":12}, {"Cr7":3}, {"Mo":3}, {"Mera":3}, {"Micko":3}]
    }])
    df.head()

Solution:

  1. Generate empty dataframe with enough Tags, Views columns. For this purpose the length of all lists is read
  2. For each index of the dataframe get all Tags. By explode() pandas flatten the original list.
  3. Extract the information of each item of tags and write it to the dataframe

I was not succesfull with eliminating the for loops at the moment - thus there is probably a more elegant approach.

Code:

maxTags=max(df['Video Tags'].apply(lambda x: len(x)))
df_tags=pd.DataFrame(index=df.index, columns=["{0} {1}".format(title, num 1) for num in range (maxTags) for title in ['Tag', 'Views']])

for idx in df.index:
    tags=df['Video Tags'].explode().loc[idx]
    for i,tag in enumerate(tags):
        name=list(tag.keys())[0]
        df_tags.loc[idx, "Views {}".format(i 1)]=tag[name]
        df_tags.loc[idx, "Tag {}".format(i 1)]=name

CodePudding user response:

You can construct a df from your dict as you did. Then iterate through the rows and assign the columns dynamically. After finishing, remove the original column name from df. Please check the code below:

dict = [{
         "Video ID": "XJasdj1231",
         "Video Title": "Hey, It is ME!",
         "Video Views": 21110,
         "Video Tags": [{"abc":0},{"sad":0},{"happy":0}],
         "Video Search Tags": [{"jih": 9}, {"XA":3}],
         "Non Search Tags": [{"Messi":12}, {"Cr7":3}]
        }, {
         "Video ID": "SADSAD",
         "Video Title": "Nope",
         "Video Views": 120000,
         "Video Tags": [{"CSA":0},{"ASDF":0},{"GAS":0}, {"Liquid":0}, {"Informa":0}],
         "Video Search Tags": [{"jih": 9}, {"XA":3}, {"Xia":3}, {"OSAD":3}],
         "Non Search Tags": [{"Messi":12}, {"Cr7":3}, {"Mo":3}, {"Mera":3}, {"Micko":3}]
        }]

df = pd.DataFrame(dict)

def dynamic_cols(df, col, tag_nm):
    for index, row in df.iterrows():
        for i, tag_dict in enumerate(row[col], 1):
            for tag in tag_dict:
                views = tag_dict[tag]

                # define column names
                tag_colnm = f"{tag_nm} { i }"
                views_colnm = f"Views { i }"

                # set tag and views columns
                df.at[index, tag_colnm] = tag
                df.at[index, views_colnm] = views

    # remove column from df
    df = df[df.columns.difference([col])]

    return df

df = dynamic_cols(df, "Video Tags", "Tag")
df = dynamic_cols(df, "Video Search Tags", "Search")
df = dynamic_cols(df, "Non Search Tags", "Search")

Result:

        Search 1    Search 2    Search 3    Search 4    Search 5    Tag 1    Tag 2    Tag 3    Tag 4    Tag 5    Video ID    Video Title       Video Views    Views 1    Views 2    Views 3    Views 4    Views 5
--  ----------  ----------  ----------  ----------  ----------  -------  -------  -------  -------  -------  ----------  --------------  -------------  ---------  ---------  ---------  ---------  ---------
0  Messi       Cr7         nan         nan         nan         abc      sad      happy    nan      nan      XJasdj1231  Hey, It is ME!          21110         12          3          0        nan        nan
1  Messi       Cr7         Mo          Mera        Micko       CSA      ASDF     GAS      Liquid   Informa  SADSAD      Nope                   120000         12          3          3          3          3

EDIT 1: added Views columns dynamic col. creation in function for the sake of reusability.

  • Related