TL:DR; Given an array of arrays of JSON inside a dict, I want to split each array of JSON Object into 2 Columns while re-using the already created columns for the previous array!
Hello,
I have an array of dicts in python which are a combination of YT Analytics Response. The array (after my processing) looks like that:
[{
"Video ID": "XJasdj1231",
"Video Title": "Hey, It is ME!",
"Video Views": 21,110,
"Video Tags": [{"abc":0},{"sad":0},{"happy":0}],
"Video Search Tags": [{"jih": 9}, {"XA":3}]
"Non Search Tags": [{"Messi":12}, {"Cr7":3}]
}, {
"Video ID": "SADSAD",
"Video Title": "Nope",
"Video Views": 120,000,
"Video Tags": [{"CSA":0},{"ASDF":0},{"GAS":0}, {"Liquid":0}, {"Informa":0}],
"Video Search Tags": [{"jih": 9}, {"XA":3}, {"Xia":3}, {"OSAD":3}]
"Non Search Tags": [{"Messi":12}, {"Cr7":3}, {"Mo":3}, {"Mera":3}, {"Micko":3}]
}]
So, each JSON object represents a unique video. As show, the first 3 attributes (Video ID, Video Title and views) are relatively easy to convert into PD Dataframe using the pd.DataFrame.from_dict(ARRAY)
function.
PS: Used quick notations just to make it shorter
This function if used on the whole Array it will produce output like that:
ID Title Views Tags Search Tags Non-search tags
XJasdj1231 Hey, It is ME! XXXX [[{"abc":0}, SAME LIKE VIDEO TAGS
{"sad":0},
{"happy":0}],
SADSAD Nope XXXX SAME LIKE VIDEO TAGS SAME LIKE VIDEO TAGS
As you see the whole array of JSON is pasted in the cell. What I would like to achieve is a dynamic creation of columns.
For each tag in "Video Tags": [{"abc":0},{"sad":0},{"happy":0}]
I want to create the columns like that
Tag 1 Views 1 Tag 2 Views 2 Tag 3 Views 3 ......Tag X Views X
abc 0 sad 0 Happy 0
Each JSON object of the array is split into Tag {NUM} Views {SAME NUM}
The same for Video Search Tags & Non Search Tags
With exception that I will make the column names
Search 1 Views 1 .... Search X Views X
Key thing here is that, not all videos have the same number of tags, some videos will have 3 or 30!
So, for example, if the first video had 30 tags and I created 30 columns for it with namings Tag 30 Views 30. And the second video had 10 tags only, I want it to be assigned in the first 10 Tags and 10 Views columns that were already made.
If the third video had 40 tags, I want it to re-use the first 30 and create an additional 10!
I thought about using PD.data_frame.serie
but I stumbled upon the re-use columns feature.
Any ideas how can I achieve something like that?
Sorry for the long post!
Best Regards
CodePudding user response:
Sample Data:
df=pd.DataFrame([{
"Video ID": "XJasdj1231",
"Video Title": "Hey, It is ME!",
"Video Views": 21110,
"Video Tags": [{"abc":0},{"sad":0},{"happy":0}],
"Video Search Tags": [{"jih": 9}, {"XA":3}],
"Non Search Tags": [{"Messi":12}, {"Cr7":3}]
}, {
"Video ID": "SADSAD",
"Video Title": "Nope",
"Video Views": 120000,
"Video Tags": [{"CSA":0},{"ASDF":0},{"GAS":0}, {"Liquid":0}, {"Informa":0}],
"Video Search Tags": [{"jih": 9}, {"XA":3}, {"Xia":3}, {"OSAD":3}],
"Non Search Tags": [{"Messi":12}, {"Cr7":3}, {"Mo":3}, {"Mera":3}, {"Micko":3}]
}])
df.head()
Solution:
- Generate empty dataframe with enough
Tags, Views
columns. For this purpose the length of all lists is read - For each index of the dataframe get all Tags. By
explode()
pandas flatten the original list. - Extract the information of each item of tags and write it to the dataframe
I was not succesfull with eliminating the for loops at the moment - thus there is probably a more elegant approach.
Code:
maxTags=max(df['Video Tags'].apply(lambda x: len(x)))
df_tags=pd.DataFrame(index=df.index, columns=["{0} {1}".format(title, num 1) for num in range (maxTags) for title in ['Tag', 'Views']])
for idx in df.index:
tags=df['Video Tags'].explode().loc[idx]
for i,tag in enumerate(tags):
name=list(tag.keys())[0]
df_tags.loc[idx, "Views {}".format(i 1)]=tag[name]
df_tags.loc[idx, "Tag {}".format(i 1)]=name
CodePudding user response:
You can construct a df
from your dict
as you did. Then iterate through the rows and assign the columns dynamically. After finishing, remove the original column name from df
. Please check the code below:
dict = [{
"Video ID": "XJasdj1231",
"Video Title": "Hey, It is ME!",
"Video Views": 21110,
"Video Tags": [{"abc":0},{"sad":0},{"happy":0}],
"Video Search Tags": [{"jih": 9}, {"XA":3}],
"Non Search Tags": [{"Messi":12}, {"Cr7":3}]
}, {
"Video ID": "SADSAD",
"Video Title": "Nope",
"Video Views": 120000,
"Video Tags": [{"CSA":0},{"ASDF":0},{"GAS":0}, {"Liquid":0}, {"Informa":0}],
"Video Search Tags": [{"jih": 9}, {"XA":3}, {"Xia":3}, {"OSAD":3}],
"Non Search Tags": [{"Messi":12}, {"Cr7":3}, {"Mo":3}, {"Mera":3}, {"Micko":3}]
}]
df = pd.DataFrame(dict)
def dynamic_cols(df, col, tag_nm):
for index, row in df.iterrows():
for i, tag_dict in enumerate(row[col], 1):
for tag in tag_dict:
views = tag_dict[tag]
# define column names
tag_colnm = f"{tag_nm} { i }"
views_colnm = f"Views { i }"
# set tag and views columns
df.at[index, tag_colnm] = tag
df.at[index, views_colnm] = views
# remove column from df
df = df[df.columns.difference([col])]
return df
df = dynamic_cols(df, "Video Tags", "Tag")
df = dynamic_cols(df, "Video Search Tags", "Search")
df = dynamic_cols(df, "Non Search Tags", "Search")
Result:
Search 1 Search 2 Search 3 Search 4 Search 5 Tag 1 Tag 2 Tag 3 Tag 4 Tag 5 Video ID Video Title Video Views Views 1 Views 2 Views 3 Views 4 Views 5
-- ---------- ---------- ---------- ---------- ---------- ------- ------- ------- ------- ------- ---------- -------------- ------------- --------- --------- --------- --------- ---------
0 Messi Cr7 nan nan nan abc sad happy nan nan XJasdj1231 Hey, It is ME! 21110 12 3 0 nan nan
1 Messi Cr7 Mo Mera Micko CSA ASDF GAS Liquid Informa SADSAD Nope 120000 12 3 3 3 3
EDIT 1: added Views columns dynamic col. creation in function for the sake of reusability.