I am quite new to all this. I took a short Python bootcamp a while back and am now struggling to get some Instagram data into a format I understand.
Using the following code:
# Importing packages
import json
import re
import collections
import pandas as pd
import datetime as dt
import matplotlib.pyplot as plt
import seaborn as sns
sns.set()
# Loading downloaded instagram data
json_data = {}
data_path = "C:/Users/etc.json"
with open(data_path) as file:
    json_data = json.load(file)
print(json_data)
I get the following output which looks promising:
{'relationships_followers': [{'title': '', 'media_list_data': [], 'string_list_data': [{'href': 'https://www.instagram.com/username1', 'value': 'username1', 'timestamp': 1655411505}]}, {'title': '', 'media_list_data': [], 'string_list_data': [{'href': 'https://www.instagram.com/username2', 'value': 'username2', 'timestamp': 1655149264}]}, {'title': '', 'media_list_data': [], 'string_list_data': [{'href': 'https://www.instagram.com/username3', 'value': 'username3', 'timestamp': 1655129904}]}, etc.....
type = dict
But when I try to convert it into a pandas DataFrame, it comes out strangely:
dfp = pd.read_json(data_path, orient = 'records')
print(dfp)
print(type(dfp))
Output:
relationships_followers
0 {'title': '', 'media_list_data': [], 'string_l...
1 {'title': '', 'media_list_data': [], 'string_l...
2 {'title': '', 'media_list_data': [], 'string_l...
3 {'title': '', 'media_list_data': [], 'string_l...
4 {'title': '', 'media_list_data': [], 'string_l...
.. ...
575 {'title': '', 'media_list_data': [], 'string_l...
576 {'title': '', 'media_list_data': [], 'string_l...
577 {'title': '', 'media_list_data': [], 'string_l...
578 {'title': '', 'media_list_data': [], 'string_l...
579 {'title': '', 'media_list_data': [], 'string_l...
[580 rows x 1 columns]
<class 'pandas.core.frame.DataFrame'>
How do I stop pandas from treating "relationships_followers" as a single lonely column?
I am trying to get output like the below:
     href         value        timestamp
0    www.inst...  username1    DDMMYY
1    www.inst...  username2    DDMMYY
2    www.inst...  username3    DDMMYY
3    www.inst...  username4    DDMMYY
...
578  www.inst...  username578  DDMMYY
579  www.inst...  username579  DDMMYY
CodePudding user response:
Try doing this to your master dict.
# 'relationships_followers' holds a list of dicts, each with a 'string_list_data' list
worthy_data = json_data.get('relationships_followers', [])
wanted_dicts = [entry for item in worthy_data for entry in item['string_list_data']]
pd.DataFrame(wanted_dicts)
CodePudding user response:
In this case you can use pd.json_normalize() to extract the href, value, and timestamp columns from the string_list_data list inside each entry.
pd.json_normalize(json_data['relationships_followers'], 'string_list_data')
# Output :
# href value timestamp
# 0 https://www.instagram.com/username1 username1 1655411505
# 1 https://www.instagram.com/username2 username2 1655149264
# 2 https://www.instagram.com/username3 username3 1655129904
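If you also want the timestamp shown as DDMMYY, as in the desired output in the question, something like the following might work. This is only a sketch: the sample dict is cut down from the question's printed output, and it assumes the timestamps are Unix epoch seconds in UTC, which the question does not confirm.

```python
import pandas as pd

# Sample data copied from the question's printed output (first two followers only).
json_data = {
    "relationships_followers": [
        {
            "title": "",
            "media_list_data": [],
            "string_list_data": [
                {
                    "href": "https://www.instagram.com/username1",
                    "value": "username1",
                    "timestamp": 1655411505,
                }
            ],
        },
        {
            "title": "",
            "media_list_data": [],
            "string_list_data": [
                {
                    "href": "https://www.instagram.com/username2",
                    "value": "username2",
                    "timestamp": 1655149264,
                }
            ],
        },
    ]
}

# Flatten every record found under each entry's 'string_list_data' list.
df = pd.json_normalize(json_data["relationships_followers"], "string_list_data")

# Assumption: timestamps are Unix epoch seconds (UTC).
# Convert to datetime, then format as a DDMMYY string.
df["timestamp"] = pd.to_datetime(df["timestamp"], unit="s").dt.strftime("%d%m%y")

print(df)
```

Note that strftime turns the column into plain strings. If you plan to sort or filter by date later, it may be better to keep the datetime column and only format it when displaying.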