I have a dataframe with a column containing lists of dictionaries, which looks like this:
[{'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137417352000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137417352000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}]
[{'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137166688000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137166688000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}]
[{'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137288213000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137288213000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}]
[{'first_open_time': {'int_value': '1653195600000', 'set_timestamp_micros': '1653193960416000'}}]
[{'ga_session_number': {'int_value': '3', 'set_timestamp_micros': '1653165977727000'}}, {'User_activity': {'string_value': '1_10', 'set_timestamp_micros': '1653109414730000'}}, {'Minutes_in_app': {'string_value': '1_10', 'set_timestamp_micros': '1653109414735000'}}, {'first_open_time': {'int_value': '1653102000000', 'set_timestamp_micros': '1653098744032000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653109414734000'}}, {'ga_session_id': {'int_value': '1653165977', 'set_timestamp_micros': '1653165977727000'}}]
I expected json_normalize to put the data into columns like
first_open_time.int_value, first_open_time.set_timestamp_micros, User_dedication.string_value, User_dedication.set_timestamp_micros, etc.
Instead, it just split it into 7 columns containing dictionaries:
{'first_open_time.int_value': '1652796000000', 'first_open_time.set_timestamp_micros': '1652792823456000'} {'User_dedication.string_value': '1', 'User_dedication.set_timestamp_micros': '1653137417352000'} {'User_activity.string_value': '1', 'User_activity.set_timestamp_micros': '1653136561498000'}
That looks almost like what I need, but the values are still dictionaries, and some rows are out of order, as in the first example.
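To make the target shape concrete, this is roughly the flat record I want the first row turned into (written by hand from the data above, not actual output):
expected_row = {
    'first_open_time.int_value': '1652796000000',
    'first_open_time.set_timestamp_micros': '1652792823456000',
    'User_dedication.string_value': '1',
    'User_dedication.set_timestamp_micros': '1653137417352000',
    # ... one pair of columns per key, for all seven keys in the row
}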
I tried to specify meta (as I understood it from the documentation):
df3 = pd.json_normalize(df3,
                        meta=[['first_open_time', 'int_value'], ['first_open_time', 'set_timestamp_micros'],
                              ['User_dedication', 'string_value'], ['User_dedication', 'set_timestamp_micros'],
                              ['User_activity', 'string_value'], ['User_activity', 'set_timestamp_micros'],
                              ['Minutes_in_app', 'string_value'], ['Minutes_in_app', 'set_timestamp_micros'],
                              ['ga_session_number', 'int_value'], ['ga_session_number', 'set_timestamp_micros'],
                              ['Paying_user', 'string_value'], ['Paying_user', 'set_timestamp_micros'],
                              ['ga_session_id', 'int_value'], ['ga_session_id', 'set_timestamp_micros']])
But it gives AttributeError: 'list' object has no attribute 'values'.
Maybe some of the problems come from the fact that the dictionaries are out of order in some rows, and some rows contain fewer dictionaries; that's how BigQuery writes the events.
Is there a way to solve this? Maybe sort the dictionaries in every row so that they all appear in the same order, or explicitly specify each column and which value should go into it?
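By sorting I mean something like this sketch (each element is a single-key dictionary, so each row could be ordered by that key):
row = [
    {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}},
    {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137417352000'}},
]
# sort the single-key dicts of one row by their (only) key
sorted_row = sorted(row, key=lambda d: next(iter(d)))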
CodePudding user response:
json_normalize can be applied to each element of your original data. However, you'll get a lot of NaN values if you don't flatten the returned df. You can proceed with a loop, concatenating all of the flattened rows:
data = [[{'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137417352000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137417352000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}],
[{'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137166688000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137166688000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}],
[{'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137288213000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137288213000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}],
[{'first_open_time': {'int_value': '1653195600000', 'set_timestamp_micros': '1653193960416000'}}],
[{'ga_session_number': {'int_value': '3', 'set_timestamp_micros': '1653165977727000'}}, {'User_activity': {'string_value': '1_10', 'set_timestamp_micros': '1653109414730000'}}, {'Minutes_in_app': {'string_value': '1_10', 'set_timestamp_micros': '1653109414735000'}}, {'first_open_time': {'int_value': '1653102000000', 'set_timestamp_micros': '1653098744032000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653109414734000'}}, {'ga_session_id': {'int_value': '1653165977', 'set_timestamp_micros': '1653165977727000'}}]
]
import pandas as pd

df = pd.DataFrame()
for d in data:
    # normalize one row: one sparse line per single-key dict, with flattened column names
    df_tmp = pd.json_normalize(d)
    # collapse the sparse block into a single row by dropping the NaN cells
    row = pd.DataFrame(df_tmp.to_numpy().flatten()).T.dropna(axis=1)
    row.columns = df_tmp.columns
    df = pd.concat([df, row])
print(df.reset_index(drop=True))
Output:
first_open_time.int_value first_open_time.set_timestamp_micros User_dedication.string_value ... Paying_user.set_timestamp_micros ga_session_id.int_value ga_session_id.set_timestamp_micros
0 1652796000000 1652792823456000 1 ... 1653136561498000 1653136552 1653136552555000
1 1652796000000 1652792823456000 1 ... 1653136561498000 1653136552 1653136552555000
2 1652796000000 1652792823456000 1 ... 1653136561498000 1653136552 1653136552555000
3 1653195600000 1653193960416000 NaN ... NaN NaN NaN
4 1653102000000 1653098744032000 1 ... NaN 1653165977 1653165977727000
[5 rows x 14 columns]
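If the lists live in a DataFrame column rather than a plain Python list, you can feed that column into the same loop; the DataFrame and column names below (df3, 'user_properties') are only placeholders for whatever yours are called:
# placeholder names: df3 is the original DataFrame, 'user_properties' is the
# column holding the lists of single-key dicts -- adapt both to your data
data = df3['user_properties'].tolist()
From there the loop runs unchanged. Collecting the flattened rows in a list and calling pd.concat once at the end would also be a bit faster than concatenating inside the loop, but the result is the same.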