I have a dataframe with a column containing lists of dictionaries, which looks like this:
[{'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137417352000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137417352000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}]
[{'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137166688000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137166688000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}]
[{'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137288213000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137288213000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}]
[{'first_open_time': {'int_value': '1653195600000', 'set_timestamp_micros': '1653193960416000'}}]
[{'ga_session_number': {'int_value': '3', 'set_timestamp_micros': '1653165977727000'}}, {'User_activity': {'string_value': '1_10', 'set_timestamp_micros': '1653109414730000'}}, {'Minutes_in_app': {'string_value': '1_10', 'set_timestamp_micros': '1653109414735000'}}, {'first_open_time': {'int_value': '1653102000000', 'set_timestamp_micros': '1653098744032000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653109414734000'}}, {'ga_session_id': {'int_value': '1653165977', 'set_timestamp_micros': '1653165977727000'}}]
I expected json_normalize to put the data into columns like
first_open_time.int_value, first_open_time.set_timestamp_micros, User_dedication.string_value, User_dedication.set_timestamp_micros, etc.
Instead, it just split it into 7 columns containing dictionaries:
{'first_open_time.int_value': '1652796000000', 'first_open_time.set_timestamp_micros': '1652792823456000'} {'User_dedication.string_value': '1', 'User_dedication.set_timestamp_micros': '1653137417352000'} {'User_activity.string_value': '1', 'User_activity.set_timestamp_micros': '1653136561498000'}
That looks almost like what I need, but the values are still dictionaries, and some rows are out of order, as in the first example.
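To make the target shape concrete, this is roughly the flat record I want the first row turned into (written by hand from the data above, not actual output):
expected_row = {
    'first_open_time.int_value': '1652796000000',
    'first_open_time.set_timestamp_micros': '1652792823456000',
    'User_dedication.string_value': '1',
    'User_dedication.set_timestamp_micros': '1653137417352000',
    # ... one pair of columns per key, for all seven keys in the row
}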
I tried to specify meta (as I understood it from the documentation):
df3 = pd.json_normalize(df3,
                        meta=[['first_open_time', 'int_value'], ['first_open_time', 'set_timestamp_micros'],
                              ['User_dedication', 'string_value'], ['User_dedication', 'set_timestamp_micros'],
                              ['User_activity', 'string_value'], ['User_activity', 'set_timestamp_micros'],
                              ['Minutes_in_app', 'string_value'], ['Minutes_in_app', 'set_timestamp_micros'],
                              ['ga_session_number', 'int_value'], ['ga_session_number', 'set_timestamp_micros'],
                              ['Paying_user', 'string_value'], ['Paying_user', 'set_timestamp_micros'],
                              ['ga_session_id', 'int_value'], ['ga_session_id', 'set_timestamp_micros']])
But it gives AttributeError: 'list' object has no attribute 'values'.
Maybe some of the problems come from the fact that the dictionaries are out of order in some rows, and some rows contain fewer dictionaries; that's how BigQuery writes the events.
Is there a way to solve this? Maybe sort the dictionaries in every row so that they all appear in the same order, or explicitly specify each column and which value should go into it?
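By sorting I mean something like this sketch (each element is a single-key dictionary, so each row could be ordered by that key):
row = [
    {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}},
    {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137417352000'}},
]
# sort the single-key dicts of one row by their (only) key
sorted_row = sorted(row, key=lambda d: next(iter(d)))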
CodePudding user response:
json_normalize can be applied to each element of your original data. However, you'll get a lot of NaN values if you don't flatten the returned df. You can proceed with a loop, concatenating all of the flattened rows:
data = [[{'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137417352000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137417352000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}],
[{'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137166688000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137166688000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}],
[{'Minutes_in_app': {'string_value': '60_300', 'set_timestamp_micros': '1653137288213000'}}, {'Paying_user': {'string_value': '0', 'set_timestamp_micros': '1653136561498000'}}, {'first_open_time': {'int_value': '1652796000000', 'set_timestamp_micros': '1652792823456000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653137288213000'}}, {'User_activity': {'string_value': '1', 'set_timestamp_micros': '1653136561498000'}}, {'ga_session_number': {'int_value': '10', 'set_timestamp_micros': '1653136552555000'}}, {'ga_session_id': {'int_value': '1653136552', 'set_timestamp_micros': '1653136552555000'}}],
[{'first_open_time': {'int_value': '1653195600000', 'set_timestamp_micros': '1653193960416000'}}],
[{'ga_session_number': {'int_value': '3', 'set_timestamp_micros': '1653165977727000'}}, {'User_activity': {'string_value': '1_10', 'set_timestamp_micros': '1653109414730000'}}, {'Minutes_in_app': {'string_value': '1_10', 'set_timestamp_micros': '1653109414735000'}}, {'first_open_time': {'int_value': '1653102000000', 'set_timestamp_micros': '1653098744032000'}}, {'User_dedication': {'string_value': '1', 'set_timestamp_micros': '1653109414734000'}}, {'ga_session_id': {'int_value': '1653165977', 'set_timestamp_micros': '1653165977727000'}}]
]
import pandas as pd

df = pd.DataFrame()
for d in data:
    # normalize one row: one sparse line per single-key dict, with flattened column names
    df_tmp = pd.json_normalize(d)
    # collapse the sparse block into a single row by dropping the NaN cells
    row = pd.DataFrame(df_tmp.to_numpy().flatten()).T.dropna(axis=1)
    row.columns = df_tmp.columns
    df = pd.concat([df, row])
print(df.reset_index(drop=True))
Output:
first_open_time.int_value first_open_time.set_timestamp_micros User_dedication.string_value ... Paying_user.set_timestamp_micros ga_session_id.int_value ga_session_id.set_timestamp_micros
0 1652796000000 1652792823456000 1 ... 1653136561498000 1653136552 1653136552555000
1 1652796000000 1652792823456000 1 ... 1653136561498000 1653136552 1653136552555000
2 1652796000000 1652792823456000 1 ... 1653136561498000 1653136552 1653136552555000
3 1653195600000 1653193960416000 NaN ... NaN NaN NaN
4 1653102000000 1653098744032000 1 ... NaN 1653165977 1653165977727000
[5 rows x 14 columns]
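If the lists live in a DataFrame column rather than a plain Python list, you can feed that column into the same loop; the DataFrame and column names below (df3, 'user_properties') are only placeholders for whatever yours are called:
# placeholder names: df3 is the original DataFrame, 'user_properties' is the
# column holding the lists of single-key dicts -- adapt both to your data
data = df3['user_properties'].tolist()
From there the loop runs unchanged. Collecting the flattened rows in a list and calling pd.concat once at the end would also be a bit faster than concatenating inside the loop, but the result is the same.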