Home > Software engineering >  Dataframe with Array Column to New Dataframe
Dataframe with Array Column to New Dataframe

Time:12-19

Actually i have an feather file that im loading to an Dataframe.

And this Dataframe have a column with languages in each row. Like the abone one:

student_id name created_at languages
1 Foo 2019-01-03 14:30:32.146000 00:00 [{'language_id': 1, 'name': 'English', 'optin_...
2 Bar 2019-01-03 14:30:32.146000 00:00 [{'language_id': 1, 'name': 'English', 'optin_...

My question is: How can i generate an new Dataframe only with student_id column and the rest of languages array?

For example the above one:

student_id language_id language_name optin_at
1 1 English 2019-01-03T14:30:32.148Z
2 1 English 2021-05-30T00:33:02.915Z
2 2 Portuguese 2022-03-07T07:42:07.082Z

// EDIT:

Exported Dataframe as JSON (orient='records') for testing purposes:

[{"student_id":"1","name":"Foo","created_at":"2019-01-03T14:30:32.146Z","languages":[{"language_id":1,"name":"English","optin_at":"2019-01-03T14:30:32.148Z"}]},{"student_id":"2","name":"Bar","created_at":"2019-01-03T14:30:32.146Z","languages":[{"language_id":1,"name":"English","optin_at":"2021-05-30T00:33:02.915Z"},{"language_id":2,"name":"Portuguese","optin_at":"2022-03-07T07:42:07.082Z"}]}]

CodePudding user response:

You can use explode, then convert to columns with json_normalize:

out = (df
  .explode('languages', ignore_index=True)
  .pipe(lambda d: d.join(pd.json_normalize(d.pop('languages'))
                           .rename(columns={'name': 'language_name'})
                        ))
)

Output:

  student_id name                created_at  language_id language_name                  optin_at
0          1  Foo  2019-01-03T14:30:32.146Z            1       English  2019-01-03T14:30:32.148Z
1          2  Bar  2019-01-03T14:30:32.146Z            1       English  2021-05-30T00:33:02.915Z
2          2  Bar  2019-01-03T14:30:32.146Z            2    Portuguese  2022-03-07T07:42:07.082Z
  • Related