How do I create lists from a Pandas column containing lists of json data?

This is the first question I'm ever asking on StackOverflow, so please don't tear me to shreds too harshly.

I have a Pandas DataFrame containing a "fieldsOfInterest" column with JSON data, similar to this (possibly not an accurate reproduction, will be afk for a few hours and then update this - wish you could hide questions here):

In: 
df = pd.DataFrame([
        ["1", [{"code":"FOI_AGRICULTURE_FOOD|FOI_AF_FOOD_INDUSTRY"}, {"code":"FOI_AGRICULTURE_FOOD|FOI_AF_FORESTRY"}]],
        ["2", [{"code":"FOI_AGRICULTURE_FOOD|FOI_AF_SOMETHING_ELSE"}, {"code":"FOI_AGRICULTURE_FOOD|FOI_AF_FORESTRY"}]]
], columns = ["id", "fieldsOfInterest"])
df
Out:
  id                                   fieldsOfInterest
0  1  [{'code': 'FOI_AGRICULTURE_FOOD|FOI_AF_FOOD_IN...
1  2  [{'code': 'FOI_AGRICULTURE_FOOD|FOI_AF_SOMETHI...

What I want to do is add a new column that, for each row, contains a list of all the "code" values from that row's entry in the old column. For the first row above, that would be:

 ['FOI_AGRICULTURE_FOOD|FOI_AF_FOOD_INDUSTRY', 
 'FOI_AGRICULTURE_FOOD|FOI_AF_FORESTRY']

I have a solution that works for a single row:

foi_normalized = pd.json_normalize(df["fieldsOfInterest"].iloc[1])
foi_codes = foi_normalized["code"]
foi_list = foi_codes.tolist()
print(foi_list)

But when I try a similar approach for the whole column...

def interest_reader(foi_old):
    foi_normalized = pd.json_normalize(foi_old)
    foi_codes = foi_normalized["code"]
    foi_list = foi_codes.tolist()
    return foi_list
df["fieldsOfInterest_new"] = df["fieldsOfInterest"].apply(interest_reader)

I get the error below:

File "...", line 15, in <module>
df["fieldsOfInterest_new"] =  df["fieldsOfInterest"].apply(interest_reader)
                             ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...", line 4771, in apply
return SeriesApply(self, func, convert_dtype, args, kwargs).apply()
       ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...", line 1105, in apply
return self.apply_standard()
       ^^^^^^^^^^^^^^^^^^^^^
File "...", line 1156, in apply_standard
mapped = lib.map_infer(
         ^^^^^^^^^^^^^^
File "pandas\_libs\lib.pyx", line 2918, in pandas._libs.lib.map_infer
File "...", line 11, in interest_reader
foi_normalized = pd.json_normalize(foi_old)
                 ^^^^^^^^^^^^^^^^^^^^^^^^^^
File "...", line 446, in _json_normalize
raise NotImplementedError
NotImplementedError

I have tried several other approaches, but nothing has worked. I'm now thinking about treating the values simply as dictionaries and, for each entry, looping through them to collect the value for the "code" key. I'd be grateful for any pointers, thank you!
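
The dictionary idea from the last paragraph could look roughly like the sketch below. It assumes every entry in the column is a list of dicts that each have a "code" key; extract_codes is just an illustrative name, and missing values are not handled here.

```python
import pandas as pd

df = pd.DataFrame([
    ["1", [{"code": "FOI_AGRICULTURE_FOOD|FOI_AF_FOOD_INDUSTRY"}, {"code": "FOI_AGRICULTURE_FOOD|FOI_AF_FORESTRY"}]],
    ["2", [{"code": "FOI_AGRICULTURE_FOOD|FOI_AF_SOMETHING_ELSE"}, {"code": "FOI_AGRICULTURE_FOOD|FOI_AF_FORESTRY"}]],
], columns=["id", "fieldsOfInterest"])

def extract_codes(entries):
    # hypothetical helper: read the "code" value out of each dict in one cell
    return [entry["code"] for entry in entries]

# apply the helper to every cell of the column
df["fieldsOfInterest_new"] = df["fieldsOfInterest"].apply(extract_codes)
print(df["fieldsOfInterest_new"].iloc[0])
```

Since each cell is already a Python list of dicts, this skips json_normalize entirely and never builds an intermediate DataFrame per row.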

CodePudding user response：

You can first turn each element (the dicts) of the lists into its own row using explode. The existing id column is repeated on each new row, so every row still records which row of the original dataframe it came from. Then you can extract the values from the dicts using json_normalize. Finally, you can collect all elements that belong to the same original row into a list using groupby on the id column.

import pandas as pd

# setup your sample data
df = pd.DataFrame([
        ["1", [{"code":"FOI_AGRICULTURE_FOOD|FOI_AF_FOOD_INDUSTRY"}, {"code":"FOI_AGRICULTURE_FOOD|FOI_AF_FORESTRY"}]],
        ["2", [{"code":"FOI_AGRICULTURE_FOOD|FOI_AF_SOMETHING_ELSE"}, {"code":"FOI_AGRICULTURE_FOOD|FOI_AF_FORESTRY"}]]
], columns = ["id", "fieldsOfInterest"])

# transform each element (the dicts) into a separate row
result = df.explode('fieldsOfInterest')

# extract the values from the dicts
# (explode repeats the original index, so json_normalize gets a plain list and
# the values are assigned by position to avoid index misalignment)
result['code'] = pd.json_normalize(result['fieldsOfInterest'].tolist())['code'].to_numpy()

# collect the elements in a list
result.groupby('id')['code'].agg(list)

This results in the series

id
1    [FOI_AGRICULTURE_FOOD|FOI_AF_FOOD_INDUSTRY, FO...
2    [FOI_AGRICULTURE_FOOD|FOI_AF_SOMETHING_ELSE, F...

of which the first element is

['FOI_AGRICULTURE_FOOD|FOI_AF_FOOD_INDUSTRY', 'FOI_AGRICULTURE_FOOD|FOI_AF_FORESTRY']

using result.groupby('id')['code'].agg(list).iloc[0].
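
Since the question asks for a new column rather than a separate series, here is a rough sketch of one way to get there. It is a variation on the steps above, not the exact same code: it groups on the original row index instead of id (so it does not rely on id being unique), and it assumes every cell is a list of dicts with a "code" key.

```python
import pandas as pd

df = pd.DataFrame([
    ["1", [{"code": "FOI_AGRICULTURE_FOOD|FOI_AF_FOOD_INDUSTRY"}, {"code": "FOI_AGRICULTURE_FOOD|FOI_AF_FORESTRY"}]],
    ["2", [{"code": "FOI_AGRICULTURE_FOOD|FOI_AF_SOMETHING_ELSE"}, {"code": "FOI_AGRICULTURE_FOOD|FOI_AF_FORESTRY"}]],
], columns=["id", "fieldsOfInterest"])

codes = (
    df["fieldsOfInterest"]
    .explode()                            # one dict per row, original index repeated
    .map(lambda entry: entry["code"])     # pull the "code" value out of each dict
    .groupby(level=0)                     # regroup by the original row index
    .agg(list)                            # collect the codes of each row into a list
)

# the grouped series shares the original index, so plain assignment lines up
df["fieldsOfInterest_new"] = codes
print(df["fieldsOfInterest_new"].iloc[0])
```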

CodePudding user response：

I was being an idiot. My actual dataframe had a lot of NaNs:

In: 
df = pd.DataFrame([
        ["1", [{"code":"FOI_AGRICULTURE_FOOD|FOI_AF_FOOD_INDUSTRY"}, {"code":"FOI_AGRICULTURE_FOOD|FOI_AF_FORESTRY"}]],
        ["2", float("nan")]
], columns = ["id", "fieldsOfInterest"])
df
Out:
  id                                   fieldsOfInterest
0  1  [{'code': 'FOI_AGRICULTURE_FOOD|FOI_AF_FOOD_IN...
1  2                                                NaN
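
This fits the traceback in the question: json_normalize is being handed a bare float NaN instead of a list of dicts. A minimal sketch to reproduce that, assuming a pandas version that behaves like the one in the traceback (other versions may differ):

```python
import pandas as pd

missing = float("nan")  # what a NaN cell looks like when apply passes it in

try:
    pd.json_normalize(missing)
except NotImplementedError:
    # a float is neither a dict nor an iterable of dicts
    raised = True
else:
    raised = False

print(raised)
```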

Once I added a check for those into my function, it worked swimmingly:

def interest_reader(foi_old):
    # missing entries arrive as float NaN, which json_normalize cannot handle
    if str(foi_old) == "nan":
        return "nan"
    # otherwise the entry is a list of dicts: flatten it and pull out the codes
    foi_normalized = pd.json_normalize(foi_old)
    foi_codes = foi_normalized["code"]
    foi_list = foi_codes.tolist()
    return foi_list
df["fieldsOfInterest_new"] = df["fieldsOfInterest"].apply(interest_reader)
print(df["fieldsOfInterest_new"])
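
A possible variation, in case the string "nan" in the new column gets in the way later: drop the missing rows before applying and let index alignment put real NaN back. This is only a sketch, and it assumes every non-missing cell is a list of dicts with a "code" key.

```python
import pandas as pd

df = pd.DataFrame([
    ["1", [{"code": "FOI_AGRICULTURE_FOOD|FOI_AF_FOOD_INDUSTRY"}, {"code": "FOI_AGRICULTURE_FOOD|FOI_AF_FORESTRY"}]],
    ["2", float("nan")],
], columns=["id", "fieldsOfInterest"])

# only rows that actually hold a list are processed
codes = df["fieldsOfInterest"].dropna().apply(
    lambda entries: [entry["code"] for entry in entries]
)

# assignment aligns on the index, so skipped rows end up as NaN
df["fieldsOfInterest_new"] = codes
print(df["fieldsOfInterest_new"])
```

That way missing values stay missing, and checks like isna() or dropna() keep working on the new column.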