Using pandas df.apply with a function that returns a dictionary

I have a JSON file that I'm reading into a pandas DataFrame. It looks like this:

{
  ...
  ...
"Info": [
            {
                "Type": "A",
                "Desc": "4848",
                ...
            },
            {
                "Type": "P",
                "Desc": "3763",
                ...
            },
            {
                "Type": "S",
                "Desc": "AUBERT",
                ...
            }
        ],
...
}

I have a function that loops over the "Info" field and, depending on "Type", stores information in a dictionary and returns that dictionary. I then want to create new columns in my df from the values stored in the dictionary using df.apply. Please see below:

def extract_info(self):
    def extract_data(df):
        dic = {'a': None, 'p': None, 's': None}
        for info in df['Info']:

            if info['Type'] == "A":
                dic['a'] = info['Desc']
            if info['Type'] == "P":
                dic['p'] = info['Desc']
            if info['Type'] == "S":
                dic['s'] = info['Desc']
        return dic

    self.df['A'] = self.df.apply(extract_data, axis=1)['a']
    self.df['P'] = self.df.apply(extract_data, axis=1)['p']
    self.df['S'] = self.df.apply(extract_data, axis=1)['s']

    return self

I have also tried doing:

self.df['A'] = self.df.apply(lambda x: extract_data(x['a']), axis=1)

But neither of these works for me. I have looked at other SO posts about using df.apply with a function that returns a dictionary, but did not find what I need for my case. Please help.

I could write 3 separate functions, like extract_A, extract_P and extract_S, each returning a single value, to make df.apply work, but that means running the for loop 3 times, once for each function. Any suggestions other than using a dictionary are welcome too. Thanks.
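A note on why the first attempt fails, as a minimal sketch: with axis=1 and no result_type, df.apply collects the returned dictionaries into a Series holding one dict per row, so indexing that result with ['a'] looks for a row labelled 'a' and raises a KeyError. The two-row frame below is made-up sample data built from the values in the question, not the real file, and the loop is a shortened stand-in for the original one.

```python
import pandas as pd

# Hypothetical sample frame: each "Info" cell holds a list of dicts.
df = pd.DataFrame({
    "Info": [
        [{"Type": "A", "Desc": "4848"}, {"Type": "P", "Desc": "3763"}],
        [{"Type": "S", "Desc": "AUBERT"}],
    ]
})

def extract_data(row):
    dic = {"a": None, "p": None, "s": None}
    for info in row["Info"]:
        key = info["Type"].lower()
        if key in dic:
            dic[key] = info["Desc"]
    return dic

# One dict per row, collected into a Series indexed by row label (0, 1).
result = df.apply(extract_data, axis=1)

# 'a' is a key inside each dict, not a row label of the Series,
# so this lookup raises KeyError.
try:
    result["a"]
    lookup_failed = False
except KeyError:
    lookup_failed = True
```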

CodePudding user response：

I'm not sure what you're getting at with your nested functions and your use of self. I think you can get what you need with a single function:

import pandas as pd

input_dict = {
    "col1": [1, 2, 3],
    "Info": [
                {
                    "Type": "A",
                    "Desc": "4848",
                },
                {
                    "Type": "P",
                    "Desc": "3763",
                },
                {
                    "Type": "S",
                    "Desc": "AUBERT",
                }
            ]
}

def extract_data(info_col, typ):
    # Return Desc when this entry's Type matches typ; otherwise return None
    if info_col['Type'] == typ:
        return info_col['Desc']
    return None

df = pd.DataFrame(input_dict)
df['A'] = df['Info'].apply(lambda x: extract_data(x, 'A'))
df['P'] = df['Info'].apply(lambda x: extract_data(x, 'P'))
df['S'] = df['Info'].apply(lambda x: extract_data(x, 'S'))

Output:

   col1                             Info     A     P       S
0     1    {'Type': 'A', 'Desc': '4848'}  4848  None    None
1     2    {'Type': 'P', 'Desc': '3763'}  None  3763    None
2     3  {'Type': 'S', 'Desc': 'AUBERT'}  None  None  AUBERT

Is this what you're looking for?
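The example above has a single dict per row, while the JSON in the question suggests each "Info" value is a whole list of entries. Under that assumption, one possible sketch is to turn each list into a Type-to-Desc mapping in a single pass and build the new columns from those mappings. The helper name to_record and the sample rows are illustrative only.

```python
import pandas as pd

# Hypothetical sample frame: each "Info" cell holds a list of dicts.
df = pd.DataFrame({
    "col1": [1, 2],
    "Info": [
        [
            {"Type": "A", "Desc": "4848"},
            {"Type": "P", "Desc": "3763"},
            {"Type": "S", "Desc": "AUBERT"},
        ],
        [{"Type": "P", "Desc": "3763"}],
    ],
})

def to_record(infos):
    # One pass over the list: map each Type to its Desc.
    return {info["Type"]: info["Desc"] for info in infos}

records = df["Info"].map(to_record).tolist()

# Build one column per Type; reindex guarantees A, P and S all exist
# even if a Type never occurs, and drops any other Types.
extracted = pd.DataFrame(records, index=df.index).reindex(columns=["A", "P", "S"])
df = df.join(extracted)
```

Types missing from a row come out as missing values in that row.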

CodePudding user response：

Instead of storing them in a dictionary, I can store the values in variables and return them from my extract_data function. Then I can assign these values to new columns in self.df directly by passing result_type="expand" to df.apply.

def extract_info(self):
    def extract_data(df):
        a = None
        p = None
        s = None
        for info in df['Info']:
            if info['Type'] == "A":
                a = info['Desc']
            if info['Type'] == "P":
                p = info['Desc']
            if info['Type'] == "S":
                s = info['Desc']

        return a, p, s

    self.df[['A', 'P', 'S']] = self.df.apply(extract_data, axis=1, result_type="expand")

    return self

Output:

       A     P       S
0    4848  3763    AUBERT
...
...
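For completeness, result_type="expand" also appears to work when the function keeps returning a dictionary: the dict keys then become the column names, so the loop still runs only once per row. The sketch below assumes a standalone frame rather than the self.df wrapper, and the sample rows are made up from the values in the question.

```python
import pandas as pd

# Hypothetical sample frame: each "Info" cell holds a list of dicts.
df = pd.DataFrame({
    "Info": [
        [
            {"Type": "A", "Desc": "4848"},
            {"Type": "P", "Desc": "3763"},
            {"Type": "S", "Desc": "AUBERT"},
        ],
        [{"Type": "P", "Desc": "3763"}],
    ]
})

def extract_data(row):
    # Same single loop as above, but the result stays a dictionary.
    dic = {"A": None, "P": None, "S": None}
    for info in row["Info"]:
        if info["Type"] in dic:
            dic[info["Type"]] = info["Desc"]
    return dic

# "expand" turns each returned dict into one row of a new DataFrame,
# with the dict keys as column names.
expanded = df.apply(extract_data, axis=1, result_type="expand")
df = df.join(expanded)
```

Joining on the index avoids relying on column order when assigning the new columns.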