Home > Back-end >  Cannot insert extracted json into dataframe column
Cannot insert extracted json into dataframe column

Time:03-04

I have a DataFrame that has a json array as the values of one column. I want to select one of those as the values of the column and get rid of the rest. I have gotten the desired values into a series but I can't figure out how to join them back to the DataFrame in place of the existing column:

import json
from pandas.io.json import json_normalize

df = pd.DataFrame({
    'bank_account': [101, 102, 201, 301],
    'data': [
        '{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"}',
        '{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0},  "fees": {"monthly": 0}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Bob"}',        
        '{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Carol"}'        
    ]},
    index = ['Alice', 'Alice', 'Bob', 'Carol']
)

lst = []
for d in df['data']:
    d = pd.read_json(d, lines=True)['uid'].values[0]
    lst.append(d)
s = pd.DataFrame(lst)
df['data'] = s
print(s)
print(df)  

returns

     0
0  100
1  100
2  200
3  300
       bank_account  data
Alice           101   NaN
Alice           102   NaN
Bob             201   NaN
Carol           301   NaN

currently and I don't know why the data column shows all nan values. Any help appreciated.

Updated Issue: Some of the rows have lists of json arrays instead of just one. Here is what I have so far:

import json
from pandas.io.json import json_normalize

df = pd.DataFrame({
    'bank_account': [101, 102, 201, 301],
    'data': [
        '[{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"},{"uid": 150, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "jer"}]',
        '{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0},  "fees": {"monthly": 0}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Bob"}',        
        '{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Carol"}'        
    ]},
    index = ['Alice', 'Alice', 'Bob', 'Carol']
)

# df["data"] = df["data"].apply(lambda x: pd.read_json(x, lines=True)["uid"][0])

df["data"] = df["data"].apply(lambda array : (",".join(list(map(lambda x : pd.read_json(x, lines=True)["uid"][0], array),(df['data'])))))
print(df)

CodePudding user response:

This works for me:

df = pd.DataFrame({
    'bank_account': [101, 102, 201, 301],
    'data': [
        '{"uid": 100, "account_type": 1, "account_data": {"currency": {"current": 1000, "minimum": -500}, "fees": {"monthly": 13.5}}, "user_name": "Alice"}',
        '{"uid": 100, "account_type": 2, "account_data": {"currency": {"current": 2000, "minimum": 0},  "fees": {"monthly": 0}}, "user_name": "Alice"}',
        '{"uid": 200, "account_type": 1, "account_data": {"currency": {"current": 3000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Bob"}',        
        '{"uid": 300, "account_type": 1, "account_data": {"currency": {"current": 4000, "minimum": 0},  "fees": {"monthly": 13.5}}, "user_name": "Carol"}'        
    ]},
    index = ['Alice', 'Alice', 'Bob', 'Carol']
)

df["data"] = df["data"].apply(lambda x: pd.read_json(x, lines=True)["uid"][0])

Your code does not work because df and s have different indices. If you want to fix your code set df['data'] = s[0].values (instead of df['data'] = s) before your two print statements.

CodePudding user response:

As @rachwa notes, the issue is that the indexes don't match because the index of s is numbers while the index of df is names. If you assign lst directly instead of casting it to a DataFrame, you will get the desired outcome, i.e.

df['data'] = lst

would work as expected.

You could also use json.loads instead of read_json (it should be faster):

import json
df['data'] = [json.loads(d)['uid'] for d in df['data']]

Output:

       bank_account  data
Alice           101   100
Alice           102   100
Bob             201   200
Carol           301   300
  • Related