Home > Mobile >  Extracting certain keys from a JSON dataframe in pandas
Extracting certain keys from a JSON dataframe in pandas

Time:07-16

So I have this JSON

{
  "data": [
    {
      "id": 2674,
      "timestamp": "2022-07-01",
      "vendor_name": "test",
      "npi_id": "1234567890",
      "trigger_description": "test",
      "echo_order": 0,
      "duration": "0",
      "image_url": "https://....",
      "name": "Banner Ad Name 1",
      "triggers": ["test "]
    },
    {
   ...

Which I am reading like so df = pd.read_json("./data/data.json", typ="frame")

The output is ...

> python main.py
                                                data
0  {'id': 2674, 'timestamp': '2022-07-01', 'vendo...
1  {'id': 2675, 'timestamp': '2022-07-01', 'vendo...
2  {'id': 6789, 'timestamp': '2022-07-01', 'vendo...
3  {'id': 2321, 'timestamp': '2022-07-01', 'vendo...
4  {'id': 5678, 'timestamp': '2022-07-01', 'vendo...
5  {'id': 1234, 'timestamp': '2022-07-01', 'vendo...

I am looking to extract certain cols from the output. But for the life of me I can't figure out how. The data is within the key data which is a list.

When you do df.columns on the dataframe you only get Index(['data'], dtype='object')

How can i get inside data and extract what i need

CodePudding user response:

So I have managed to do it...but I am not sure if it's the true Pandas way od doing it.

df = pd.read_json("./data/data.json", typ="frame")

    attribute_values = []

    for item in df["data"]:

        data = {
            "id": item["id"],
            "timestamp": item["timestamp"],
            "npi_id": item["npi_id"],
        }

        row = [data["id"], data["timestamp"], data["npi_id"]]
        attribute_values.append(row)
df = pd.read_json("./data/data.json", typ="frame")

Gives me the following:

data
0  {'id': 2674, 'timestamp': '2022-07-01', 'vendo...
1  {'id': 2675, 'timestamp': '2022-07-01', 'vendo...
2  {'id': 6789, 'timestamp': '2022-07-01', 'vendo...
3  {'id': 2321, 'timestamp': '2022-07-01', 'vendo...
4  {'id': 5678, 'timestamp': '2022-07-01', 'vendo...
5  {'id': 1234, 'timestamp': '2022-07-01', 'vendo...

Then using

for item in df["data"]:

        data = {
            "id": item["id"],
            "timestamp": item["timestamp"],
            "npi_id": item["npi_id"],
        }

        row = [data["id"], data["timestamp"], data["npi_id"]]
        attribute_values.append(row)

gives me

[[2674, '2022-07-01', '1003883562'], [2675, '2022-07-01', '1043283849'], [6789, '2022-07-01', '1043283849'], [2321, '2022-07-01', '1043283849'], [5678, '2022-07-01', '1043283849'], [1234, '2022-07-01', '1043283849']]

Which is just id timestamp npi_id from the dataset.

CodePudding user response:

You can use .str

out = df['data'].str['id']
print(out)

0    2674
Name: data, dtype: int64

Or use pd.json_normalize

import json
with open('data.json', 'r', encoding='utf-8') as f:
    data = json.loads(f.read())
df = pd.json_normalize(data['data'])
print(df)

     id   timestamp vendor_name      npi_id trigger_description  echo_order duration     image_url              name triggers
0  2674  2022-07-01        test  1234567890                test           0        0  https://....  Banner Ad Name 1  [test ]
  • Related