Is there a way to normalize a json pulled straight from an api-CodePudding

Here is the type of json file that I am working with

{
  "header": {
    "gtfsRealtimeVersion": "1.0",
    "incrementality": "FULL_DATASET",
    "timestamp": "1656447045"
  },
  "entity": [{
    "id": "RTVP:T:16763243",
    "isDeleted": false,
    "vehicle": {
      "trip": {
        "tripId": "16763243",
        "scheduleRelationship": "SCHEDULED"
      },
      "position": {
        "latitude": 33.497833,
        "longitude": -112.07365,
        "bearing": 0.0,
        "odometer": 16512.0,
        "speed": 1.78816
      },
      "currentStopSequence": 16,
      "currentStatus": "INCOMING_AT",
      "timestamp": "1656447033",
      "stopId": "2792",
      "vehicle": {
        "id": "5074"
      }
    }
  }, {
    "id": "RTVP:T:16763242",
    "isDeleted": false,
    "vehicle": {
      "trip": {
        "tripId": "16763242",
        "scheduleRelationship": "SCHEDULED"
      },
      "position": {
        "latitude": 33.562374,
        "longitude": -112.07392,
        "bearing": 359.0,
        "odometer": 40367.0,
        "speed": 15.6464
      },
      "currentStopSequence": 36,
      "currentStatus": "INCOMING_AT",
      "timestamp": "1656447024",
      "stopId": "2794",
      "vehicle": {
        "id": "5251"
      }
    }
  },

In my code, I am taking in the json as a string. But when I try normalize json string to put into data frame

import pandas as pd
import json
import requests


base_URL = requests.get('https://app.mecatran.com/utw/ws/gtfsfeed/vehicles/valleymetro?apiKey=4f22263f69671d7f49726c3011333e527368211f&asJson=true')
packages_json = base_URL.json()
packages_str = json.dumps(packages_json, indent=1)

df = pd.json_normalize(packages_str)

I get this error, I am definitely making some rookie error, but how exactly am I using this wrong? Are there additional arguments that may need in that?

---------------------------------------------------------------------------
NotImplementedError                       Traceback (most recent call last)
<ipython-input-33-aa23f9157eac> in <module>()
      8 packages_str = json.dumps(packages_json, indent=1)
      9 
---> 10 df = pd.json_normalize(packages_str)

/usr/local/lib/python3.7/dist-packages/pandas/io/json/_normalize.py in _json_normalize(data, record_path, meta, meta_prefix, record_prefix, errors, sep, max_level)
    421         data = list(data)
    422     else:
--> 423         raise NotImplementedError
    424 
    425     # check to see if a simple recursive function is possible to

NotImplementedError:

When I had the json format within my code without the header portion referenced as an object, the pd.json_normalize(package_str) does work, why would that be, and what additional things would I need to do?

CodePudding user response：

The issue is, that pandas.json_normalize expects either a dictionary or a list of dictionaries but json.dumps returns a string.

It should work if you skip the json.dumps and directly input the json to the normalizer, like this:

import pandas as pd
import json
import requests


base_URL = requests.get('https://app.mecatran.com/utw/ws/gtfsfeed/vehicles/valleymetro?apiKey=4f22263f69671d7f49726c3011333e527368211f&asJson=true')
packages_json = base_URL.json()

df = pd.json_normalize(packages_json)

If you take a look at the corresponding source-code of pandas you can see for yourself:

if isinstance(data, list) and not data:
    return DataFrame()
elif isinstance(data, dict):
    # A bit of a hackjob
    data = [data]
elif isinstance(data, abc.Iterable) and not isinstance(data, str):
    # GH35923 Fix pd.json_normalize to not skip the first element of a
    # generator input
    data = list(data)
else:
    raise NotImplementedError

You should find this code at the path that is shown in the stacktrace, with the error raised on line 423: /usr/local/lib/python3.7/dist-packages/pandas/io/json/_normalize.py

I would advise you to use a code-linter or an IDE that has one included (like PyCharm for example) as this is the type of error that doesn't happen if you have one.

CodePudding user response：

I m not sure where is the problem, but if you are desperate, you can always make text function that will data-mine that Json.

Yes, it will be quite tiring, but with -10 variables you need to mine, for each row, you will be done in -60 minutes no problem.

Something like this:

def MineJson(text, target): #target is for example "id"
    #print(text)
    findword = text.find(target)
    g=findword len(target) 5 #should not include the first "
    new_text= text[g:] #output should be starting with RTVP:T...
    return new_text

def WhatsAfter(text): #should return new text and RTVP:T:16763243
    #print(text)
    toFind='"'
    findEnd = text.find(toFind)
    g=findEnd
    value=text[:g]
    new_text= text[g:]
    return new_text,value

I wrote it without testing, so maybe there will be some mistakes.