Home > Mobile >  JSON data parsing into pandas dataframe
JSON data parsing into pandas dataframe

Time:09-12

I have on specific issue that I cannot find a solution for, and I have looked at many of the articles here and other locations. I have successfully parsed a set of json files into a dataframe for the purpose of subsequent data analysis and cleanup tasks- the problem I am running into is that the parsing did not work for deeply nested json elements.

Here is the beginning of what each file generally looks like:

{
  "job_id": "0a0440fc-a651-4738-933e-51b5d1654831",
  "provider": "provider",
  "provider_version": "2.0.7",
  "timestamp": "2022-08-18T14:03:43.054532",
  "language": "en-US",
  "transcription": {
    "confidence": {
      "overall": 0.64,
      "channels": [
        0.64
      ]
    },
    "segments": [
      {
        "text": "welcome to provider to continue in english please press one but i contin into mercado",
        "formatted": "Welcome to provider, to continue in English, please press one but I contin into mercado.",
        "confidence": 0.4252,
        "channel": 0,
        "raw_data": [
          {
            "word": "welcome",
            "confidence": 0.4252,
            "start_ms": 400,
            "end_ms": 1120
           } # and each word receives this set of data including the "word", "confidence", "start_ms" (of the word) and "end_ms" of the word
          ],
        "start_ms": 400, # this is the start time of the full 
       utterance
        "end_ms": 6920 # this is the end time of the full utterance
      }, # and this is where the new json element starts  with "text" and "formatted" 

For reference, we have a few root elements that I do not care about, I only care about the root element "job_id". I also only want the nested elements with "transcription":

transcriptions-->segments-->text
transcriptions-->segments-->formatted

I do NOT want the sisters of "text" and "formatted" which include "confidence", "channel" and "raw_data" (this one being another list of dictionaries).

My code for parsing each file is below:

json_dir = 'provider_care_json'
json_pattern = os.path.join(json_dir, '*.json')
file_list = glob.glob(json_pattern)
dfs = []
for file in file_list:
    with open(file) as f:
        json_data = pd.json_normalize(json.loads(f.read()))
        json_data['job_id'] = file.rsplit("/", 1)[-1] 
    dfs.append(json_data)
vsdf = pd.concat(dfs)

So far so good - I end up with this structure:

vsdf.info()
<class 'pandas.core.frame.DataFrame'>
Int64Index: 50 entries, 0 to 0
Data columns (total 8 columns):
 #   Column                             Non-Null Count  Dtype  
---  ------                             --------------  -----  
 0   job_id                             50 non-null     object 
 1   provider                           50 non-null     object 
 2   provider_version                   50 non-null     object 
 3   timestamp                          50 non-null     object 
 4   language                           50 non-null     object 
 5   transcription.confidence.overall   50 non-null     float64
 6   transcription.confidence.channels  50 non-null     object 
 7   transcription.segments             50 non-null     object 
dtypes: float64(1), object(7)
memory usage: 3.5  KB

I drop the columns I do not need and end up with a dataframe that only has three columns, including my job_id (which is my document number):

 #   Column                  Non-Null Count  Dtype 
---  ------                  --------------  ----- 
 0   job_id                  100 non-null    object
 1   timestamp               100 non-null    object
 2   transcription.segments  100 non-null    object
dtypes: object(3)

The issue now is that the "transcription.segments" still has structure where it is a list with multiple embedded dictionaries, including the "confidence", "channel and "raw_data" dictionary sets.

Ultimately, aside from the "job_id" (for the document reference), the "text" and "formatted" elements are really the only two elements that I care about.

I have tried parsing the json individually for those two elements, but I lose the connection to the job_id. What would be the best way to further split the transcription.segments into something like transcription.segments.text and transcription.segments.formatted columns?

CodePudding user response:

I believe the following script what you are looking for,

import os 
import glob, json 
import pandas as pd 

json_dir = 'provider_care_json'
json_pattern = os.path.join(json_dir, '*.json')
file_list = glob.glob(json_pattern)
dfs = []
for file in file_list:
    with open(file) as f:
        json_data = json.loads(f.read()) 
        job_id = json_data["job_id"]
        texts = [] 
        for dic in json_data["transcription"]["segments"]:
            texts.append([dic.get("text"), dic.get("formatted")])

        # json_data['job_id'] = file.rsplit("/", 1)[-1] 
    # print(f"texts--->{texts}")
    # print(f"job_id--->{job_id}")
    df = pd.DataFrame(texts, columns = ["texts", "formatted"])
    df["job_id"] = job_id
    dfs.append(df)
vsdf = pd.concat(dfs)

print(vsdf)

Please check and let me know if you have any issues.

  • Related