Home > OS >  Use a Generator To Convert JSON and TSV Data into a Dictionary
Use a Generator To Convert JSON and TSV Data into a Dictionary

Time:10-07

We need to get the data from the file file.data into a DataFrame. The problem is that the data on each line of the file is in either a JSON or Tab-separated values (TSV) format.

The JSON lines are in the correct format, they just need to be converted to native Python dicts.

The TSV lines need to be converted in to dicts that match the JSON format.

Here is a sample of the file:

{"company": "Watkins Inc", "catch_phrase": "Integrated radical installation", "phone": "7712422719", "timezone": "America/New_York", "client_count": 442}
Bennett and Sons    Persistent contextually-based standardization   018.666.0600    America/Los_Angeles 492
Ferguson-Garner Multi-layered tertiary neural-net   (086)401-8955x53502 America/Los_Angeles 528
{"company": "Pennington PLC", "catch_phrase": "Future-proofed tertiary frame", "phone": " 1-312-296-2956x137", "timezone": "America/Indiana/Indianapolis", "client_count": 638}

Write a generator that takes an iterator as an argument. It should parse the values in the iterator and yield each value in the correct format: A dict with the keys:

  • company
  • catch_phrase
  • phone
  • timezone
  • client_count

My code so far:

df = pd.read_csv("file.data", sep="\t")
    for col in df[["company"]]:
        obj = df[col]
        for item in obj.values:
            json_obj = json.loads(item)

CodePudding user response:

Don't use pandas to read the entire file. Instead, read the file line by line, and create a list of dicts. Then use pandas to get your dataframe.

dict_data = []
tsv_data = []
with open('file.data', 'r') as f:
    for line in f:
        line = line.strip()
        try:
            d = json.loads(line)
            dict_data.append(d)
        except json.JSONDecodeError: # JSONDecodeError when you try to loads() a TSV line
            tsv_data.append(line.split("\t")) # Split the line by tabs, append to the tsv list

After this, we have

dict_data = [{'company': 'Watkins Inc',
  'catch_phrase': 'Integrated radical installation',
  'phone': '7712422719',
  'timezone': 'America/New_York',
  'client_count': 442},
 {'company': 'Pennington PLC',
  'catch_phrase': 'Future-proofed tertiary frame',
  'phone': ' 1-312-296-2956x137',
  'timezone': 'America/Indiana/Indianapolis',
  'client_count': 638}]

tsv_data = [['Bennett and Sons',
  'Persistent contextually-based standardization',
  '018.666.0600',
  'America/Los_Angeles',
  '492'],
 ['Ferguson-Garner',
  'Multi-layered tertiary neural-net',
  '(086)401-8955x53502',
  'America/Los_Angeles',
  '528']]

Notice that everything in tsv_data is a string, so we're going to have to fix that at some point.

Now, create a dataframe using the two lists dict_data and tsv_data, change the data type for the tsv dataframe, and join the two.

data_cols = list(dict_data[0].keys())
df_dict = pd.DataFrame(dict_data)
df_tsv = pd.DataFrame(tsv_data, columns=data_cols)


for column in df_tsv:
    df_tsv[column] = df_tsv[column].astype(df_dict[column].dtype)

df_all = df_dict.append(df_tsv).reset_index(drop=True)

df_all looks like this:

company catch_phrase phone timezone client_count
0 Watkins Inc Integrated radical installation 7712422719 America/New_York 442
1 Pennington PLC Future-proofed tertiary frame 1-312-296-2956x137 America/Indiana/Indianapolis 638
2 Bennett and Sons Persistent contextually-based standardization 018.666.0600 America/Los_Angeles 492
3 Ferguson-Garner Multi-layered tertiary neural-net (086)401-8955x53502 America/Los_Angeles 528

Applying this to work with a generator function like you originally wanted:

def parse_file(file_iterator):
    dict_keys_types = None

    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            # When you read a valid dict, set the keys and types 
            # So you can parse the tsv lines
            dict_keys_types = [
                  (key, type(value)) 
                  for key, value in d.items()
             ]
             yield d
        except json.JSONDecodeError: # JSONDecodeError when you try to loads() a TSV line
            tsv_data = line.split("\t")
            # Iterate over tsv_data and dict_keys_types to convert the tsv data to a dict with the correct types
            tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
            yield tsv_dict
        

Now, you can pass a file iterator to this function and it'll yield dictionaries like you want:

list(parse_file(f))

[{'company': 'Watkins Inc',
  'catch_phrase': 'Integrated radical installation',
  'phone': '7712422719',
  'timezone': 'America/New_York',
  'client_count': 442},
 {'company': 'Bennett and Sons',
  'catch_phrase': 'Persistent contextually-based standardization',
  'phone': '018.666.0600',
  'timezone': 'America/Los_Angeles',
  'client_count': 492},
 {'company': 'Ferguson-Garner',
  'catch_phrase': 'Multi-layered tertiary neural-net',
  'phone': '(086)401-8955x53502',
  'timezone': 'America/Los_Angeles',
  'client_count': 528},
 {'company': 'Pennington PLC',
  'catch_phrase': 'Future-proofed tertiary frame',
  'phone': ' 1-312-296-2956x137',
  'timezone': 'America/Indiana/Indianapolis',
  'client_count': 638}]

  • Related