Use a Generator To Convert JSON and TSV Data into a Dictionary-CodePudding

We need to get the data from the file file.data into a DataFrame. The problem is that the data on each line of the file is in either a JSON or Tab-separated values (TSV) format.

The JSON lines are in the correct format, they just need to be converted to native Python dicts.

The TSV lines need to be converted in to dicts that match the JSON format.

Here is a sample of the file:

{"company": "Watkins Inc", "catch_phrase": "Integrated radical installation", "phone": "7712422719", "timezone": "America/New_York", "client_count": 442}
Bennett and Sons    Persistent contextually-based standardization   018.666.0600    America/Los_Angeles 492
Ferguson-Garner Multi-layered tertiary neural-net   (086)401-8955x53502 America/Los_Angeles 528
{"company": "Pennington PLC", "catch_phrase": "Future-proofed tertiary frame", "phone": " 1-312-296-2956x137", "timezone": "America/Indiana/Indianapolis", "client_count": 638}

Write a generator that takes an iterator as an argument. It should parse the values in the iterator and yield each value in the correct format: A dict with the keys:

company
catch_phrase
phone
timezone
client_count

My code so far:

df = pd.read_csv("file.data", sep="\t")
    for col in df[["company"]]:
        obj = df[col]
        for item in obj.values:
            json_obj = json.loads(item)

CodePudding user response：

Don't use pandas to read the entire file. Instead, read the file line by line, and create a list of dicts. Then use pandas to get your dataframe.

dict_data = []
tsv_data = []
with open('file.data', 'r') as f:
    for line in f:
        line = line.strip()
        try:
            d = json.loads(line)
            dict_data.append(d)
        except json.JSONDecodeError: # JSONDecodeError when you try to loads() a TSV line
            tsv_data.append(line.split("\t")) # Split the line by tabs, append to the tsv list

After this, we have

dict_data = [{'company': 'Watkins Inc',
  'catch_phrase': 'Integrated radical installation',
  'phone': '7712422719',
  'timezone': 'America/New_York',
  'client_count': 442},
 {'company': 'Pennington PLC',
  'catch_phrase': 'Future-proofed tertiary frame',
  'phone': ' 1-312-296-2956x137',
  'timezone': 'America/Indiana/Indianapolis',
  'client_count': 638}]

tsv_data = [['Bennett and Sons',
  'Persistent contextually-based standardization',
  '018.666.0600',
  'America/Los_Angeles',
  '492'],
 ['Ferguson-Garner',
  'Multi-layered tertiary neural-net',
  '(086)401-8955x53502',
  'America/Los_Angeles',
  '528']]

Notice that everything in tsv_data is a string, so we're going to have to fix that at some point.

Now, create a dataframe using the two lists dict_data and tsv_data, change the data type for the tsv dataframe, and join the two.

data_cols = list(dict_data[0].keys())
df_dict = pd.DataFrame(dict_data)
df_tsv = pd.DataFrame(tsv_data, columns=data_cols)


for column in df_tsv:
    df_tsv[column] = df_tsv[column].astype(df_dict[column].dtype)

df_all = df_dict.append(df_tsv).reset_index(drop=True)

df_all looks like this:

	company	catch_phrase	phone	timezone	client_count
0	Watkins Inc	Integrated radical installation	7712422719	America/New_York	442
1	Pennington PLC	Future-proofed tertiary frame	1-312-296-2956x137	America/Indiana/Indianapolis	638
2	Bennett and Sons	Persistent contextually-based standardization	018.666.0600	America/Los_Angeles	492
3	Ferguson-Garner	Multi-layered tertiary neural-net	(086)401-8955x53502	America/Los_Angeles	528

Applying this to work with a generator function like you originally wanted:

def parse_file(file_iterator):
    dict_keys_types = None

    for line in file_iterator:
        line = line.strip()
        try:
            d = json.loads(line)
            # When you read a valid dict, set the keys and types 
            # So you can parse the tsv lines
            dict_keys_types = [
                  (key, type(value)) 
                  for key, value in d.items()
             ]
             yield d
        except json.JSONDecodeError: # JSONDecodeError when you try to loads() a TSV line
            tsv_data = line.split("\t")
            # Iterate over tsv_data and dict_keys_types to convert the tsv data to a dict with the correct types
            tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
            yield tsv_dict

Now, you can pass a file iterator to this function and it'll yield dictionaries like you want:

list(parse_file(f))

[{'company': 'Watkins Inc',
  'catch_phrase': 'Integrated radical installation',
  'phone': '7712422719',
  'timezone': 'America/New_York',
  'client_count': 442},
 {'company': 'Bennett and Sons',
  'catch_phrase': 'Persistent contextually-based standardization',
  'phone': '018.666.0600',
  'timezone': 'America/Los_Angeles',
  'client_count': 492},
 {'company': 'Ferguson-Garner',
  'catch_phrase': 'Multi-layered tertiary neural-net',
  'phone': '(086)401-8955x53502',
  'timezone': 'America/Los_Angeles',
  'client_count': 528},
 {'company': 'Pennington PLC',
  'catch_phrase': 'Future-proofed tertiary frame',
  'phone': ' 1-312-296-2956x137',
  'timezone': 'America/Indiana/Indianapolis',
  'client_count': 638}]