We need to get the data from the file file.data into a DataFrame. The problem is that the data on each line of the file is in either a JSON or Tab-separated values (TSV) format.
The JSON lines are in the correct format, they just need to be converted to native Python dicts.
The TSV lines need to be converted in to dicts that match the JSON format.
Here is a sample of the file:
{"company": "Watkins Inc", "catch_phrase": "Integrated radical installation", "phone": "7712422719", "timezone": "America/New_York", "client_count": 442}
Bennett and Sons Persistent contextually-based standardization 018.666.0600 America/Los_Angeles 492
Ferguson-Garner Multi-layered tertiary neural-net (086)401-8955x53502 America/Los_Angeles 528
{"company": "Pennington PLC", "catch_phrase": "Future-proofed tertiary frame", "phone": " 1-312-296-2956x137", "timezone": "America/Indiana/Indianapolis", "client_count": 638}
Write a generator that takes an iterator as an argument. It should parse the values in the iterator and yield each value in the correct format: A dict with the keys:
- company
- catch_phrase
- phone
- timezone
- client_count
My code so far:
df = pd.read_csv("file.data", sep="\t")
for col in df[["company"]]:
obj = df[col]
for item in obj.values:
json_obj = json.loads(item)
CodePudding user response:
Don't use pandas to read the entire file. Instead, read the file line by line, and create a list of dicts. Then use pandas to get your dataframe.
dict_data = []
tsv_data = []
with open('file.data', 'r') as f:
for line in f:
line = line.strip()
try:
d = json.loads(line)
dict_data.append(d)
except json.JSONDecodeError: # JSONDecodeError when you try to loads() a TSV line
tsv_data.append(line.split("\t")) # Split the line by tabs, append to the tsv list
After this, we have
dict_data = [{'company': 'Watkins Inc',
'catch_phrase': 'Integrated radical installation',
'phone': '7712422719',
'timezone': 'America/New_York',
'client_count': 442},
{'company': 'Pennington PLC',
'catch_phrase': 'Future-proofed tertiary frame',
'phone': ' 1-312-296-2956x137',
'timezone': 'America/Indiana/Indianapolis',
'client_count': 638}]
tsv_data = [['Bennett and Sons',
'Persistent contextually-based standardization',
'018.666.0600',
'America/Los_Angeles',
'492'],
['Ferguson-Garner',
'Multi-layered tertiary neural-net',
'(086)401-8955x53502',
'America/Los_Angeles',
'528']]
Notice that everything in tsv_data
is a string, so we're going to have to fix that at some point.
Now, create a dataframe using the two lists dict_data
and tsv_data
, change the data type for the tsv
dataframe, and join the two.
data_cols = list(dict_data[0].keys())
df_dict = pd.DataFrame(dict_data)
df_tsv = pd.DataFrame(tsv_data, columns=data_cols)
for column in df_tsv:
df_tsv[column] = df_tsv[column].astype(df_dict[column].dtype)
df_all = df_dict.append(df_tsv).reset_index(drop=True)
df_all
looks like this:
company | catch_phrase | phone | timezone | client_count | |
---|---|---|---|---|---|
0 | Watkins Inc | Integrated radical installation | 7712422719 | America/New_York | 442 |
1 | Pennington PLC | Future-proofed tertiary frame | 1-312-296-2956x137 | America/Indiana/Indianapolis | 638 |
2 | Bennett and Sons | Persistent contextually-based standardization | 018.666.0600 | America/Los_Angeles | 492 |
3 | Ferguson-Garner | Multi-layered tertiary neural-net | (086)401-8955x53502 | America/Los_Angeles | 528 |
Applying this to work with a generator function like you originally wanted:
def parse_file(file_iterator):
dict_keys_types = None
for line in file_iterator:
line = line.strip()
try:
d = json.loads(line)
# When you read a valid dict, set the keys and types
# So you can parse the tsv lines
dict_keys_types = [
(key, type(value))
for key, value in d.items()
]
yield d
except json.JSONDecodeError: # JSONDecodeError when you try to loads() a TSV line
tsv_data = line.split("\t")
# Iterate over tsv_data and dict_keys_types to convert the tsv data to a dict with the correct types
tsv_dict = {key: dtype(value) for value, (key, dtype) in zip(tsv_data, dict_keys_types)}
yield tsv_dict
Now, you can pass a file iterator to this function and it'll yield dictionaries like you want:
list(parse_file(f))
[{'company': 'Watkins Inc',
'catch_phrase': 'Integrated radical installation',
'phone': '7712422719',
'timezone': 'America/New_York',
'client_count': 442},
{'company': 'Bennett and Sons',
'catch_phrase': 'Persistent contextually-based standardization',
'phone': '018.666.0600',
'timezone': 'America/Los_Angeles',
'client_count': 492},
{'company': 'Ferguson-Garner',
'catch_phrase': 'Multi-layered tertiary neural-net',
'phone': '(086)401-8955x53502',
'timezone': 'America/Los_Angeles',
'client_count': 528},
{'company': 'Pennington PLC',
'catch_phrase': 'Future-proofed tertiary frame',
'phone': ' 1-312-296-2956x137',
'timezone': 'America/Indiana/Indianapolis',
'client_count': 638}]