What is this format and how do I read it into pandas?

I have a file like this:

{"Name": "John", "age": 15}{"Name": "Anna", "age": 12}

They are on the same line. What kind of format is this? How can I read it into a pandas DataFrame so that I get:

name    age
John    15
Anna    12

Thanks!

CodePudding user response：

Approach 1 (use regex)

In your case, you may read the content of your file using:

with open('file_path', 'r') as f:
    content = f.read()

For this test, I will just assign your example line to content:

content = '''{"Name": "John", "age": 15}{"Name": "Anna", "age": 12}'''

Then use re.findall to extract the data into a list of tuples.

import re
data = re.findall(r'{"Name": "([^"]*)", "age": (\d+)}', content)

print(data)
[('John', '15'), ('Anna', '12')]

Then build the dataframe with

import pandas as pd
pd.DataFrame(data, columns=['Name', 'age'])

Note: re.findall looks for the pattern {"Name": "([^"]*)", "age": (\d+)} in content, and anything within the parentheses () is extracted. ([^"]*) is used for Name and matches a string of any length that does not include a " (so my assumption is that a name field never contains a "). For age, (\d+) matches one or more digits.
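One thing to keep in mind: re.findall returns every captured group as a string, so age comes back as '15' rather than 15. Below is a minimal sketch of casting the values before building the dataframe. It assumes the same example line as above, and the name rows is only used for illustration.

```python
import re

# Sample line from the question; in practice this would come from the file.
content = '{"Name": "John", "age": 15}{"Name": "Anna", "age": 12}'

# Same pattern as above: one group for the name, one for the digits of the age.
data = re.findall(r'{"Name": "([^"]*)", "age": (\d+)}', content)

# Captured groups are always strings, so cast age to int explicitly.
rows = [{"Name": name, "age": int(age)} for name, age in data]
print(rows)
```

Passing rows to pd.DataFrame(rows) should then give an integer age column rather than a column of strings.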

Approach 2 (use json)

Another way is to turn your content into valid JSON (a list of objects) and parse it.

import json

pd.DataFrame(json.loads('[' + content.replace('}{"Name": ', '},{"Name": ') + ']'))
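The string replacement above assumes that every object starts with "Name" and that }{"Name": never appears inside a value. As an alternative sketch, json.JSONDecoder.raw_decode from the standard library parses one object at a time and reports the index where it stopped, so it can walk through concatenated JSON objects whatever their keys are. The helper name iter_json_objects is made up for this example.

```python
import json

def iter_json_objects(text):
    """Yield each JSON value from a string of back-to-back JSON documents."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        if text[pos].isspace():
            pos += 1
            continue
        # Returns the parsed value and the index just past it.
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

content = '{"Name": "John", "age": 15}{"Name": "Anna", "age": 12}'
records = list(iter_json_objects(content))
print(records)
```

records is a list of dicts, so pd.DataFrame(records) should build the table with Name and age columns.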

CodePudding user response：

As a supplement to @Raymond Kwok's answer: we don't need to know the field names before converting the JSON to a DataFrame.

import pandas as pd
import re
# reading the file works the same as above, so it is skipped here
js = '{"Name": "John", "age": 15}{"Name": "Anna", "age": 12}'
matches = re.findall(r'\{.+?}', js)
print(matches)
df = pd.DataFrame.from_dict(matches)
print(df)

output:

['{"Name": "John", "age": 15}', '{"Name": "Anna", "age": 12}']
                             0
0  {"Name": "John", "age": 15}
1  {"Name": "Anna", "age": 12}
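In the output above, each row is still a raw JSON string in a single column 0. A possible next step, sketched here, is to parse each match with json.loads so that the keys become columns. This assumes the objects are flat: the non-greedy pattern \{.+?\} stops at the first }, so it would cut off nested objects or values that contain a }.

```python
import json
import re

js = '{"Name": "John", "age": 15}{"Name": "Anna", "age": 12}'

# Split the line into one JSON string per object (flat objects only).
matches = re.findall(r'\{.+?\}', js)

# Parse each string into a dict; the keys are discovered from the data.
records = [json.loads(m) for m in matches]
print(records)
```

With records being a list of dicts, pd.DataFrame(records) should give one column per key instead of a single column of strings.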