I have a file like this:
{"Name": "John", "age": 15}{"Name": "Anna", "age": 12}
Both objects are on the same line. What format is this file in, and how can I read it into a pandas DataFrame that looks like this:
name age
John 15
Anna 12
Thanks!
CodePudding user response:
Approach 1 (use regex)
In your case, you may read the content of your file using:
with open('file_path', 'r') as f:
    content = f.read()
For this test, though, I will just assign content from your example line:
content = '''{"Name": "John", "age": 15}{"Name": "Anna", "age": 12}'''
Then use re.findall to extract the data into a list of tuples:
import re
data = re.findall(r'{"Name": "([^"]*)", "age": (\d+)}', content)
print(data)
[('John', '15'), ('Anna', '12')]
Then build the dataframe with
import pandas as pd
pd.DataFrame(data, columns=['Name', 'age'])
Note: re.findall searches content for the pattern {"Name": "([^"]*)", "age": (\d+)} and extracts whatever is matched inside the parentheses (). ([^"]*) is used for Name and matches a string of any length that does not include a " (so my assumption is that a name field never contains a "). For age, (\d+) matches one or more digits.
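Putting Approach 1 together, here is a minimal runnable sketch. It assumes every record has exactly the Name and age fields in that order, escapes the literal braces for clarity, and adds a conversion step (not part of the answer above) because re.findall returns the ages as strings.

```python
import re
import pandas as pd

content = '{"Name": "John", "age": 15}{"Name": "Anna", "age": 12}'

# One capture group per field; \d+ captures one or more digits.
data = re.findall(r'\{"Name": "([^"]*)", "age": (\d+)\}', content)

df = pd.DataFrame(data, columns=['Name', 'age'])
# re.findall returns strings, so convert age to an integer column.
df['age'] = df['age'].astype(int)
print(df)
```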
Approach 2 (use json)
Another way is to turn your content into valid JSON (an array of objects) and parse it:
import json
pd.DataFrame(json.loads('[' + content.replace('}{"Name": ', '},{"Name": ') + ']'))
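The file is effectively several JSON objects concatenated back to back, so a variation on Approach 2 is to let the standard library decoder walk through them one at a time instead of rewriting the string. The sketch below uses json.JSONDecoder.raw_decode, which returns the parsed value and the index where it ended. The helper name iter_json_objects is hypothetical, and the approach is an alternative to the replace trick, not something from the answer itself.

```python
import json
import pandas as pd

def iter_json_objects(text):
    """Yield each JSON value found back to back in text (hypothetical helper)."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so do it here.
        if text[pos].isspace():
            pos += 1
            continue
        obj, pos = decoder.raw_decode(text, pos)
        yield obj

content = '{"Name": "John", "age": 15}{"Name": "Anna", "age": 12}'
df = pd.DataFrame(list(iter_json_objects(content)))
print(df)
```

Because the decoder does the parsing, this needs no knowledge of the field names and is not confused by a value that happens to contain }{.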
CodePudding user response:
As a supplement to @Raymond Kwok's answer: we don't need to know the field names before converting the JSON to a DataFrame.
import pandas as pd
import re
# reading the file is the same as in the answer above, so it is skipped here
js = '{"Name": "John", "age": 15}{"Name": "Anna", "age": 12}'
# non-greedy match from each '{' to the next '}'
matches = re.findall(r'\{.+?}', js)
print(matches)
df = pd.DataFrame.from_dict(matches)
print(df)
output:
['{"Name": "John", "age": 15}', '{"Name": "Anna", "age": 12}']
0
0 {"Name": "John", "age": 15}
1 {"Name": "Anna", "age": 12}
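Note that the frame printed above still holds each record as an unparsed string in a single column. To get separate Name and age columns, one option is to parse each match with json.loads before building the frame. This is a minimal sketch that assumes flat objects with no } inside any value, since the non-greedy pattern stops at the first closing brace.

```python
import json
import re
import pandas as pd

js = '{"Name": "John", "age": 15}{"Name": "Anna", "age": 12}'

# Non-greedy match from each '{' to the next '}' (flat objects only).
matches = re.findall(r'\{.+?\}', js)

# Parse every matched string into a dict; the keys become the columns.
records = [json.loads(m) for m in matches]
df = pd.DataFrame(records)
print(df)
```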