I have a dataset with key-value pairs in a CSV file that looks similar to this:
"1, {""key"": ""construction_year"", ""value"": 1900}, {""key"": ""available_date"", ""value"": ""Vereinbarung""}"
"2, {""key"": ""available_date"", ""value"": ""01.04.2022""}, {""key"": ""useful_area"", ""value"": 60.0}"
"3, {""key"": ""construction_year"", ""value"": 2020}, {""key"": ""available_date"", ""value"": ""sofort""}"
"4, {""key"": ""available_date"", ""value"": ""Vereinbarung""}, {""key"": ""wheelchair_accessible"", ""value"": true}"
My expected output is the following:
id construction_year available_date useful_area wheelchair_accessible
1 1900 Vereinbarung nan nan
2 nan 01.04.202 60.0 nan
3 2020 sofort nan nan
4 nan Vereinbarung nan true
I have already tried to convert this data to a dict
using json.loads
and then parsing it. This method works on a small scale, when I can assure that all rows are perfectly formatted in a JSON style.
However, when I try to use json.loads
on a list
with the 200'000 observations, I always get some error since not all rows are on the right JSON format. For example, sometimes there is a "value" missing from a "key", sometimes there is a {
on the wrong place, so json.loads
results in the following error: JSONDecodeError: Expecting property name enclosed in double quotes
I have tried for days to fix the whole data into a JSON friendly format, but this seems impossible, I received this dataset in a terrible format and very messy.
So I wanted to know if someone could come up with a function which would allow me to split the key-value pairs into individual columns, without having to use json.loads
.
Thanks in advance.
CodePudding user response:
It looks like someone scrape JavaScript code and saved as CSV string.
"1, {""key"": ""construction_year"", ""value"": 1900}, {""key"": ""available_date"", ""value"": ""Vereinbarung""}"
It needs to convert CSV string back to normal string and later parse it.
Or it needs to change text in lines to correct JSON data
[1, {"key": "construction_year", "value": 1900}, {"key": "available_date", "value": "Vereinbarung"}]
which can be converted to 3 columns.
And later you can convert dictionaries to one dictionary
[1, {'construction_year': 1900, 'available_date': 'Vereinbarung'}]
which can be converted to columns using pandas
and .apply(pd.Series)
I uses text
as string but you could read it from file
text = '''"1, {""key"": ""construction_year"", ""value"": 1900}, {""key"": ""available_date"", ""value"": ""Vereinbarung""}"
"2, {""key"": ""available_date"", ""value"": ""01.04.2022""}, {""key"": ""useful_area"", ""value"": 60.0}"
"3, {""key"": ""construction_year"", ""value"": 2020}, {""key"": ""available_date"", ""value"": ""sofort""}"
"4, {""key"": ""available_date"", ""value"": ""Vereinbarung""}, {""key"": ""wheelchair_accessible"", ""value"": true}"
'''
import pandas as pd
#text = open('data.csv').read()
rows = []
for line in text.splitlines():
line = line.replace('""', '"')
line = '[' line[1:-1] ']'
line = json.loads(line)
item = {}
for d in line[1:]:
key = d['key']
val = d['value']
item[key] = val
rows.append( [line[0], item] )
df = pd.DataFrame(rows, columns=['id', 'data'])
# convert dictionaries to columns
df = df.join(df['data'].apply(pd.Series))
# remove column with dictionaries
del df['data']
print(df.to_string())
Result:
id construction_year available_date useful_area wheelchair_accessible
0 1 1900.0 Vereinbarung NaN NaN
1 2 NaN 01.04.2022 60.0 NaN
2 3 2020.0 sofort NaN NaN
3 4 NaN Vereinbarung NaN True