Python - Function for parsing key-value pairs into DataFrame columns-CodePudding

I have a dataset with key-value pairs in a CSV file that looks similar to this:

"1, {""key"": ""construction_year"", ""value"": 1900}, {""key"": ""available_date"", ""value"": ""Vereinbarung""}"
"2, {""key"": ""available_date"", ""value"": ""01.04.2022""}, {""key"": ""useful_area"", ""value"": 60.0}"
"3, {""key"": ""construction_year"", ""value"": 2020}, {""key"": ""available_date"", ""value"": ""sofort""}"
"4, {""key"": ""available_date"", ""value"": ""Vereinbarung""}, {""key"": ""wheelchair_accessible"", ""value"": true}"

My expected output is the following:

id      construction_year   available_date   useful_area   wheelchair_accessible
1       1900                Vereinbarung     nan           nan
2       nan                 01.04.202        60.0          nan
3       2020                sofort           nan           nan 
4       nan                 Vereinbarung     nan           true

I have already tried to convert this data to a dict using json.loads and then parsing it. This method works on a small scale, when I can assure that all rows are perfectly formatted in a JSON style.

However, when I try to use json.loads on a list with the 200'000 observations, I always get some error since not all rows are on the right JSON format. For example, sometimes there is a "value" missing from a "key", sometimes there is a { on the wrong place, so json.loads results in the following error: JSONDecodeError: Expecting property name enclosed in double quotes

I have tried for days to fix the whole data into a JSON friendly format, but this seems impossible, I received this dataset in a terrible format and very messy.

So I wanted to know if someone could come up with a function which would allow me to split the key-value pairs into individual columns, without having to use json.loads.

Thanks in advance.

CodePudding user response：

It looks like someone scrape JavaScript code and saved as CSV string.

"1, {""key"": ""construction_year"", ""value"": 1900}, {""key"": ""available_date"", ""value"": ""Vereinbarung""}"

It needs to convert CSV string back to normal string and later parse it.

Or it needs to change text in lines to correct JSON data

[1, {"key": "construction_year", "value": 1900}, {"key": "available_date", "value": "Vereinbarung"}]

which can be converted to 3 columns.

And later you can convert dictionaries to one dictionary

[1, {'construction_year': 1900, 'available_date': 'Vereinbarung'}]

which can be converted to columns using pandas and .apply(pd.Series)

I uses text as string but you could read it from file

text = '''"1, {""key"": ""construction_year"", ""value"": 1900}, {""key"": ""available_date"", ""value"": ""Vereinbarung""}"
"2, {""key"": ""available_date"", ""value"": ""01.04.2022""}, {""key"": ""useful_area"", ""value"": 60.0}"
"3, {""key"": ""construction_year"", ""value"": 2020}, {""key"": ""available_date"", ""value"": ""sofort""}"
"4, {""key"": ""available_date"", ""value"": ""Vereinbarung""}, {""key"": ""wheelchair_accessible"", ""value"": true}"
'''

import pandas as pd

#text = open('data.csv').read()

rows = []
for line in text.splitlines():
    line = line.replace('""', '"')
    line = '['   line[1:-1]   ']'
    line = json.loads(line)

    item = {}
    for d in line[1:]:
        key = d['key']
        val = d['value']
        item[key] = val

    rows.append( [line[0], item] )
    
df = pd.DataFrame(rows, columns=['id', 'data'])

# convert dictionaries to columns
df = df.join(df['data'].apply(pd.Series))

# remove column with dictionaries
del df['data']

print(df.to_string())

Result:

    id  construction_year available_date  useful_area wheelchair_accessible
0   1             1900.0   Vereinbarung          NaN                   NaN
1   2                NaN     01.04.2022         60.0                   NaN
2   3             2020.0         sofort          NaN                   NaN
3   4                NaN   Vereinbarung          NaN                  True