Home > Net >  python - How to transform key-value pairs in Pandas dataframe
python - How to transform key-value pairs in Pandas dataframe

Time:04-02

I received this dataset which contains real estate data in key-value pairs in a .csv format.

If I drop the first line, I can load it with Pandas and get a dataframe like so:

id 1 [{'key'": '"floor'" '"value'": '"2. Floor'"} {'"key'": '"available_date'" "value'": '"nach Vereinbarung'"}
id 2 [{'key'": '"floor'" '"value'": '"1. Floor'"} {'"key'": '"living_space'" "value'": 81.0}
id 3 [{'key'": '"living_space'" '"value'": 240.0} {'"key'": '"construction_year'" '"value'": 2012}
id 4 [{'key'": '"living_space'" '"value'": 280.0} {'"key'": '"construction_year'" '"value'": 1851}

However, I don't know how to work with key-value pairs in Python, so I wanted to transform this data into a Pandas dataframe, with the "keys" as the headers and their respective values in each row, like so:

id floor available_date living_space construction_year
id 1 2. Floor nach Vereinbarung
id 2 1. Floor 81
id 3 240.0 2012
id 4 280.0 1851

I have found many instructions on how to transform a Pandas dataframe into key-value pairs, but not the other way around...

Thank you in advance.

UPDATE

The content of my data looks like this:

print(df.head(10))
           [{'key'": '"floor'"   '"value'": '"3. Stock'"}        {'"key'": '"living_space'"    '"value'": 50.0}      {'"key'": '"available_date'"  ... Unnamed: 49 Unnamed: 50 Unnamed: 51 Unnamed: 52 Unnamed: 53
0          [{'key'": '"floor'"   '"value'": '"2. Stock'"}        {'"key'": '"living_space'"   '"value'": 113.0}   {'"key'": '"construction_year'"  ...         NaN         NaN         NaN         NaN         NaN
1          [{'key'": '"floor'"   '"value'": '"1. Stock'"}        {'"key'": '"living_space'"    '"value'": 52.0}   {'"key'": '"construction_year'"  ...         NaN         NaN         NaN         NaN         NaN
..                         ...                        ...                               ...                 ...                               ...  ...         ...         ...         ...         ...         ...
8   [{'key'": '"living_space'"          '"value'": 240.0}   {'"key'": '"construction_year'"    '"value'": 2012}      {'"key'": '"available_date'"  ...         NaN         NaN         NaN         NaN         NaN
9   [{'key'": '"living_space'"          '"value'": 280.0}   {'"key'": '"construction_year'"    '"value'": 1851}      {'"key'": '"available_date'"  ...         NaN         NaN         NaN         NaN         NaN

[10 rows x 54 columns]

UPDATE

The contents of the .csv looks like the following (for the 2 first observations):

1,"[{'key'"": '""floor'"""," '""value'"": '""3. Stock'""}"," {'""key'"": '""living_space'"""," '""value'"": 50.0}"," {'""key'"": '""available_date'"""," '""value'"": '""01.04.2022'""}"," {'""key'"": '""useful_area'"""," '""value'"": 60.0}"," {'""key'"": '""pets_allowed'"""," '""value'"": true}"," {'""key'"": '""child_friendly'"""," '""value'"": true}"," {'""key'"": '""balcony'"""," '""value'"": true}"," {'""key'"": '""parking_outdoor'"""," '""value'"": true}"," {'""key'"": '""lift'"""," '""value'"": true}"," {'""key'"": '""cable_tv'"""," '""value'"": true}]""","[{'date'"": '""2022-02-25'"""," '""price_amount'"": 1550}]"""

2,"[{'key'"": '""floor'"""," '""value'"": '""2. Stock'""}"," {'""key'"": '""living_space'"""," '""value'"": 113.0}"," {'""key'"": '""construction_year'"""," '""value'"": 2022}"," {'""key'"": '""available_date'"""," '""value'"": '""01.04.2022'""}"," {'""key'"": '""wheelchair_accessible'"""," '""value'"": true}"," {'""key'"": '""child_friendly'"""," '""value'"": true}"," {'""key'"": '""balcony'"""," '""value'"": true}"," {'""key'"": '""parking_indoor'"""," '""value'"": true}"," {'""key'"": '""lift'"""," '""value'"": true}]""","[{'date'"": '""2022-02-27'"""," '""price_amount'"": 2990}]"""

The data was scrapped from real estate online marketplaces it seems. I think is also relevant to state that each observation has a different number of features.

CodePudding user response:

Possible solution is the following:

file 'data.csv' content

1,"[{'key'"": '""floor'"""," '""value'"": '""3. Stock'""}"," {'""key'"": '""living_space'"""," '""value'"": 50.0}"," {'""key'"": '""available_date'"""," '""value'"": '""01.04.2022'""}"," {'""key'"": '""useful_area'"""," '""value'"": 60.0}"," {'""key'"": '""pets_allowed'"""," '""value'"": true}"," {'""key'"": '""child_friendly'"""," '""value'"": true}"," {'""key'"": '""balcony'"""," '""value'"": true}"," {'""key'"": '""parking_outdoor'"""," '""value'"": true}"," {'""key'"": '""lift'"""," '""value'"": true}"," {'""key'"": '""cable_tv'"""," '""value'"": true}]""","[{'date'"": '""2022-02-25'"""," '""price_amount'"": 1550}]"""
2,"[{'key'"": '""floor'"""," '""value'"": '""2. Stock'""}"," {'""key'"": '""living_space'"""," '""value'"": 113.0}"," {'""key'"": '""construction_year'"""," '""value'"": 2022}"," {'""key'"": '""available_date'"""," '""value'"": '""01.04.2022'""}"," {'""key'"": '""wheelchair_accessible'"""," '""value'"": true}"," {'""key'"": '""child_friendly'"""," '""value'"": true}"," {'""key'"": '""balcony'"""," '""value'"": true}"," {'""key'"": '""parking_indoor'"""," '""value'"": true}"," {'""key'"": '""lift'"""," '""value'"": true}]""","[{'date'"": '""2022-02-27'"""," '""price_amount'"": 2990}]"""

import pandas as pd
import json

# read data from csv file
with open("data.csv", "r", encoding="utf-8") as file:
    data = file.read().replace('"', '').replace("'", '"').replace("[", '').replace("]", '').splitlines()

# convert string to list
data_dict = [json.loads("["   d   "]") for d in data]

data_all = []

for list_item in data_dict:
    data_prepared = {}
    for idx, item in enumerate(list_item):
        if idx == 0:
            data_prepared["id"] = item
        else:
            if 'key' in item:
                data_prepared[item['key']] = item['value']
            else:
                data_prepared.update(item)
    data_all.append(data_prepared)

# create dataframe
df = pd.DataFrame(data_all)
df = df.fillna("-")
df = df.replace(True, 'Yes')
df = df.replace(False, 'No')
df

Returns

enter image description here

  • Related