Home > Back-end >  Splitting object data into new columns in dataframe
Splitting object data into new columns in dataframe

Time:12-03

i have a dataframe with column business_id and attributes with thousands of rows like this:

 ------------------------------------------------------------------------------------ ------------------- 
|                                                                         attributes |        business_id|
 -------------------- --------------------------------------------------------------- ------------------- 
|{"WiFi":"u'free","HasTV":"False","RestaurantsTableService":"True","Caters":"True".. |6iYb2HFDywm3zjuRg0q|
|{"HasTV:"False","Ambience":{'romantic': False, 'intimate': False,},"Price":"2" .... |7f4z43MHAV-l-LsRYsa|
 ------------------------------------------------------------------------------------ ------------------- 

how do create new column for each attribute with the value to the business id ? and if it's not applicable to that business id, it will specify false.

example:

 --------------- ------------- ------------------- 
|           Wifi|        Price|        business_id|
 --------------- ------------- ------------------- 
|         u'free|        False|6iYb2HFDywm3zjuRg0q|
|          False|            2|7f4z43MHAV-l-LsRYsa|
 --------------- ------------- ------------------- 

while also noting that there are some attributes with value as object in an object like:

{..."Ambience":{'romantic': False, 'intimate': False}...}

for this, if possible, i would like to tag that values to the attribute like as the column name like:

Ambience.romantic

Would someone please help me with this if you know how?

CodePudding user response:

You could apply pd.json_normalize() to each of the dicts in attributes:

df2 = pd.concat(df['attributes'].apply(pd.json_normalize).to_list(),
                keys=df.business_id)
>>> df2
                         WiFi  HasTV RestaurantsTableService Caters Price  \
business_id                                                                 
6iYb2HFDywm3zjuRg0q 0  u'free  False                    True   True   NaN   
7f4z43MHAV-l-LsRYsa 0     NaN  False                     NaN    NaN     2   

                      Ambience.romantic Ambience.intimate  
business_id                                                
6iYb2HFDywm3zjuRg0q 0               NaN               NaN  
7f4z43MHAV-l-LsRYsa 0             False             False

Note, the above follows a reproducible setup:

# first, reproducible setup
df = pd.DataFrame({
    'attributes': [
        {"WiFi":"u'free","HasTV":"False","RestaurantsTableService":"True","Caters":"True"},
        {"HasTV":"False", "Ambience":{'romantic': False, 'intimate': False,},"Price":"2"},
    ],
    'business_id': ['6iYb2HFDywm3zjuRg0q', '7f4z43MHAV-l-LsRYsa'],
})

CodePudding user response:

As long as your attribute string is a valid JSON, you can extract a single JSON value using get_json_object like this

from pyspark.sql import functions as F

df = spark.createDataFrame([
    ('''{"WiFi":"free","HasTV":"False","RestaurantsTableService":"True","Price":"3"}''', '6iYb2HFDywm3zjuRg0q'),
    ('''{"HasTV":"False","Ambience":{"romantic": "False", "intimate": "False"},"Price":"2"}''', '7f4z43MHAV-l-LsRYsa'),
], ['attributes', 'business_id'])

(df
    .withColumn('price', F.get_json_object('attributes', '$.Price'))
    .withColumn('has_tv', F.get_json_object('attributes', '$.HasTV'))
    .select('price', 'has_tv', 'business_id')
    .show(10, False)
)

 ----- ------ ------------------- 
|price|has_tv|business_id        |
 ----- ------ ------------------- 
|3    |False |6iYb2HFDywm3zjuRg0q|
|2    |False |7f4z43MHAV-l-LsRYsa|
 ----- ------ ------------------- 
  • Related