i have a dataframe with column business_id and attributes with thousands of rows like this:
------------------------------------------------------------------------------------ -------------------
| attributes | business_id|
-------------------- --------------------------------------------------------------- -------------------
|{"WiFi":"u'free","HasTV":"False","RestaurantsTableService":"True","Caters":"True".. |6iYb2HFDywm3zjuRg0q|
|{"HasTV:"False","Ambience":{'romantic': False, 'intimate': False,},"Price":"2" .... |7f4z43MHAV-l-LsRYsa|
------------------------------------------------------------------------------------ -------------------
how do create new column for each attribute with the value to the business id ? and if it's not applicable to that business id, it will specify false.
example:
--------------- ------------- -------------------
| Wifi| Price| business_id|
--------------- ------------- -------------------
| u'free| False|6iYb2HFDywm3zjuRg0q|
| False| 2|7f4z43MHAV-l-LsRYsa|
--------------- ------------- -------------------
while also noting that there are some attributes with value as object in an object like:
{..."Ambience":{'romantic': False, 'intimate': False}...}
for this, if possible, i would like to tag that values to the attribute like as the column name like:
Ambience.romantic
Would someone please help me with this if you know how?
CodePudding user response:
You could apply pd.json_normalize()
to each of the dict
s in attributes
:
df2 = pd.concat(df['attributes'].apply(pd.json_normalize).to_list(),
keys=df.business_id)
>>> df2
WiFi HasTV RestaurantsTableService Caters Price \
business_id
6iYb2HFDywm3zjuRg0q 0 u'free False True True NaN
7f4z43MHAV-l-LsRYsa 0 NaN False NaN NaN 2
Ambience.romantic Ambience.intimate
business_id
6iYb2HFDywm3zjuRg0q 0 NaN NaN
7f4z43MHAV-l-LsRYsa 0 False False
Note, the above follows a reproducible setup:
# first, reproducible setup
df = pd.DataFrame({
'attributes': [
{"WiFi":"u'free","HasTV":"False","RestaurantsTableService":"True","Caters":"True"},
{"HasTV":"False", "Ambience":{'romantic': False, 'intimate': False,},"Price":"2"},
],
'business_id': ['6iYb2HFDywm3zjuRg0q', '7f4z43MHAV-l-LsRYsa'],
})
CodePudding user response:
As long as your attribute
string is a valid JSON, you can extract a single JSON value using get_json_object
like this
from pyspark.sql import functions as F
df = spark.createDataFrame([
('''{"WiFi":"free","HasTV":"False","RestaurantsTableService":"True","Price":"3"}''', '6iYb2HFDywm3zjuRg0q'),
('''{"HasTV":"False","Ambience":{"romantic": "False", "intimate": "False"},"Price":"2"}''', '7f4z43MHAV-l-LsRYsa'),
], ['attributes', 'business_id'])
(df
.withColumn('price', F.get_json_object('attributes', '$.Price'))
.withColumn('has_tv', F.get_json_object('attributes', '$.HasTV'))
.select('price', 'has_tv', 'business_id')
.show(10, False)
)
----- ------ -------------------
|price|has_tv|business_id |
----- ------ -------------------
|3 |False |6iYb2HFDywm3zjuRg0q|
|2 |False |7f4z43MHAV-l-LsRYsa|
----- ------ -------------------