I have a pyspark datafrme like below
--- -------------------------------------------------------------------------------
|id |strval |
--- -------------------------------------------------------------------------------
|1 |{"position":"left","title":"journeys","link":"https://www.abcd.com","rank":549}|
|2 |{"position":"left","title":"journeys","link":"https://www.abcd.com","rank":5} |
--- -------------------------------------------------------------------------------
I want to find only the value of rank using the regexp I have tried below code
select id,
regexp_extract("strval", '(\"rank\":)(\d )',2) as rn
from tbl
but all I am getting is empty values. Can someone please guide me to achieve this?
CodePudding user response:
You should avoid using regex here, and instead should parse your JSON content to extract the rank
key. Assuming you are using PySpark 1.6 , we can try:
from pyspark.sql.functions import get_json_object
df = df.withColumn("rank", get_json_object(df.strval, f'$.rank'))
Assuming you want to continue solving this via Spark SQL, you were already on the right track. I would suggest not putting the column name in double quotes, as that might be mistaken for a string literal. Also, you only need to use one capture group, for the numerical rank:
SELECT id,
REGEXP_EXTRACT(strval, '"rank":(\d )', 1) AS rn
FROM tbl;