find regex match on pyspark dataframe-CodePudding

I have a pyspark datafrme like below

 --- ------------------------------------------------------------------------------- 
|id |strval                                                                         |
 --- ------------------------------------------------------------------------------- 
|1  |{"position":"left","title":"journeys","link":"https://www.abcd.com","rank":549}|
|2  |{"position":"left","title":"journeys","link":"https://www.abcd.com","rank":5}  |
 --- -------------------------------------------------------------------------------

I want to find only the value of rank using the regexp I have tried below code

select id,
regexp_extract("strval", '(\"rank\":)(\d )',2) as rn
from tbl

but all I am getting is empty values. Can someone please guide me to achieve this?

CodePudding user response：

You should avoid using regex here, and instead should parse your JSON content to extract the rank key. Assuming you are using PySpark 1.6 , we can try:

from pyspark.sql.functions import get_json_object

df = df.withColumn("rank", get_json_object(df.strval, f'$.rank'))

Assuming you want to continue solving this via Spark SQL, you were already on the right track. I would suggest not putting the column name in double quotes, as that might be mistaken for a string literal. Also, you only need to use one capture group, for the numerical rank:

SELECT id,
       REGEXP_EXTRACT(strval, '"rank":(\d )', 1) AS rn
FROM tbl;