find regex match on pyspark dataframe


I have a pyspark dataframe like below

+---+-------------------------------------------------------------------------------+
|id |strval                                                                         |
+---+-------------------------------------------------------------------------------+
|1  |{"position":"left","title":"journeys","link":"https://www.abcd.com","rank":549}|
|2  |{"position":"left","title":"journeys","link":"https://www.abcd.com","rank":5}  |
+---+-------------------------------------------------------------------------------+

I want to extract only the value of rank using a regexp. I have tried the code below:

select id,
regexp_extract("strval", '(\"rank\":)(\d+)', 2) as rn
from tbl

but all I get are empty values. Can someone please guide me on how to achieve this?

CodePudding user response:

You should avoid using regex here; instead, parse your JSON content to extract the rank key. Assuming you are using PySpark 1.6+, we can try:

from pyspark.sql.functions import get_json_object

df = df.withColumn("rank", get_json_object(df.strval, '$.rank'))

Assuming you want to continue solving this via Spark SQL, you were already on the right track. Do not put the column name in double quotes: the parser can treat "strval" as a string literal, so the regex is matched against the literal text strval and returns an empty string. You also only need one capture group, for the numeric rank, and the backslash in \d must itself be escaped inside a SQL string literal:

SELECT id,
       REGEXP_EXTRACT(strval, '"rank":(\\d+)', 1) AS rn
FROM tbl;