I'm currently trying to convert an epoch date that is returned via an API call in some JSON. I'm attempting to tidy the field up but I've tried regex's etc. and get the same error so I've tried to do something basic.
This is an example of the dataframe at the moment
-------- ---------------------
|d.userId|d.startDate |
-------- ---------------------
|60000033|/Date(1659398400000)/|
|100003 |/Date(1635724800000)/|
|2 |/Date(1672531200000)/|
|100010 |/Date(1635724800000)/|
This is the code which generates the above dataframe.
from pyspark.sql.functions import explode, col
from pyspark.sql import functions as f
jsonDF2 = jsonDF.withColumn('d', explode(col('d.results')))
jsonDF2 = jsonDF2.select(jsonDF2.d.userId, jsonDF2.d.startDate)#.show(10, False)
The following line ends up with the error.
jsonDF3 = jsonDF2.withColumn("startDate", col('startDate').replace("\Date(", "")).show(10, False)
Column Object Not Callable
CodePudding user response:
.replace is a python function not a pyspark df function use regexp_replace instead
jsonDF3 = jsonDF2.withColumn("startDate",regexp_replace(col('startDate'),"\d ","")).show(10,False)
CodePudding user response:
You try to replace
on the Column
object. Column
object does not have replace
method. These are the methods of Column
class.
You probably wanted to use this replace
function. Sadly, it does not have DataFrame API. You can only access it using expr
.
from pyspark.sql import functions as F
jsonDF2 = spark.createDataFrame(
[(60000033, '/Date(1659398400000)/'),
(100003, '/Date(1635724800000)/'),
(2, '/Date(1672531200000)/'),
(100010, '/Date(1635724800000)/')],
['d.userId', 'd.startDate']
)
jsonDF3 = jsonDF2.withColumn("startDate", F.expr("replace(`d.startDate`, '/Date(', '')"))
jsonDF3.show(truncate=0)
# -------- --------------------- ---------------
# |d.userId|d.startDate |startDate |
# -------- --------------------- ---------------
# |60000033|/Date(1659398400000)/|1659398400000)/|
# |100003 |/Date(1635724800000)/|1635724800000)/|
# |2 |/Date(1672531200000)/|1672531200000)/|
# |100010 |/Date(1635724800000)/|1635724800000)/|
# -------- --------------------- ---------------
However, regexp_extract
would probably be a better choice:
jsonDF3 = jsonDF2.withColumn("startDate", F.regexp_extract('`d.startDate`', '\d ', 0))
jsonDF3.show(truncate=0)
# -------- --------------------- -------------
# |d.userId|d.startDate |startDate |
# -------- --------------------- -------------
# |60000033|/Date(1659398400000)/|1659398400000|
# |100003 |/Date(1635724800000)/|1635724800000|
# |2 |/Date(1672531200000)/|1672531200000|
# |100010 |/Date(1635724800000)/|1635724800000|
# -------- --------------------- -------------