Use python module function with spark dataframe


I would like to create a new column in a pyspark dataframe using a function from an external python module I've installed.

For example, I want to use the get_tld function from the publicsuffix2 module, which extracts the public suffix from a domain using the Public Suffix List.

My current solution uses a udf:

import pyspark.sql.functions as F
from publicsuffix2 import get_tld

df = spark.createDataFrame([{"domain": "stackoverflow.com"},
                            {"domain": "wikipedia.org"},
                            {"domain": "google.invalid"}])

get_public_suffix = F.udf(lambda x: get_tld(x))
df.withColumn("public suffix", get_public_suffix(F.col("domain"))).show()

| domain            | public suffix |
| ----------------- | ------------- |
| stackoverflow.com | com           |
| wikipedia.org     | org           |
| google.invalid    | null          |

My questions are:

  1. Is there a way to accomplish this without a udf?
  2. If not, is there anything I can do to improve the efficiency of this operation and what are some best practices to follow when using external modules/libraries such as this?

CodePudding user response:

    import pyspark.sql.functions as f

    df = spark.createDataFrame([{"domain": "stackoverflow.com"},
                                {"domain": "wikipedia.org"},
                                {"domain": "google.invalid"}])

    suf = df.withColumn("public suffix", f.split("domain", r"\.")[1])
    display(suf)  # Databricks display(); use suf.show() outside Databricks
Output:
| domain            | public suffix|
| ----------------- | ------------ |
| stackoverflow.com | com          |
| wikipedia.org     | org          |
| google.invalid    | invalid      |
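Note that indexing the split result with [1] takes the second label, so it happens to work for these two-label domains but would return google for mail.google.com, and it returns invalid for google.invalid where the UDF returned null. If the goal is just the last label rather than a true Public Suffix List lookup, a sketch using the built-in substring_index (not part of the original answer):

    import pyspark.sql.functions as f

    # Keep everything after the last "." (the last label only, not a PSL lookup:
    # a multi-part suffix such as co.uk would come back as just uk)
    suf_last = df.withColumn("public suffix", f.substring_index("domain", ".", -1))
    suf_last.show()

Neither split nor substring_index consults the Public Suffix List, so results will differ from get_tld for domains with multi-part suffixes.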

CodePudding user response:

UDFs are black boxes: don't use them unless you have no choice

A rule of thumb I always follow when working with Spark DataFrames and transformations is to avoid UDFs as much as possible. UDFs hurt performance and parallelism because Spark has to serialize each row and send it to the Python runtime, and the optimizer cannot see inside them.

In the question above, the same logic can be expressed with native Spark functions such as F.split; below is how to implement it with native Spark methods.

df_suffix = df.withColumn("pub suffix", F.split("domain", r"\.")[1])

Usually, when I run into a problem that seems to need a UDF, I ask myself the questions below (and if the answer to both is no, see the sketch after the list):

  • Is there a pyspark function, or combination of functions, that will solve my problem?
  • Is there a SQL function for this purpose?
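When neither of these applies, as in this case where a correct public suffix genuinely requires the Public Suffix List, a vectorized pandas UDF is usually a cheaper fallback than a plain row-at-a-time UDF, because data is exchanged with the Python workers in Arrow batches rather than row by row (this requires pyarrow on the cluster). A minimal sketch, assuming publicsuffix2 is installed on the executors; the function name get_public_suffix_vec is mine, not from the original post:

import pandas as pd
import pyspark.sql.functions as F
from publicsuffix2 import get_tld

# Vectorized UDF: each call receives a pandas Series for one Arrow batch
@F.pandas_udf("string")
def get_public_suffix_vec(domains: pd.Series) -> pd.Series:
    return domains.apply(get_tld)

df.withColumn("public suffix", get_public_suffix_vec(F.col("domain"))).show()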