I am trying to extract multiple sequences of numerical characters out of a long string. With regexp_extract, I am able to extract the first sequence, but how can I extract the following ones and write them into new columns (one column for every sequence)?
df.withColumn('IDENT1', F.regexp_extract(bookings.IDENTIFIER, r"(\d+)", 0))
+-------------------------------------------------+----------+
|IDENTIFIER                                       |IDENT1    |
+-------------------------------------------------+----------+
|31010833 - 9911075145 - some more random text    |31010833  |
|TL/9910900047/asdiqwoidw/more text// 23231263264 |9910900047|
+-------------------------------------------------+----------+
Answer:
I would use a UDF to extract all the numbers into an array, then either explode that array into columns or simply select the array elements directly, as the example below does.
Example:
import re
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.types import ArrayType, StringType
def extract(text):
    # Find every number in the string: optional sign, then digits,
    # with optional comma separators and an optional decimal part.
    all_matches = re.findall(r"-?\d+(?:[,\d]*\.\d+|[,\d]*)", text)
    return all_matches

extract_udf = F.udf(extract, ArrayType(StringType()))
spark = SparkSession.builder.getOrCreate()
data = [
    {"test": "31010833 - 9911075145 - some more random text"},
    {"test": "TL/9910900047/asdiqwoidw/more text// 23231263264 "},
]
df = spark.createDataFrame(data=data)
df = df.withColumn("numbers", extract_udf("test"))
df = df.select(["test"] + [df.numbers[i] for i in range(2)])
df.show(10, False)
df.printSchema()
Result:
+-------------------------------------------------+----------+-----------+
|test                                             |numbers[0]|numbers[1] |
+-------------------------------------------------+----------+-----------+
|31010833 - 9911075145 - some more random text    |31010833  |9911075145 |
|TL/9910900047/asdiqwoidw/more text// 23231263264 |9910900047|23231263264|
+-------------------------------------------------+----------+-----------+
root
|-- test: string (nullable = true)
|-- numbers[0]: string (nullable = true)
|-- numbers[1]: string (nullable = true)
Also, if you use Spark 3.1 or later you can use regexp_extract_all instead of a UDF. See this question: Extracting several regex matches in PySpark.
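For completeness, a minimal sketch of that approach, applied to the same raw DataFrame as above (before the UDF steps). Note this assumes Spark 3.1+ and goes through F.expr, since the Python wrapper for regexp_extract_all was only added in later releases; the simpler pattern (\d+) is enough for plain digit runs like these:
df2 = spark.createDataFrame(data=data)
# regexp_extract_all returns every match of capture group 1 as an
# array<string>. The doubled backslash escapes \d inside the SQL
# string literal.
df2 = df2.withColumn("numbers", F.expr(r"regexp_extract_all(test, '(\\d+)', 1)"))
df2.select(["test"] + [df2.numbers[i] for i in range(2)]).show(10, False)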