Extract multiple substrings from column in pyspark


I have a pyspark DataFrame with only one column as follows:

df = spark.createDataFrame(
    ["This is AD185E000834",
     "U1JG97297 And ODNO926902 etc.",
     "DIHK2975290;HI22K2390279; DSM928HK08",
     "there is nothing here."],
    "string"
).toDF("col1")

I would like to extract the codes in col1 into other columns, like:

df.col2 = ["AD185E000834", "U1JG97297", "DIHK2975290", None]
df.col3 = [None, "ODNO926902", "HI22K2390279", None]
df.col4 = [None, None, "DSM928HK08", None]

Does anyone know how to do this? Thank you very much.

CodePudding user response:

I believe this can be shortened; I went the long way round to show you my logic (it would have been easier if you had laid out your own logic in the question). A more compact version is sketched after the output below.

from pyspark.sql import functions as F

# split each string into an array of tokens on whitespace or ';'
df1 = df.withColumn('k', F.split(F.col('col1'), r'\s|;')).withColumn('j', F.size('k'))

# compute the maximum array length across all rows
s = df1.agg(F.max('j').alias('max')).collect()[0][0]

df1 = (df1
       # keep only tokens made up entirely of uppercase letters and digits
       .withColumn('k', F.expr("filter(k, x -> x rlike '^[A-Z0-9]+$')"))
       # convert the resulting array into a struct so it can be expanded into columns
       .withColumn('k', F.struct(*[F.col('k')[i].alias(f'col{i+2}') for i in range(s)])))

# expand the struct column in df1 and join back to df
df.join(df1.select('col1', 'k.*'), how='left', on='col1').show()

+--------------------+------------+------------+----------+----+
|                col1|        col2|        col3|      col4|col5|
+--------------------+------------+------------+----------+----+
|DIHK2975290;HI22K...| DIHK2975290|HI22K2390279|DSM928HK08|null|
|This is AD185E000834|AD185E000834|        null|      null|null|
|U1JG97297 And ODN...|   U1JG97297|  ODNO926902|      null|null|
|there is nothing ...|        null|        null|      null|null|
+--------------------+------------+------------+----------+----+
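
As promised, here is the shortened version. This is only a sketch of the same logic collapsed into a single expression (it assumes Spark 2.4+ for the filter higher-order function; df2 and n are just local names introduced here):

from pyspark.sql import functions as F

# split, then keep only uppercase-letter/digit tokens, all inside one SQL expression
df2 = df.withColumn('k', F.expr(r"filter(split(col1, '\\s|;'), x -> x rlike '^[A-Z0-9]+$')"))

# the widest row determines how many output columns are needed
n = df2.agg(F.max(F.size('k'))).collect()[0][0]

# out-of-range array indices return null, so shorter rows are padded automatically
df2.select('col1', *[F.col('k')[i].alias(f'col{i+2}') for i in range(n)]).show()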

CodePudding user response:

As you said in your comment, we are assuming here that your "codes" are strings of at least two characters composed only of uppercase letters and digits.

That being said, as of Spark 3.1+ you can use regexp_extract_all (via the expr function) to create a temporary array column with all the codes, then dynamically create one column per entry of the array.

import pyspark.sql.functions as F

# create an array with all the identified "codes"
new_df = df.withColumn('myarray', F.expr("regexp_extract_all(col1, '([A-Z0-9]{2,})', 1)"))

# find the maximum amount of codes identified in a single string
max_array_length = new_df.withColumn('array_length', F.size('myarray')).agg({'array_length': 'max'}).collect()[0][0]
print('Max array length: {}'.format(max_array_length))

# explode the array in multiple columns
new_df.select('col1', *[new_df.myarray[i].alias('col' + str(i + 2)) for i in range(max_array_length)]) \
  .show(truncate=False)



Max array length: 3
+------------------------------------+------------+------------+----------+
|col1                                |col2        |col3        |col4      |
+------------------------------------+------------+------------+----------+
|This is AD185E000834                |AD185E000834|null        |null      |
|U1JG97297 And ODNO926902 etc.       |U1JG97297   |ODNO926902  |null      |
|DIHK2975290;HI22K2390279; DSM928HK08|DIHK2975290 |HI22K2390279|DSM928HK08|
|there is nothing here.              |null        |null        |null      |
+------------------------------------+------------+------------+----------+
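
As a side note, on newer versions the expr call can be dropped. This is only a sketch, assuming Spark 3.5+, where regexp_extract_all is also exposed directly in pyspark.sql.functions and expects the pattern as a Column (hence F.lit):

import pyspark.sql.functions as F

# assumes Spark 3.5+: regexp_extract_all as a DataFrame function,
# pattern passed as a Column and group index as an int
new_df = df.withColumn('myarray', F.regexp_extract_all('col1', F.lit(r'([A-Z0-9]{2,})'), 1))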