I have the data below:
+-----+--------------------+-----------+
|Sr No|             User Id|Transaction|
+-----+--------------------+-----------+
|    1|paytm 111002203@p...|       100D|
|    2|paytm 111002203@p...|        50C|
|    3|paytm 111002203@p...|        20C|
|    4|paytm 111002203@p...|        10C|
|    5|                null|         1C|
+-----+--------------------+-----------+
I need to separate the Transaction column into Amount and CreditOrDebit. I tried this:
df_sample.withColumn('CreditOrDebit',substring('Transaction',-1,1)).withColumn('Amount',substring('Transaction',-2,-4)).show()
# +-----+--------------------+-----------+-------------+------+
# |Sr No|             User Id|Transaction|CreditOrDebit|Amount|
# +-----+--------------------+-----------+-------------+------+
# |    1|paytm 111002203@p...|       100D|            D|      |
# |    2|paytm 111002203@p...|        50C|            C|      |
# |    3|paytm 111002203@p...|        20C|            C|      |
# |    4|paytm 111002203@p...|        10C|            C|      |
# |    5|                null|         1C|            C|      |
# +-----+--------------------+-----------+-------------+------+
So I tried to use instr to get the position of the first occurrence of 'D' or 'C':
df_sample.withColumn('CreditOrDebit',substring('Transaction',-1,1)).withColumn('Amount',substring('Transaction',1,instr('Transaction','C' or 'D'))).show()
I got this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Aniket\SPARKHOME\python\pyspark\sql\functions.py", line 1668, in substring
    return Column(sc._jvm.functions.substring(_to_java_column(str), pos, len))
  File "C:\Users\Aniket\SPARKHOME\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1296, in __call__
  File "C:\Users\Aniket\SPARKHOME\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1260, in _build_args
  File "C:\Users\Aniket\SPARKHOME\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1247, in _get_args
  File "C:\Users\Aniket\SPARKHOME\python\lib\py4j-0.10.9-src.zip\py4j\java_collections.py", line 510, in convert
  File "C:\Users\Aniket\SPARKHOME\python\pyspark\sql\column.py", line 353, in __iter__
    raise TypeError("Column is not iterable")
TypeError: Column is not iterable
So I tried with just 'C', as I was sure instr wouldn't support or inside it, and got the same error.
CodePudding user response:
You can use the regexp_extract function to extract the letters and the digits separately.
from pyspark.sql import functions as F

df = df.withColumn('CreditOrDebit', F.regexp_extract('Transaction', '[A-Z]+', 0)) \
       .withColumn('Amount', F.regexp_extract('Transaction', '[0-9]+', 0))
df.show(truncate=False)
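Note that regexp_extract returns a string column (and an empty string when there is no match), so the Amount column here is still a string. A minimal sketch, assuming the same df and the F alias from above, if you want a numeric amount:

# Cast the extracted digits to an integer so the amount can be summed or compared
df = df.withColumn('Amount', F.regexp_extract('Transaction', '[0-9]+', 0).cast('int'))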
CodePudding user response:
You could reuse the "CreditOrDebit" column you had already created: substring_index finds a string (C or D here) and returns the part of the string before it.
from pyspark.sql.functions import substring, expr

df_sample.withColumn('CreditOrDebit', substring('Transaction', -1, 1)) \
    .withColumn('Amount', expr("substring_index(Transaction, CreditOrDebit, 1)")) \
    .show()
# +-----------+-------------+------+
# |Transaction|CreditOrDebit|Amount|
# +-----------+-------------+------+
# |       100D|            D|   100|
# |        50C|            C|    50|
# |        20C|            C|    20|
# |        10C|            C|    10|
# |         1C|            C|     1|
# +-----------+-------------+------+
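As a side note on the original error: in this Spark version, pyspark.sql.functions.substring expects plain Python integers for pos and len, so passing it an instr(...) Column is what raises "Column is not iterable". If you prefer the substring/instr idea, you can push the whole expression into SQL with expr, where column-valued arguments are fine. A minimal sketch along those lines, assuming the same df_sample and column names as in the question:

df_sample \
    .withColumn('CreditOrDebit', expr("substring(Transaction, -1, 1)")) \
    .withColumn('Amount', expr("substring(Transaction, 1, instr(Transaction, CreditOrDebit) - 1)")) \
    .show()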
CodePudding user response:
You can use regexp_extract together with when/otherwise:
>>> from pyspark.sql.functions import col, when, regexp_extract
>>> df1 = df.withColumn("Transaction_type", when(col("Transaction").contains("D"), "D").otherwise("C"))
>>> df1.show()
+-----+-----------------+-----------+----------------+
|Sr No|          User Id|Transaction|Transaction_type|
+-----+-----------------+-----------+----------------+
|    1|paytm 111002203@p|       100D|               D|
|    2|paytm 111002203@p|        50C|               C|
|    3|paytm 111002203@p|        20C|               C|
|    4|paytm 111002203@p|        10C|               C|
|    5|             null|         1C|               C|
+-----+-----------------+-----------+----------------+
>>> df1.withColumn('Transaction_amount', regexp_extract(col('Transaction'), r'(\d+)', 1)).show()
+-----+-----------------+-----------+----------------+------------------+
|Sr No|          User Id|Transaction|Transaction_type|Transaction_amount|
+-----+-----------------+-----------+----------------+------------------+
|    1|paytm 111002203@p|       100D|               D|               100|
|    2|paytm 111002203@p|        50C|               C|                50|
|    3|paytm 111002203@p|        20C|               C|                20|
|    4|paytm 111002203@p|        10C|               C|                10|
|    5|             null|         1C|               C|                 1|
+-----+-----------------+-----------+----------------+------------------+
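If you'd rather extract both pieces in a single pass, one anchored pattern with two capture groups also works. A minimal sketch, assuming every Transaction value is digits followed by a single C or D, reusing the imports above:

>>> pattern = r'^(\d+)([CD])$'
>>> # group 1 is the numeric amount, group 2 is the C/D flag
>>> df2 = df.withColumn('Transaction_amount', regexp_extract(col('Transaction'), pattern, 1).cast('int'))
>>> df2 = df2.withColumn('Transaction_type', regexp_extract(col('Transaction'), pattern, 2))
>>> df2.show()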