I have the data below:
+-----+--------------------+-----------+
|Sr No|             User Id|Transaction|
+-----+--------------------+-----------+
|    1|paytm 111002203@p...|       100D|
|    2|paytm 111002203@p...|        50C|
|    3|paytm 111002203@p...|        20C|
|    4|paytm 111002203@p...|        10C|
|    5|                null|         1C|
+-----+--------------------+-----------+
I need to separate the Transaction column into Amount and CreditOrDebit. I tried this:
df_sample.withColumn('CreditOrDebit',substring('Transaction',-1,1)).withColumn('Amount',substring('Transaction',-2,-4)).show()
# +-----+--------------------+-----------+-------------+------+
# |Sr No|             User Id|Transaction|CreditOrDebit|Amount|
# +-----+--------------------+-----------+-------------+------+
# |    1|paytm 111002203@p...|       100D|            D|      |
# |    2|paytm 111002203@p...|        50C|            C|      |
# |    3|paytm 111002203@p...|        20C|            C|      |
# |    4|paytm 111002203@p...|        10C|            C|      |
# |    5|                null|         1C|            C|      |
# +-----+--------------------+-----------+-------------+------+
So I tried to use instr to get the position of the first occurrence of 'D' or 'C':
df_sample.withColumn('CreditOrDebit',substring('Transaction',-1,1)).withColumn('Amount',substring('Transaction',1,instr('Transaction','C' or 'D'))).show()
I got this error:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\Aniket\SPARKHOME\python\pyspark\sql\functions.py", line 1668, in substring
    return Column(sc._jvm.functions.substring(_to_java_column(str), pos, len))
  File "C:\Users\Aniket\SPARKHOME\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1296, in __call__
  File "C:\Users\Aniket\SPARKHOME\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1260, in _build_args
  File "C:\Users\Aniket\SPARKHOME\python\lib\py4j-0.10.9-src.zip\py4j\java_gateway.py", line 1247, in _get_args
  File "C:\Users\Aniket\SPARKHOME\python\lib\py4j-0.10.9-src.zip\py4j\java_collections.py", line 510, in convert
  File "C:\Users\Aniket\SPARKHOME\python\pyspark\sql\column.py", line 353, in __iter__
    raise TypeError("Column is not iterable")
TypeError: Column is not iterable
So I tried with just 'C', as I was sure instr wouldn't support or inside it, and got the same error.
CodePudding user response:
You can use the regexp_extract function to extract the letters and the digits separately.
from pyspark.sql import functions as F

df = df.withColumn('CreditOrDebit', F.regexp_extract('Transaction', '[A-Z]+', 0)) \
       .withColumn('Amount', F.regexp_extract('Transaction', '[0-9]+', 0))
df.show(truncate=False)
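Note that regexp_extract returns a string column (and an empty string when there is no match), so the Amount column here is still a string. A minimal sketch, assuming the same df and the F alias from above, if you want a numeric amount:

# Cast the extracted digits to an integer so the amount can be summed or compared
df = df.withColumn('Amount', F.regexp_extract('Transaction', '[0-9]+', 0).cast('int'))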
CodePudding user response:
You could reuse the "CreditOrDebit" column you had already created: substring_index finds a string (C or D here) and returns the part of the string before it.
from pyspark.sql.functions import substring, expr

df_sample.withColumn('CreditOrDebit', substring('Transaction', -1, 1)) \
    .withColumn('Amount', expr("substring_index(Transaction, CreditOrDebit, 1)")) \
    .show()
# +-----------+-------------+------+
# |Transaction|CreditOrDebit|Amount|
# +-----------+-------------+------+
# |       100D|            D|   100|
# |        50C|            C|    50|
# |        20C|            C|    20|
# |        10C|            C|    10|
# |         1C|            C|     1|
# +-----------+-------------+------+
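As a side note on the original error: in this Spark version, pyspark.sql.functions.substring expects plain Python integers for pos and len, so passing it an instr(...) Column is what raises "Column is not iterable". If you prefer the substring/instr idea, you can push the whole expression into SQL with expr, where column-valued arguments are fine. A minimal sketch along those lines, assuming the same df_sample and column names as in the question:

df_sample \
    .withColumn('CreditOrDebit', expr("substring(Transaction, -1, 1)")) \
    .withColumn('Amount', expr("substring(Transaction, 1, instr(Transaction, CreditOrDebit) - 1)")) \
    .show()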
CodePudding user response:
You can use regexp_extract together with when/otherwise:
>>> from pyspark.sql.functions import col, when, regexp_extract
>>> df1 = df.withColumn("Transaction_type", when(col("Transaction").contains("D"), "D").otherwise("C"))
>>> df1.show()
+-----+-----------------+-----------+----------------+
|Sr No|          User Id|Transaction|Transaction_type|
+-----+-----------------+-----------+----------------+
|    1|paytm 111002203@p|       100D|               D|
|    2|paytm 111002203@p|        50C|               C|
|    3|paytm 111002203@p|        20C|               C|
|    4|paytm 111002203@p|        10C|               C|
|    5|             null|         1C|               C|
+-----+-----------------+-----------+----------------+
>>> df1.withColumn('Transaction_amount', regexp_extract(col('Transaction'), r'(\d+)', 1)).show()
+-----+-----------------+-----------+----------------+------------------+
|Sr No|          User Id|Transaction|Transaction_type|Transaction_amount|
+-----+-----------------+-----------+----------------+------------------+
|    1|paytm 111002203@p|       100D|               D|               100|
|    2|paytm 111002203@p|        50C|               C|                50|
|    3|paytm 111002203@p|        20C|               C|                20|
|    4|paytm 111002203@p|        10C|               C|                10|
|    5|             null|         1C|               C|                 1|
+-----+-----------------+-----------+----------------+------------------+
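If you'd rather extract both pieces in a single pass, one anchored pattern with two capture groups also works. A minimal sketch, assuming every Transaction value is digits followed by a single C or D, reusing the imports above:

>>> pattern = r'^(\d+)([CD])$'
>>> # group 1 is the numeric amount, group 2 is the C/D flag
>>> df2 = df.withColumn('Transaction_amount', regexp_extract(col('Transaction'), pattern, 1).cast('int'))
>>> df2 = df2.withColumn('Transaction_type', regexp_extract(col('Transaction'), pattern, 2))
>>> df2.show()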