Spark regex 'COIN' in column values -> rlike approach

Time:10-19

I would like to check whether the column values contain 'COIN' or similar substrings. Is it possible to change my regex so that I don't have to list "CRYPTOCOIN|KUCOIN|COINBASE" explicitly? I'd like to have something like
"regex associated with COIN word|BTCBIT.NET"

Please find my attached code below:

val CRYPTO_CARD_INDICATOR: String = ("BTCBIT.NET|KUCOIN|COINBASE|CRYPTCOIN")
val CryptoCheckDataset = df.withColumn("is_crypto_indicator",when(upper(col("company_name")).rlike(CRYPTO_CARD_INDICATOR), 1).otherwise(0))

CodePudding user response:

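I think the following should work, since rlike matches whenever the pattern is found anywhere in the value, so a plain COIN already covers KUCOIN, COINBASE and CRYPTOCOIN: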

COIN|BTCBIT.NET

Full test in PySpark:

from pyspark.sql.functions import col, upper, when
CRYPTO_CARD_INDICATOR = "COIN|BTCBIT.NET"
df = spark.createDataFrame([('kucoin',), ('coinbase',), ('crypto',)], ['company_name'])

CryptoCheckDataset = df.withColumn("is_crypto_indicator", when(upper(col("company_name")).rlike(CRYPTO_CARD_INDICATOR), 1).otherwise(0))
CryptoCheckDataset.show()
# +------------+-------------------+
# |company_name|is_crypto_indicator|
# +------------+-------------------+
# |      kucoin|                  1|
# |    coinbase|                  1|
# |      crypto|                  0|
# +------------+-------------------+
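
One caveat with the pattern above: the "." in BTCBIT.NET is a regex wildcard, so it matches any character, not only a literal dot. A minimal sketch of the difference follows. It uses Python's built-in re module as a stand-in for rlike, on the assumption that both search for the pattern anywhere in the string and treat these simple patterns the same way. The names LOOSE, STRICT and is_crypto are made up for illustration.

```python
import re

# Assumption: re.search stands in for Spark's rlike (both do a partial match).
LOOSE = "COIN|BTCBIT.NET"      # pattern from the answer; "." matches any character
STRICT = r"COIN|BTCBIT\.NET"   # dot escaped, so only a literal "." matches


def is_crypto(name: str, pattern: str) -> int:
    """Return 1 if the upper-cased name contains a match for pattern, else 0."""
    return 1 if re.search(pattern, name.upper()) else 0


names = ["kucoin", "coinbase", "cryptcoin", "btcbit.net", "btcbitxnet", "crypto"]

# Map each name to (result with LOOSE, result with STRICT).
results = {n: (is_crypto(n, LOOSE), is_crypto(n, STRICT)) for n in names}

for name, (loose, strict) in results.items():
    print(f"{name:>12} loose={loose} strict={strict}")
```

Only "btcbitxnet" differs between the two patterns. If that kind of false positive matters, the escaped form can be passed to rlike in the same way as the original string.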