Regex RDD using Apache Spark Scala

Time:12-02

I have the following RDD:

x: Array[String] =
Array("Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready
group: NT=app_1,hadoop-exec,sparkConnection,Ready
group: NT=app_exmpl_2,DB-exec,MDBConnection,NR
group: NT=apprexec,hadoop-exec,sparkConnection,Ready
group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")

I just want the first part of each line of this RDD (everything before the first comma), as in the following example:

Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app

This is how I am trying to do it:

// Here I get the RDD:
val x = spark.sparkContext.parallelize(List(value)).collect()
// Try to use a regex on it; this regex should match everything up to the first comma
val regex1 = """(^(.+?),)"""
val rdd_1 = x.map(g => g.matches(regex1))

This is not working for me because I just get an Array of Boolean. What am I doing wrong?

I am new to Apache Spark with Scala. If you need anything more, just let me know. Thanks in advance!

CodePudding user response:

Try this regex:

^\s*([^,]+)(_\w+)?
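As a rough illustration of how such a pattern could be applied, here is a minimal sketch in plain Scala, assuming the pattern reads `^\s*([^,]+)(_\w+)?` and that the first capture group is the part you want. The names `FirstFieldRegex`, `firstField`, and `extract` are made up for this example and are not part of Spark or of the original answer.

```scala
import scala.util.matching.Regex

object FirstFieldRegex {
  // Group 1 captures everything before the first comma (leading whitespace skipped).
  val firstField: Regex = """^\s*([^,]+)(_\w+)?""".r

  // Returns group 1 when the pattern matches; falls back to the unchanged line otherwise
  // (the fallback is an assumption of this sketch, e.g. for empty lines).
  def extract(line: String): String =
    firstField.findFirstMatchIn(line).map(_.group(1)).getOrElse(line)

  def main(args: Array[String]): Unit = {
    val lines = Seq(
      "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
      "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR")
    lines.map(extract).foreach(println)
  }
}
```

With an `RDD[String]` holding one record per element, the same function should be usable as `rdd.map(FirstFieldRegex.extract)`.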

CodePudding user response:

Try this:

val x: Array[String] =
  Array(
    "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
    "group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
    "group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")

val rdd = sc.parallelize(x)

// Keep only the text before the first comma of each line
val result = rdd.map(line => line.split(",")(0))

result.collect().foreach(println)

Output:

Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app
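As for why the original attempt produced Booleans: `String.matches` only reports whether the entire string matches the pattern, so mapping it over the data yields `true`/`false` values rather than extracted text. The sketch below contrasts that with `split`, using a pattern like the one in the question; the object name `MatchesVsSplit` is hypothetical, and passing a limit of 2 to `split` is an optional tweak (not in the answer above) that stops splitting after the first comma.

```scala
object MatchesVsSplit {
  val line = "group: NT=app_1,hadoop-exec,sparkConnection,Ready"

  // matches() tests the WHOLE string against the pattern and returns a Boolean.
  // The pattern requires the string to end with a comma, so this is false here.
  val asBoolean: Boolean = line.matches("""(^(.+?),)""")

  // split with limit 2 produces at most two pieces: before and after the first comma.
  val firstField: String = line.split(",", 2)(0)

  def main(args: Array[String]): Unit = {
    println(asBoolean)  // false
    println(firstField) // group: NT=app_1
  }
}
```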