I have the following RDD:
x: Array[String] =
Array("Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready
group: NT=app_1,hadoop-exec,sparkConnection,Ready
group: NT=app_exmpl_2,DB-exec,MDBConnection,NR
group: NT=apprexec,hadoop-exec,sparkConnection,Ready
group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR
I only want the first field of each element of this RDD, that is, everything before the first comma, as in the following example:
Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app
This is what I have tried so far:
//Here I get the RDD:
val x = spark.sparkContext.parallelize(List(value)).collect()
//Try to use regex on it, this regex is to get until the first comma
val regex1 = """(^(.+?),)"""
val rdd_1 = x.map(g => g.matches(regex1))
This does not work for me: instead of the extracted text, I just get an Array of Boolean. What am I doing wrong?
I am new to Apache Spark with Scala. If you need any more information, just let me know. Thanks in advance!
CodePudding user response:
Try this regex:
^\s*([^,]+)(_\w+)?
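The question's attempt returns Booleans because `String.matches` only tests whether the whole string matches the pattern; it does not extract anything. A minimal sketch of one way to apply the regex above and pull out the captured text, using `scala.util.matching.Regex` from the standard library. The names `FirstFieldRegex` and `firstField` are made up for illustration, and falling back to the whole line when nothing matches is an assumption about the desired behaviour:

```scala
import scala.util.matching.Regex

object FirstFieldRegex {
  // The regex from this answer; group 1 captures everything up to the first comma.
  val pattern: Regex = """^\s*([^,]+)(_\w+)?""".r

  // findFirstMatchIn returns Option[Match]; group(1) is the captured text.
  // Assumption: a line with no match is returned unchanged.
  def firstField(line: String): String =
    pattern.findFirstMatchIn(line).map(_.group(1)).getOrElse(line)

  def main(args: Array[String]): Unit = {
    // Sample lines copied from the question.
    val lines = Array(
      "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
      "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
      "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR"
    )
    lines.map(firstField).foreach(println)
  }
}
```

The same function should work unchanged inside `rdd.map(FirstFieldRegex.firstField)` if the data is kept as an RDD instead of being collected first.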
CodePudding user response:
Try this: split each line on commas and keep the first element.
val x: Array[String] =
  Array(
    "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
    "group: NT=app_exmpl_2,DB-exec,MDBConnection,NR",
    "group: NT=apprexec,hadoop-exec,sparkConnection,Ready",
    "group: NT=nt_prblm_app,hadoop-exec,sparkConnection,NR")
val rdd = sc.parallelize(x)
val result = rdd.map(lines => {
  lines.split(",")(0)
})
result.collect().foreach(println)
Output:
Et: NT=grouptoClassify
group: NT=app_1
group: NT=app_exmpl_2
group: NT=apprexec
group: NT=nt_prblm_app
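One caveat with `split(",")(0)`: Java's `String.split` drops trailing empty strings, so a line consisting only of commas yields an empty array and indexing it with `(0)` throws. A hedged alternative sketch uses `takeWhile`, which returns an empty string in that case. The name `FirstFieldTakeWhile` and the malformed sample line are hypothetical, not from the question:

```scala
object FirstFieldTakeWhile {
  // Everything before the first comma; the whole line if it contains no comma.
  def firstField(line: String): String = line.takeWhile(_ != ',')

  def main(args: Array[String]): Unit = {
    val sample = Array(
      "Et: NT=grouptoClassify,hadoop-exec,sparkConnection,Ready",
      "group: NT=app_1,hadoop-exec,sparkConnection,Ready",
      "," // hypothetical malformed line; split(",")(0) would throw here
    )
    sample.map(firstField).foreach(println)
  }
}
```

With an RDD, `rdd.map(FirstFieldTakeWhile.firstField)` could stand in for the `split` call, assuming the lines look like the ones in the question.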