How to add a column based on a condition using PySpark?

Given an input data set with name, age and city, add a new column that is populated with ‘Y’ if age > 18 and ‘N’ otherwise.

I need to solve this using Apache Spark with PySpark.

Input text file:

sumit,30,bangalore
kapil,32,hyderabad
sathish,16,chennai
ravi,39,bangalore
kavita,12,hyderabad
kavya,19,mysore

Expected output:

sumit,30,bangalore,Y
kapil,32,hyderabad,Y
sathish,16,chennai,N
ravi,39,bangalore,Y
kavita,12,hyderabad,N
kavya,19,mysore,Y

CodePudding user response：

I guess you should do the following.

Create a PySpark DataFrame from the text file:

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("MyApp").getOrCreate()

df = spark.read.format("csv").option("header", "false").load("input.txt")

Name the columns and cast age to an integer. A headerless CSV is loaded with the default column names _c0, _c1, _c2 (not _1, _2, _3), so rename those directly:

# Rename all three columns in one call, then convert age from string to int.
df = df.toDF("name", "age", "city")
df = df.withColumn("age", df["age"].cast("int"))

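Add a new column to the DataFrame based on age and write the output:

from pyspark.sql.functions import when

# when() must start the expression; it cannot be chained onto a cast boolean column.
df = df.withColumn("eligible", when(df["age"] > 18, "Y").otherwise("N"))

# Spark writes a directory named "output.txt" containing part files, not a single file.
df.write.format("csv").option("header", "false").save("output.txt")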

CodePudding user response：

This could be done using when/otherwise in Spark:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when

spark = SparkSession.builder.master("local[*]").getOrCreate()

data = [
    ["sumit", 30, "bangalore"],
    ["kapil", 32, "hyderabad"],
    ["sathish", 16, "chennai"],
    ["ravi", 39, "bangalore"],
    ["kavita", 12, "hyderabad"],
    ["kavya", 19, "mysore"],
]
df = spark.createDataFrame(data).toDF("name", "age", "city")
result = df.withColumn("result", when(df.age > 18, "Y").otherwise("N"))
result.show()


+-------+---+---------+------+
|   name|age|     city|result|
+-------+---+---------+------+
|  sumit| 30|bangalore|     Y|
|  kapil| 32|hyderabad|     Y|
|sathish| 16|  chennai|     N|
|   ravi| 39|bangalore|     Y|
| kavita| 12|hyderabad|     N|
|  kavya| 19|   mysore|     Y|
+-------+---+---------+------+