How to implement this using PySpark?
Given an input data set with name, age, and city, add a new column populated with 'Y' if age > 18, else 'N'.
Solve this using Apache PySpark.
Input text file:
sumit,30,bangalore
kapil,32,hyderabad
sathish,16,chennai
ravi,39,bangalore
kavita,12,hyderabad
kavya,19,mysore
Output:
sumit,30,bangalore,Y
kapil,32,hyderabad,Y
sathish,16,chennai,N
ravi,39,bangalore,Y
kavita,12,hyderabad,N
kavya,19,mysore,Y
CodePudding user response:
I guess that you should:
- create a PySpark DataFrame from the text file
from pyspark.sql import SparkSession
from pyspark.sql.functions import when

spark = SparkSession.builder.appName("MyApp").getOrCreate()
df = spark.read.format("csv").option("header", "false").load("input.txt")
- rename the columns (the CSV reader already splits each line on commas, so the default names are _c0, _c1, _c2) and cast age to an integer
df = df.withColumnRenamed("_c0", "name") \
       .withColumnRenamed("_c1", "age") \
       .withColumnRenamed("_c2", "city")
df = df.withColumn("age", df["age"].cast("int"))
- add a new column to the DataFrame based on age and write the output
df = df.withColumn("eligible", when(df["age"] > 18, "Y").otherwise("N"))
# Spark writes a directory of part files rather than a single output.txt
df.write.format("csv").option("header", "false").save("output")
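If you prefer, the rename and cast steps can be avoided by giving the CSV reader an explicit schema up front. A minimal sketch, assuming the same input.txt path and an output directory named output:

from pyspark.sql import SparkSession
from pyspark.sql.functions import when
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

spark = SparkSession.builder.appName("MyApp").getOrCreate()

# Explicit schema: columns are named and typed at read time, so no rename/cast is needed.
schema = StructType([
    StructField("name", StringType(), True),
    StructField("age", IntegerType(), True),
    StructField("city", StringType(), True),
])

df = spark.read.schema(schema).csv("input.txt")
result = df.withColumn("eligible", when(df["age"] > 18, "Y").otherwise("N"))

# Spark writes a directory of part files, not a single file.
result.write.mode("overwrite").option("header", "false").csv("output")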
CodePudding user response:
This could be done using when/otherwise with Spark:
from pyspark.sql import SparkSession
from pyspark.sql.functions import when

spark = SparkSession.builder.master("local[*]").getOrCreate()
data = [
["sumit", 30, "bangalore"],
["kapil", 32, "hyderabad"],
["sathish", 16, "chennai"],
["ravi", 39, "bangalore"],
["kavita", 12, "hyderabad"],
["kavya", 19, "mysore"],
]
df = spark.createDataFrame(data).toDF("name", "age", "city")
result = df.withColumn("result", when(df.age > 18, "Y").otherwise("N"))
result.show()
+-------+---+---------+------+
|   name|age|     city|result|
+-------+---+---------+------+
|  sumit| 30|bangalore|     Y|
|  kapil| 32|hyderabad|     Y|
|sathish| 16|  chennai|     N|
|   ravi| 39|bangalore|     Y|
| kavita| 12|hyderabad|     N|
|  kavya| 19|   mysore|     Y|
+-------+---+---------+------+
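To also produce the comma-separated file asked for in the question, the result can be written back with the CSV writer; a short sketch, assuming an output directory named output (Spark writes part files inside it rather than a single file):

# coalesce(1) collects the data into a single partition so only one part file is written
result.coalesce(1).write.mode("overwrite").option("header", "false").csv("output")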