Converting column data to uppercase in PySpark

I have a dataset df with the following columns:

country| indicator|date|year&week| value

I want to convert only the data in the country column to upper case using PySpark (only the values, not the column name). I tried:

import pyspark.sql.functions as f

df.select("*", f.upper("country"))
display(df)

but it fails with the error: 'NoneType' object has no attribute 'select'

CodePudding user response：

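I would not use select here. select does not change the DataFrame in place; called with "*", it returns a new DataFrame that has the function's result added as an extra column.

I used withColumn instead. Given the name of an existing column, it returns a DataFrame with that column replaced, so assign the result back to df. Please refer to the following code snippet: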

import pyspark.sql.functions as f
import pandas as pd

# Sample Data
data = {
  "country": ["United States", "Canada", "spain", "germany"],
  "indicator": ["1", "2", "3", "4"],
  "date": ["2022/01/01", "2021/01/01", "2020/01/01", "2019/01/01"],
  "year&week": ["2022-52", "2021-34", "2020-32", "2019-45"],
  "value": ["56", "28", "258", "425"]
}
pandas_df = pd.DataFrame.from_dict(data)
# Convert to a Spark DataFrame (assumes an existing SparkSession named spark)
df = spark.createDataFrame(pandas_df)
# Apply your function to the column you choose
df = df.withColumn("country", f.upper(f.col("country")))

Now you can check with df.show() or display(df) and you'll get the following output:

df.show()
+-------------+---------+----------+---------+-----+
|      country|indicator|      date|year&week|value|
+-------------+---------+----------+---------+-----+
|UNITED STATES|        1|2022/01/01|  2022-52|   56|
|       CANADA|        2|2021/01/01|  2021-34|   28|
|        SPAIN|        3|2020/01/01|  2020-32|  258|
|      GERMANY|        4|2019/01/01|  2019-45|  425|
+-------------+---------+----------+---------+-----+

CodePudding user response：

import pyspark.sql.functions as f

# Sample data
simple_data = [["Canada", "Y"], ["Spain", "N"], ["Brazil", "Y"], ["Japan", "Y"], ["India", "N"]]
df = spark.createDataFrame(simple_data, ["country", "indicator"])

# Input
display(df)

# Upper-case only the values in the country column
upper_df = df.withColumn("country", f.upper("country"))

# Output
display(upper_df)