pyspark - Updating a column based on a calculated value from another calculated column

Time:05-30

  1. The following code loads data from a CSV file into the dataframe df. A SQL table myTable corresponding to this df already exists, and the data will be imported from df into myTable.
  2. myTable has several columns. Column5 and Column6 exist in myTable and are calculated columns, but they do not exist in the CSV file.
  3. The values of Column5 are calculated from the values of Column1, and the values of Column6 are calculated from the calculated values of Column5. These values are calculated by testFunction1 and testFunction2, respectively.
  4. The code works fine for Column5, but it throws the following error on the last line of the code below: .withColumn("Column6", newFunction2(df.Column5))

Question: What am I doing wrong here, and how can I fix the error? Note: If I remove Column6 from myTable and remove the last line of the code below, the code successfully loads the data into myTable, with Column5 filled (as intended) with values calculated from Column1.

Error:

AttributeError: 'DataFrame' object has no attribute 'Column6'

Code:

from pyspark.sql.types import StringType
from pyspark.sql import functions as F

df = spark.read.csv(".......dfs.core.windows.net/myDataFile.csv", header="true", inferSchema="false")

def testFunction1(Col1Value):
  #do some calculation on column1 value and return it to column5
  return mystr1

def testFunction2(value):
  # do some calculation on column5 value and return it to column6
  return mystr2

newFunction1 = F.udf(testFunction1, StringType())
newFunction2 = F.udf(testFunction2, StringType())

df2 = df.withColumn("Column5", newFunction1(df.Column1)) \
      .withColumn("Column6", newFunction2(df.Column5)) 

CodePudding user response:

The problem is in how df2 is created. withColumn does not modify df; it returns a new dataframe. In the chained call, the second line refers to df.Column5, but df itself has no column named Column5. That column exists only on the dataframe returned by the first withColumn. Break the last part into two statements and reference Column5 through df2, as in the code below. This should solve the problem:

df2 = df.withColumn("Column5", newFunction1(df.Column1))
df3 = df2.withColumn("Column6", newFunction2(df2.Column5))