Max value from array of string in pyspark


I am very new to Spark and am trying to find the max value from an array of strings, but I am getting errors. I tried a couple of things, such as creating a dataframe, splitting, and using lit, but ran into further errors. Can anyone please help me?

from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import max
from pyspark.sql.types import StructType, StructField, StringType,IntegerType,TimestampType,ArrayType
from datetime import datetime

new_array: list = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '17', '18', '19', '20', '22']

df = max(new_array) #error in this line
df.show()
df.printSchema()

Error :

Invalid argument, not a string or column: ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '17', '18', '19', '20', '22'] of type &lt;class 'list'&gt;. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.

Thanks a lot in Advance

CodePudding user response:

Issues with your code:

  • pyspark.sql.functions.max expects a column (or column name), not a Python list, and its return value is a Column, not a Spark dataframe, so you cannot call .show() on it.
  • If all you need is the largest value of a plain Python list, print(max(numList)) with the built-in max is enough; Spark does not participate in that at all.
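To illustrate the second point, for a list that already lives in the driver the built-in max alone does the job, no Spark required:

```python
numList = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09',
           '10', '11', '12', '13', '17', '18', '19', '20', '22']

# The built-in max compares the strings lexicographically and
# returns a plain Python str, not a dataframe.
print(max(numList))  # → '22'
```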

Try this:

from pyspark.sql import functions as F
import pandas as pd

numList = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '17', '18', '19', '20', '22']

pdDF = pd.DataFrame(numList, columns=['num'])
df = spark.createDataFrame(pdDF)
df.select(F.max(df.num)).show()

# Of course, you can also chain the calls above with '.' instead of
# using intermediate variables.
spark \
    .createDataFrame(pd.DataFrame(numList, columns=['num'])) \
    .select(F.max(F.col('num'))) \
    .show()

Output:

+--------+
|max(num)|
+--------+
|      22|
+--------+

Steps:

  • Create a pandas dataframe from the list with pd.DataFrame.
  • Create a Spark dataframe from the pandas dataframe with spark.createDataFrame.
  • Use the max() function in the select clause.
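One caveat worth adding here (my note, not part of the answer above): F.max on a string column compares values lexicographically. That happens to give the right result in this question only because every value is zero-padded to two digits; with mixed widths you would want to cast first, e.g. F.max(F.col('num').cast('int')). The pitfall is easy to demonstrate in plain Python, where string comparison works the same way:

```python
# Lexicographic comparison: '9' > '1', so '9' beats '10' as a string.
assert max(['9', '10']) == '9'

# Comparing numerically gives the intended answer.
assert max(['9', '10'], key=int) == '10'

# Zero-padding makes string order match numeric order, which is why
# the '00'..'22' list in this question works without a cast.
assert max(['09', '10']) == '10'
```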

I hope this will help you.

CodePudding user response:

Just wanted to share an update: I was able to get the desired results with the code below.

from pyspark.sql import functions as F

new_array: list = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '17', '18', '19', '20', '22']

df = spark.createDataFrame([(i,) for i in new_array], ["new_array"])
df.select(F.max(df.new_array)).show()
max_val = df.select(F.max(df.new_array).alias("maxval")).first()["maxval"]
print(max_val)

Output :

+--------------+
|max(new_array)|
+--------------+
|            22|
+--------------+

22