I am very new to Spark and am trying to find the max value in an array of strings, but I keep getting errors. I tried a couple of things, such as creating a DataFrame, using split, and using lit, but ran into further errors. Can anyone please help?
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
from pyspark.sql.functions import max
from pyspark.sql.types import StructType, StructField, StringType,IntegerType,TimestampType,ArrayType
from datetime import datetime
new_array: list = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '17', '18', '19', '20', '22']
df = max(new_array) #error in this line
df.show()
df.printSchema()
Error :
Invalid argument, not a string or column: ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '17', '18', '19', '20', '22'] of type <class 'list'>. For column literals, use 'lit', 'array', 'struct' or 'create_map' function.
Thanks a lot in Advance
CodePudding user response:
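Issue with your code:
- The line "from pyspark.sql.functions import max" replaces Python's built-in max with Spark's aggregate function, which expects a column name or a Column, not a Python list. That is why the call fails. Even with a valid argument it returns a Column, not a Spark DataFrame, so show() and printSchema() are not available on the result.
- If you only need the max of a Python list, drop that import and just use
print(max(numList))
to get the max value. Spark does not take part in this at all.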
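The plain-Python route can be sketched as follows. This is a minimal example, not the only way to do it: it reaches the built-in through the builtins module so it still works if the name max has been rebound by the Spark import, and it shows that strings compare lexicographically, which only matches numeric order here because the values are zero-padded. The variable names max_str and max_num are made up for illustration.

```python
import builtins

new_array = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09',
             '10', '11', '12', '13', '17', '18', '19', '20', '22']

# builtins.max is always Python's own max, even if the bare name `max`
# was rebound by `from pyspark.sql.functions import max`.
max_str = builtins.max(new_array)           # compares strings lexicographically
max_num = builtins.max(new_array, key=int)  # compares by numeric value, returns the string

print(max_str)
print(max_num)

# The two orderings can differ when values are not zero-padded.
print(builtins.max(['9', '10']))            # lexicographic winner is '9'
print(builtins.max(['9', '10'], key=int))   # numeric winner is '10'
```

If the values might not be zero-padded in real data, the key=int variant (or casting to an integer first) is the safer choice.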
Try this:
from pyspark.sql import functions as F
import pandas as pd
numList = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '17', '18', '19', '20', '22']
pdDF = pd.DataFrame(numList, columns=['num'])
df = spark.createDataFrame(pdDF)
df.select(F.max(df.num)).show()
# You can also chain the calls with '.' into a single expression.
spark \
.createDataFrame(pd.DataFrame(numList, columns=['num'])) \
.select(F.max(F.col('num'))) \
.show()
Output:
+--------+
|max(num)|
+--------+
|      22|
+--------+
Steps:
- Create a pandas DataFrame from the list.
- Create a Spark DataFrame from the pandas DataFrame.
- Use the max() function in the select clause.
I hope this will help you.
CodePudding user response:
Just wanted to share an update: I was able to get the desired result with the code below.
from pyspark.sql import functions as F
new_array: list = ['00', '01', '02', '03', '04', '05', '06', '07', '08', '09', '10', '11', '12', '13', '17', '18', '19', '20', '22']
df = spark.createDataFrame([(i,) for i in new_array], ["new_array"])
df.select(F.max(df.new_array)).show()
max_val = df.select(F.max(df.new_array).alias("maxval")).first()["maxval"]
print(max_val)
Output :
+--------------+
|max(new_array)|
+--------------+
|            22|
+--------------+
22