I have the following PySpark script named sample.py, which contains print statements:
import sys
from pyspark.sql import SparkSession
from pyspark.sql.types import *
import pyspark.sql.functions as f
from datetime import datetime
from time import time

if __name__ == '__main__':
    spark = SparkSession.builder.appName("Test").enableHiveSupport().getOrCreate()
    print("Print statement-1")
    schema = StructType([
        StructField("author", StringType(), False),
        StructField("title", StringType(), False),
        StructField("pages", IntegerType(), False),
        StructField("email", StringType(), False)
    ])
    data = [
        ["author1", "title1", 1, "[email protected]"],
        ["author2", "title2", 2, "[email protected]"],
        ["author3", "title3", 3, "[email protected]"],
        ["author4", "title4", 4, "[email protected]"]
    ]
    df = spark.createDataFrame(data, schema)
    print("Number of records", df.count())
    sys.exit(0)
When I run the spark-submit command below, the print statements do not appear in sample.log:

spark-submit --master yarn --deploy-mode cluster sample.py > sample.log
The scenario: we want to print some information to the log file so that, after the Spark job completes, we can perform further actions based on those print statements in the log file. Please help me with this.
CodePudding user response:
The print statements will not appear in the spark-submit output but in the YARN logs. In cluster deploy mode the driver runs inside a YARN container on one of the cluster nodes, so anything it writes to stdout goes to that container's logs rather than to spark-submit's stdout on the submitting machine (which is what sample.log captures). When you run spark-submit you will get an application ID that looks like application_1234567890123_12345.

After the Spark job has completed, run the following command with that application ID to retrieve the aggregated YARN logs:
yarn logs -applicationId <applicationId>
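If you want to script this end to end, the application ID can be captured from spark-submit's own client-side output. This is a sketch assuming the usual yarn.Client log line format, which may vary across Spark versions; a sample line stands in for a real submission:

```shell
# spark-submit --deploy-mode cluster logs the application ID on the
# submitting machine; extract it with grep. Simulated here with a
# sample line instead of a real submission.
line="INFO yarn.Client: Submitting application application_1234567890123_12345 to ResourceManager"
app_id=$(echo "$line" | grep -oE 'application_[0-9]+_[0-9]+')
echo "$app_id"    # pass this to: yarn logs -applicationId "$app_id"
```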
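Since the goal is to trigger follow-up actions based on a line the driver printed, one possible approach (a sketch; the marker text and file name app.log are assumptions, and the printf line only simulates a saved YARN log) is to save the aggregated logs and grep for the marker:

```shell
# In practice, save the aggregated logs first:
#   yarn logs -applicationId <applicationId> > app.log
# Simulated here with a fabricated log file:
printf 'INFO driver started\nNumber of records 4\nINFO driver done\n' > app.log

# Act on the marker line printed by the driver, if present.
if grep -q "Number of records" app.log; then
    count=$(grep "Number of records" app.log | awk '{print $NF}')
    echo "records=$count"
    # ... trigger follow-up actions here ...
fi
```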