HDP 2.6.5: I create a database in Python with PySpark, but the database doesn't appear in the Hive view

Time:12-18

I have a problem with my database created with PySpark:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession, HiveContext

# Note: this conf is defined but never passed to the SparkSession builder
conf = SparkConf().set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.3.2")

spark = SparkSession.builder \
        .appName("test") \
        .getOrCreate()

df = spark.read.option("multiline", "true").json("hdfs://sandbox-hdp.hortonworks.com:8020/user/maria_dev/output2.json")

# Database on Hive
spark.sql("create database testdb")

print('Here is the DataFrame: ', df)

df.write.mode("overwrite").saveAsTable("testdb.test3")

spark.sql("SHOW DATABASES").show()

df = spark.sql("SELECT * FROM testdb.test3")
df.show()

This code runs without errors, but when I go to the Ambari interface I don't see my database: there are only "default" and "foodmart". Yet when I run, in the shell:

spark.sql("SHOW DATABASES").show()
df = spark.sql("SELECT * FROM testdb.test3")

I can see my database and its content like this:

+---------+
|namespace|
+---------+
|  default|
|   testdb|
+---------+

What can I do to see my database in Hive and to use it in Zeppelin?

I tried with maria_dev and root. I also tried restarting Hive.

CodePudding user response:

You have not set .enableHiveSupport() while creating the SparkSession. Currently the database is stored in Spark's default warehouse location, so whenever you start the shell it reads the saved table from that warehouse rather than from the Hive metastore. To save the data into Hive, you can do the following:

from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession, HiveContext

conf = SparkConf().set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.3.2")

# enableHiveSupport() makes Spark SQL use the Hive metastore when storing
# tables. (The comment must not follow the line-continuation backslash,
# which would be a SyntaxError, so it goes here instead.)
spark = SparkSession.builder \
        .appName("test") \
        .enableHiveSupport() \
        .getOrCreate()

df = spark.read.option("multiline", "true").json("hdfs://sandbox-hdp.hortonworks.com:8020/user/maria_dev/output2.json")

# Database on Hive
spark.sql("create database testdb")

print('Here is the DataFrame: ', df)

df.write.mode("overwrite").saveAsTable("testdb.test3")

spark.sql("SHOW DATABASES").show()

df = spark.sql("SELECT * FROM testdb.test3")
df.show()

CodePudding user response:

Thank you very much for your answer. I tried adding this line, but now I get these warnings:

22/12/17 16:39:26 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
22/12/17 16:39:26 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
22/12/17 16:39:29 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
22/12/17 16:39:29 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore [email protected]
22/12/17 16:39:29 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
22/12/17 16:39:30 WARN ObjectStore: Failed to get database testdb2, returning NoSuchObjectException
22/12/17 16:39:34 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
22/12/17 16:39:35 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
22/12/17 16:39:35 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
22/12/17 16:39:35 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist

I can find my database under spark-warehouse in FileZilla, but not in Hive (Ambari interface).
