I have a problem with my database created with pyspark:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession, HiveContext

conf = SparkConf().set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.3.2")
spark = SparkSession.builder \
    .appName("test") \
    .getOrCreate()
df = spark.read.option("multiline", "true").json("hdfs://sandbox-hdp.hortonworks.com:8020/user/maria_dev/output2.json")

# Database on Hive
spark.sql("create database testdb")
print('Voici le Dataframe : ', df)
df.write.mode("overwrite").saveAsTable("testdb.test3")
spark.sql("SHOW DATABASES").show()
df = spark.sql("SELECT * FROM testdb.test3")
df.show()
I get no errors with this code, but when I go to the Ambari interface I don't see my database: there are only "default" and "foodmart". But in the shell, with:
spark.sql("SHOW DATABASES").show()
df = spark.sql("SELECT * FROM testdb.test3")
I can see my database and its content, like this:
+---------+
|namespace|
+---------+
|  default|
|   testdb|
+---------+
What can I do to see my database in Hive and use it in Zeppelin?
I tried with maria_dev and root, and I tried restarting Hive.
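One way to narrow this down (a diagnostic sketch, assuming the same spark session as above) is to ask Spark where the table's files actually live; the "Location" row shows whether the table went to the Hive warehouse or to a local spark-warehouse directory:
# Diagnostic: the "Location" row in the output shows where testdb.test3 is stored.
spark.sql("DESCRIBE FORMATTED testdb.test3").show(50, truncate=False)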
CodePudding user response:
You have not set .enableHiveSupport() while creating the SparkSession. Without it, Spark uses its own catalog and stores the table in a default spark-warehouse location; whenever you start the shell it reads the saved table from that warehouse, which is why the Spark shell sees it but Hive does not. To save the data into Hive, you can do the following:
from pyspark import SparkContext, SparkConf
from pyspark.sql import SQLContext, SparkSession, HiveContext

conf = SparkConf().set("spark.jars.packages", "org.mongodb.spark:mongo-spark-connector_2.11:2.3.2")

# enableHiveSupport() tells Spark SQL to use the Hive metastore, so tables
# saved with saveAsTable become visible to Hive.
spark = SparkSession.builder \
    .appName("test") \
    .config(conf=conf) \
    .enableHiveSupport() \
    .getOrCreate()
df = spark.read.option("multiline", "true").json("hdfs://sandbox-hdp.hortonworks.com:8020/user/maria_dev/output2.json")

# Database on Hive
spark.sql("create database testdb")
print('Voici le Dataframe : ', df)
df.write.mode("overwrite").saveAsTable("testdb.test3")
spark.sql("SHOW DATABASES").show()
df = spark.sql("SELECT * FROM testdb.test3")
df.show()
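With Hive support enabled, a quick way to verify the setup (a sketch, assuming the spark session above) is to print the warehouse directory and the databases the catalog can see:
# Where managed tables are written, and which databases the active catalog sees.
print(spark.conf.get("spark.sql.warehouse.dir"))
for db in spark.catalog.listDatabases():
    print(db.name, db.locationUri)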
CodePudding user response:
Thank you very much for your answer. I tried adding this line, but now I get other errors:
22/12/17 16:39:26 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
22/12/17 16:39:26 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
22/12/17 16:39:29 WARN ObjectStore: Version information not found in metastore. hive.metastore.schema.verification is not enabled so recording the schema version 2.3.0
22/12/17 16:39:29 WARN ObjectStore: setMetaStoreSchemaVersion called but recording version is disabled: version = 2.3.0, comment = Set by MetaStore [email protected]
22/12/17 16:39:29 WARN ObjectStore: Failed to get database global_temp, returning NoSuchObjectException
22/12/17 16:39:30 WARN ObjectStore: Failed to get database testdb2, returning NoSuchObjectException
22/12/17 16:39:34 WARN SessionState: METASTORE_FILTER_HOOK will be ignored, since hive.security.authorization.manager is set to instance of HiveAuthorizerFactory.
22/12/17 16:39:35 WARN HiveConf: HiveConf of name hive.internal.ss.authz.settings.applied.marker does not exist
22/12/17 16:39:35 WARN HiveConf: HiveConf of name hive.stats.jdbc.timeout does not exist
22/12/17 16:39:35 WARN HiveConf: HiveConf of name hive.stats.retries.wait does not exist
I can find my database under spark-warehouse in FileZilla, but not in Hive (the Ambari interface).
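If the tables keep landing in a local spark-warehouse directory, the session is probably not connected to the sandbox's Hive metastore. A sketch of pointing it there explicitly (the host is taken from the HDFS URI in the question, and port 9083 is the usual Hive metastore Thrift port; verify both for your setup):
from pyspark.sql import SparkSession

# Hypothetical metastore URI: check the host and port for your sandbox.
spark = SparkSession.builder \
    .appName("test") \
    .config("hive.metastore.uris", "thrift://sandbox-hdp.hortonworks.com:9083") \
    .enableHiveSupport() \
    .getOrCreate()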