Best Practice when Writing PySpark DataFrames to MySQL

I am trying to develop a few data pipelines using Apache Airflow with scheduled Spark jobs.

For one of these pipelines, I am trying to write data from a PySpark DataFrame to MySQL and I keep running into a few problems. This is what my code looks like for now, though I do want to add more transformations to it in the future:

df_tsv = spark.read.csv(tsv_file, sep=r'\t', header=True)
df_tsv.write.jdbc(
    url=mysql_url,
    table=mysql_table,
    mode="append",
    properties={"user": mysql_user, "password": mysql_password, "driver": "com.mysql.cj.jdbc.Driver"},
)

This is the exception that keeps getting raised:

java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver

The first thing that I want to know is how I can solve the above issue.

Secondly, I would like to know what the best practice is when writing data from Spark to databases like MySQL. For instance, is there an option to make it so that data from a given column in the DataFrame is stored in a specified column in the table? Or should the column names of the table be the same as those of the DataFrame?

The other option that I can think of here is to convert the DataFrame to, say, a list of tuples and then use something like mysql-connector-python to load the data into the database:

rdd = df.rdd
b = rdd.map(tuple)
data = b.collect()

# write data to database using mysql-connector-python

What is the more efficient option here? Are there any other options that I do not know of?

CodePudding user response:

java.lang.ClassNotFoundException: com.mysql.cj.jdbc.Driver

The first thing that I want to know is how I can solve the above issue.

You need to make the MySQL JDBC driver available to your Spark session when it starts; see https://spark.apache.org/docs/latest/sql-data-sources-jdbc.html.
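For example, one way is to pull the driver by its Maven coordinates when building the session. A minimal sketch (the connector version below is an assumption; use whichever release matches your MySQL server):

from pyspark.sql import SparkSession

# Sketch: fetch the MySQL JDBC driver by its Maven coordinates at session start.
# The version (8.0.28) is an assumption -- match it to your MySQL server.
spark = (
    SparkSession.builder
    .appName("tsv_to_mysql")
    .config("spark.jars.packages", "mysql:mysql-connector-java:8.0.28")
    .getOrCreate()
)

If the job is launched through spark-submit (for instance from Airflow), the equivalent is --packages mysql:mysql-connector-java:8.0.28, or --jars /path/to/mysql-connector.jar if you ship the JAR yourself.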

Secondly, I would like to know what the best practice is when writing data from Spark to databases like MySQL. For instance, is there an option to make it so that data from a given column in the DataFrame is stored in a specified column in the table? Or should the column names of the table be the same as those of the DataFrame?

Yes, the JDBC writer matches DataFrame columns to table columns by name, so the DataFrame's column names (and compatible types) should line up with those of the target table. If they differ, rename or select the columns before writing, as in the sketch below.
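A small sketch of that renaming step (the column names here are hypothetical, purely for illustration):

from pyspark.sql import functions as F

# Sketch: align DataFrame columns with the target table's columns before writing.
# The source and target column names below are hypothetical.
df_out = df_tsv.select(
    F.col("user_id").alias("customer_id"),                       # DataFrame column -> table column
    F.col("amount").cast("decimal(10,2)").alias("order_total"),
)

df_out.write.jdbc(
    url=mysql_url,
    table=mysql_table,
    mode="append",
    properties={"user": mysql_user, "password": mysql_password, "driver": "com.mysql.cj.jdbc.Driver"},
)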

The other option that I can think of here is to convert the DataFrame to, say, a list of tuples and then use something like mysql-connector-python to load the data into the database:

rdd = df.rdd
b = rdd.map(tuple)
data = b.collect()

# write data to database using mysql-connector-python

No, never do this; it defeats the whole purpose of using Spark, which is distributed computation. Collecting the DataFrame pulls every row onto the driver and then writes everything through a single connection, so neither the transformation nor the load scales. Check out the link above; it has good advice on how to read from and write to a JDBC data source.
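Staying with the DataFrame writer, the JDBC data source also exposes write-side options such as numPartitions and batchsize that are worth tuning. A sketch with placeholder values (adjust them to your data volume and what your MySQL instance can handle):

# Sketch: bound the number of parallel connections and batch the inserts.
# The numbers below are placeholders, not recommendations.
(
    df_tsv.write
    .format("jdbc")
    .option("url", mysql_url)
    .option("dbtable", mysql_table)
    .option("user", mysql_user)
    .option("password", mysql_password)
    .option("driver", "com.mysql.cj.jdbc.Driver")
    .option("numPartitions", 8)    # max parallel JDBC connections when writing
    .option("batchsize", 10000)    # rows per batched INSERT
    .mode("append")
    .save()
)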
