How do I read an XML file in PySpark?

Time:09-30

Other people use this code:

 spark.read \
     .format('com.databricks.spark.xml') \
     .option('rootTag', 'tags') \
     .option('rowTag', 'row') \
     .load('example.xml')

I don't want to use Databricks, so I tried this:

df = spark.read.format('xml').options(rowTag='file').load('ted_en-20160408.xml')

but I get this error:

 Py4JJavaError: An error occurred while calling o222.load.
: java.lang.ClassNotFoundException: Failed to find data source: xml. Please find packages at http://spark.apache.org/third-party-projects.html
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:692)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSourceV2(DataSource.scala:746)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:265)
    at org.apache.spark.sql.DataFrameReader.load(DataFrameReader.scala:239)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
    at java.lang.reflect.Method.invoke(Method.java:498)
    at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:244)
    at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:357)
    at py4j.Gateway.invoke(Gateway.java:282)
    at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:132)
    at py4j.commands.CallCommand.execute(CallCommand.java:79)
    at py4j.GatewayConnection.run(GatewayConnection.java:238)
    at java.lang.Thread.run(Thread.java:748)
Caused by: java.lang.ClassNotFoundException: xml.DefaultSource
    at java.net.URLClassLoader.findClass(URLClassLoader.java:382)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:418)
    at java.lang.ClassLoader.loadClass(ClassLoader.java:351)
    at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$5(DataSource.scala:666)
    at scala.util.Try$.apply(Try.scala:213)
    at org.apache.spark.sql.execution.datasources.DataSource$.$anonfun$lookupDataSource$4(DataSource.scala:666)
    at scala.util.Failure.orElse(Try.scala:224)
    at org.apache.spark.sql.execution.datasources.DataSource$.lookupDataSource(DataSource.scala:666)
    ... 14 more

I want to read the XML data and parse it.
My final goal is TF-IDF and SVD.

Java : 1.8.0
Spark : 3.1.2
Scala : 2.12.10
Python : 3.8.5

CodePudding user response:

But I don't want to use Databricks

Okay, then you need to implement your own Spark data source reader for XML, since that isn't a built-in format.

Otherwise, write your parser elsewhere and reformat the data into something Spark can read out of the box. For example, read the complete file as a string, then use Python's lxml or xml.etree modules to build a DataFrame with some schema.
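A minimal sketch of that "parse elsewhere" idea, using only the standard-library xml.etree module. The `<tags><row .../></tags>` layout mirrors the rootTag/rowTag options from the spark-xml snippet above; adjust the tag names to your actual file.

```python
import xml.etree.ElementTree as ET

def xml_to_rows(xml_string, row_tag="row"):
    """Parse an XML string into a list of dicts, one per row element.

    Each row element's attributes become one record.
    """
    root = ET.fromstring(xml_string)
    return [dict(elem.attrib) for elem in root.iter(row_tag)]

# Hypothetical sample data in the rootTag/rowTag layout shown above.
sample = '<tags><row Id="1" Name="spark"/><row Id="2" Name="xml"/></tags>'
rows = xml_to_rows(sample)
# rows == [{'Id': '1', 'Name': 'spark'}, {'Id': '2', 'Name': 'xml'}]

# With a SparkSession in scope, the records become a DataFrame:
# df = spark.createDataFrame(rows)
```

For a large file you would read it once (e.g. `open(path).read()` or `sc.wholeTextFiles`), run the parse on the driver or inside a `flatMap`, and only then hand the plain Python records to Spark.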

CodePudding user response:

You can read the content of the XML file as a string into a Spark DataFrame, then use the Spark SQL xpath family of functions to process it.
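A sketch of that approach. Spark's SQL functions `xpath()` and `xpath_string()` evaluate an XPath expression against a string column, and the same expression can be prototyped locally with xml.etree first. The `<file><content>` layout below is an assumption about the record structure; adjust the XPath to your real schema.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample mimicking the file's structure.
sample = ("<xml>"
          "<file><content>talk one</content></file>"
          "<file><content>talk two</content></file>"
          "</xml>")

# Prototype the XPath locally (ElementTree supports a limited XPath subset).
contents = [e.text for e in ET.fromstring(sample).findall("./file/content")]
# contents == ['talk one', 'talk two']

# On Spark, the equivalent would be (untested sketch, needs a SparkSession):
# raw = spark.read.text("ted_en-20160408.xml", wholetext=True)  # one row, column "value"
# raw.selectExpr("xpath(value, '/xml/file/content/text()') AS contents").show()
```

`wholetext=True` makes `spark.read.text` load the entire file into a single row, which is what the xpath functions need to see the document as one string.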
