How to change dataframe schema based on JSON string?


I have downloaded the AMiner DBLP citation corpus, Version 11. The corpus is a huge text file (12 GB) in which each line is a self-contained JSON string:

'{"id": "100001334", "title": "Ontologies in HYDRA - Middleware for Ambient Intelligent Devices.", "authors": [{"name": "Peter Kostelnik", "id": "2702511795"}, {"name": "Martin Sarnovsky", "id": "2041014688"}, {"name": "Jan Hreno", "id": "2398560122"}], "venue": {"raw": "AMIF"}, "year": 2009, "n_citation": 2, "page_start": "43", "page_end": "46", "doc_type": "", "publisher": "", "volume": "", "issue": "", "fos": [{"name": "Lernaean Hydra", "w": 0.4178039}, {"name": "Database", "w": 0.4269269}, {"name": "World Wide Web", "w": 0.415332377}, {"name": "Ontology (information science)", "w": 0.459045082}, {"name": "Computer science", "w": 0.399807781}, {"name": "Middleware", "w": 0.5905041}, {"name": "Ambient intelligence", "w": 0.5440575}]}'

All JSON strings are newline-separated.
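
Before involving Spark, the one-object-per-line layout can be confirmed with plain Python (a minimal sketch; path_to_data stands for the path to the corpus file used in the PySpark code below):

import json

# Parse just the first line to confirm the file is valid JSON Lines.
with open(path_to_data) as f:
    first_record = json.loads(next(f))
print(first_record["title"], len(first_record["authors"]))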

When I open the file using PySpark, it returns a dataframe with one column containing JSON strings:

df = spark.read.text(path_to_data)
df.show()
+--------------------+
|               value|
+--------------------+
|{"id": "100001334...|
|{"id": "100001888...|
|{"id": "100002270...|
|{"id": "100004108...|
|{"id": "10000571"...|
|{"id": "100007563...|
|{"id": "100008278...|
|{"id": "100008490...|
+--------------------+

I need to access JSON fields to build my deep learning model.

My first attempt was to open the file with the JSON reader, as mentioned in this question:

df = spark.read.option("wholeFile", True).option("mode", "PERMISSIVE").json(path_to_data)

But all the proposed solutions took ages to run (more than 3 hours) with no results to show.
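
Since every record sits on its own line, the file is effectively in JSON Lines format, which Spark's JSON reader handles by default without the wholeFile/multiline option. A minimal sketch of that variant (not one of the original attempts; schema inference still scans the whole 12 GB, so it is not instant either):

# Read as JSON Lines: one JSON object per line, no multiline parsing needed.
df = spark.read.json(path_to_data)
df.printSchema()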

My second attempt was to parse the JSON strings with an explicit schema, to get a dataframe with proper columns:

from pyspark.sql.types import (StructType, StructField, StringType, IntegerType,
                               ArrayType, MapType)
from pyspark.sql.functions import from_json, col

df = spark.read.text(path_to_data)

schema = StructType([
    StructField("id", StringType()), StructField("title", StringType()),
    StructField("authors", ArrayType(MapType(StringType(), StringType()))),
    StructField("venue", MapType(StringType(), StringType()), True), StructField("year", IntegerType(), True),
    StructField("keywords", ArrayType(StringType()), True), StructField("references", ArrayType(StringType()), True),
    StructField("n_citation", IntegerType(), True), StructField("page_start", StringType(), True),
    StructField("page_end", StringType(), True), StructField("doc_type", StringType(), True),
    StructField("lang", StringType(), True), StructField("publisher", StringType(), True),
    StructField("volume", StringType(), True), StructField("issue", StringType(), True),
    StructField("issn", StringType(), True), StructField("isbn", StringType(), True),
    StructField("doi", StringType(), True), StructField("pdf", StringType(), True),
    StructField("url", ArrayType(StringType()), True), StructField("abstract", StringType(), True),
    StructField("indexed_abstract", StringType(), True)])

datajson = df.withColumn("jsonData", from_json(col("value"), schema)).select("jsonData.*")

But it returned the PySpark exception "cannot resolve column due to data type mismatch", even though the data type of each field in the schema is correct (based on the official website of the corpus here).

My third attempt was to parse each JSON string into a map data type:

casted = df.withColumn("value", from_json(df.value, MapType(StringType(),StringType())))

It gave me the following result:

root
 |-- value: map (nullable = true)
 |    |-- key: string
 |    |-- value: string (valueContainsNull = true)

+--------------------+
|               value|
+--------------------+
|{id -> 100001334,...|
|{id -> 1000018889...|
|{id -> 1000022707...|
|{id -> 100004108,...|
|{id -> 10000571, ...|
|{id -> 100007563,...|
|{id -> 100008278,...|
+--------------------+

Now, each row holds the parsed JSON object as a map, and its fields can be accessed as follows:

row = casted.first()
row.value['id']
row.value['title']
row.value['authors']
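
For illustration, the same per-key lookups can also be promoted to dataframe columns with getItem, although with this MapType schema every value, including nested ones such as authors, is exposed as a plain string (a sketch, assuming the casted dataframe above):

from pyspark.sql.functions import col

# Turn selected map keys into top-level columns; values stay as plain strings.
flat = casted.select(
    col("value").getItem("id").alias("id"),
    col("value").getItem("title").alias("title"),
    col("value").getItem("authors").alias("authors"))
flat.show(truncate=False)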

Now, my question is: how can I convert this dataframe with a single column named 'value' into a dataframe whose columns are the fields mentioned above (id, title, authors, etc.), based on the JSON objects?

CodePudding user response:

Reading the file without providing a schema takes much longer. I tried splitting the huge file into smaller chunks to infer the schema, and it failed with "Found duplicate column(s) in the data schema".

I tried the approach below on the same dataset, with the schema provided explicitly, and it worked.

from pyspark.sql.types import (StructType, StructField, StringType, IntegerType,
                               ArrayType, MapType)

schema = StructType([
    StructField("id", StringType()), StructField("title", StringType()),
    StructField("authors", ArrayType(MapType(StringType(), StringType()))),
    StructField("venue", MapType(StringType(), StringType()), True), StructField("year", IntegerType(), True),
    StructField("keywords", ArrayType(StringType()), True), StructField("references", ArrayType(StringType()), True),
    StructField("n_citation", IntegerType(), True), StructField("page_start", StringType(), True),
    StructField("page_end", StringType(), True), StructField("doc_type", StringType(), True),
    StructField("lang", StringType(), True), StructField("publisher", StringType(), True),
    StructField("volume", StringType(), True), StructField("issue", StringType(), True),
    StructField("issn", StringType(), True), StructField("isbn", StringType(), True),
    StructField("doi", StringType(), True), StructField("pdf", StringType(), True),
    StructField("url", ArrayType(StringType()), True), StructField("abstract", StringType(), True),
    StructField("indexed_abstract", StringType(), True)])

df = spark.read.option("wholeFile", True).option("mode", "PERMISSIVE").schema(schema).json("dblp_papers_v11.txt")

df.show()

Output:


+----------+--------------------+--------------------+--------------------+----+--------+--------------------+----------+----------+--------+----------+----+--------------------+------+-----+----+----+--------------------+----+----+--------+--------------------+
|        id|               title|             authors|               venue|year|keywords|          references|n_citation|page_start|page_end|  doc_type|lang|           publisher|volume|issue|issn|isbn|                 doi| pdf| url|abstract|    indexed_abstract|
+----------+--------------------+--------------------+--------------------+----+--------+--------------------+----------+----------+--------+----------+----+--------------------+------+-----+----+----+--------------------+----+----+--------+--------------------+
| 100001334|Ontologies in HYD...|[{name -> Peter K...|       {raw -> AMIF}|2009|    null|                null|         2|        43|      46|          |null|                    |      |     |null|null|                null|null|null|    null|                null|
|1000018889|Remote Policy Enf...|[{name -> Fabio M...|{raw -> internati...|2013|    null|[94181602, 150466...|         2|        70|      84|Conference|null|      Springer, Cham|      |     |null|null|10.1007/978-3-319...|null|null|    null|{"IndexLength":17...|
|1000022707|A SIMPLE OBSERVAT...|[{name -> Jerzy M...|{raw -> Reports o...|2009|    null|[1972178849, 2069...|         0|        19|      29|   Journal|null|                    |    44|     |null|null|                null|null|null|    null|{"IndexLength":49...|
| 100004108|Gait based human ...|[{name -> Emdad H...|{raw -> internati...|2012|    null|[1578000111, 2120...|         0|       319|     328|Conference|null|Springer, Berlin,...|      |     |null|null|10.1007/978-3-642...|null|null|    null|{"IndexLength":82...|
|  10000571|The GAME Algorith...|[{name -> Pavel K...|{raw -> internati...|2008|    null|[291899685, 19641...|         5|       859|     868|Conference|null|Springer, Berlin,...|      |     |null|null|10.1007/978-3-540...|null|null|    null|{"IndexLength":17...|
| 100007563|Formal Verificati...|[{name -> George ...|{raw -> Software ...|2006|    null|[1578963809, 1612...|         1|       650|     656|   Journal|null|                    |      |     |null|null|                null|null|null|    null|{"IndexLength":87...|
| 100008278|EMOTIONAL AND RAT...|[{name -> Colin G...|{raw -> internati...|2010|    null|[116282327, 14967...|         2|       238|        |Conference|null|                    |      |     |null|null|                null|null|null|    null|{"IndexLength":12...|
| 100008490|Principle-Based P...|[{name -> Sandiwa...|{raw -> Natural L...|1991|    null|                null|         3|        43|      60|          |null|                    |      |     |null|null|                null|null|null|    null|                null|
+----------+--------------------+--------------------+--------------------+----+--------+--------------------+----------+----------+--------+----------+----+--------------------+------+-----+----+----+--------------------+----+----+--------+--------------------+
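
Once the schema is applied, the nested columns can be queried with the usual PySpark functions. A short usage sketch, assuming the df produced above (the explode call and the author["name"] map lookup are illustrative, not part of the original answer):

from pyspark.sql.functions import col, explode

# One row per (paper, author); each author entry is a map with "name" and "id" keys.
authors_df = df.select("id", "title", explode("authors").alias("author"))
authors_df.select("id", "title", col("author")["name"].alias("author_name")).show(5, truncate=False)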
