I have 3 string values and 4 lists that I'm trying to combine into a dataframe.
Lists -
>>> exp_type
[u'expect_column_to_exist', u'expect_column_values_to_not_be_null', u'expect_column_values_to_be_of_type', u'expect_column_to_exist', u'expect_table_row_count_to_equal', u'expect_column_sum_to_be_between']
>>> expectations
[{u'column': u'country'}, {u'column': u'country'}, {u'column': u'country', u'type_': u'StringType'}, {u'column': u'countray'}, {u'value': 10}, {u'column': u'active_stores', u'max_value': 1000, u'min_value': 100}]
>>> results
[{}, {u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0, u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102}, {u'observed_value': u'StringType'}, {}, {u'observed_value': 102}, {u'observed_value': 22075.0}]
>>> this_exp_success
[True, True, True, False, False, False]
Strings -
>>> is_overall_success
'False'
>>> data_source
'stores'
>>> run_time
'2022-02-24T05:43:16.678220 00:00'
I'm trying to create the dataframe as below.
columns = ['data_source', 'run_time', 'exp_type', 'expectations', 'results', 'this_exp_success', 'is_overall_success']
dataframe = spark.createDataFrame(zip(data_source, run_time, exp_type, expectations, results, this_exp_success, is_overall_success), columns)
Error -
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/lib/spark/python/pyspark/sql/session.py", line 748, in createDataFrame
rdd, schema = self._createFromLocal(map(prepare, data), schema)
File "/usr/lib/spark/python/pyspark/sql/session.py", line 416, in _createFromLocal
struct = self._inferSchemaFromList(data, names=schema)
File "/usr/lib/spark/python/pyspark/sql/session.py", line 348, in _inferSchemaFromList
schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1101, in _merge_type
for f in a.fields]
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1114, in _merge_type
_merge_type(a.valueType, b.valueType, name='value of map %s' % name),
File "/usr/lib/spark/python/pyspark/sql/types.py", line 1094, in _merge_type
raise TypeError(new_msg("Can not merge type %s and %s" % (type(a), type(b))))
TypeError: value of map field results: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StringType'>
Expected Output
--------------- -------------------------------- -------------------------------------- --------------------------------------------------------------------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------- ------------------
| data_source | run_time |exp_type |expectations |results |this_exp_success|is_overall_success|
--------------- -------------------------------- -------------------------------------- --------------------------------------------------------------------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------- ------------------
|stores |2022-02-24T05:43:16.678220 00:00|expect_column_to_exist |{u'column': u'country'} |{} |True |False |
|stores |2022-02-24T05:43:16.678220 00:00|expect_column_values_to_not_be_null |{u'column': u'country'} |{u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0, u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102} |True |False |
|stores |2022-02-24T05:43:16.678220 00:00|expect_column_values_to_be_of_type |{u'column': u'country', u'type_': u'StringType'} |{u'observed_value': u'StringType'} |True |False |
|stores |2022-02-24T05:43:16.678220 00:00|expect_column_to_exist |{u'column': u'countray'} |{} |False |False |
|stores |2022-02-24T05:43:16.678220 00:00|expect_table_row_count_to_equal |{u'value': 10} |{u'observed_value': 102} |False |False |
|stores |2022-02-24T05:43:16.678220 00:00|expect_column_sum_to_be_between |{u'column': u'active_stores', u'max_value': 1000, u'min_value': 100} |{u'observed_value': 22075.0} |False |False |
--------------- -------------------------------- -------------------------------------- --------------------------------------------------------------------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------- ------------------
CodePudding user response:
The problem is twofold: zip iterates the scalar strings (data_source, run_time, is_overall_success) character by character, and the dicts in results mix value types (ints, floats, strings, lists, None), so Spark cannot infer a single map type for that column.
The fix is to zip only the row-wise lists, give the awkward columns an explicit schema, and append the per-run strings as Spark literals:
from pyspark.sql.functions import lit
from pyspark.sql.types import StructType, StructField, StringType, MapType

exp_type = [u'expect_column_to_exist', u'expect_column_values_to_not_be_null',
            u'expect_column_values_to_be_of_type', u'expect_column_to_exist', u'expect_table_row_count_to_equal',
            u'expect_column_sum_to_be_between']
expectations = [{u'column': u'country'}, {u'column': u'country'}, {u'column': u'country', u'type_': u'StringType'},
                {u'column': u'countray'}, {u'value': 10},
                {u'column': u'active_stores', u'max_value': 1000, u'min_value': 100}]
results = [{}, {u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0,
            u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102},
           {u'observed_value': u'StringType'}, {}, {u'observed_value': 102}, {u'observed_value': 22075.0}]
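# 'results' is declared below as a plain StringType column: its dict values mix ints,
# floats, strings, None and lists, so Spark cannot infer one MapType value type for it.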
df_schema = StructType([StructField("exp_type", StringType(), True),
                        StructField("expectations", MapType(StringType(), StringType()), True),
                        StructField("results", StringType(), True),
                        StructField("this_exp_success", StringType(), True)
                        ])
this_exp_success = [True, True, True, False, False, False]
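# Zip only the row-wise lists into 4-column rows; the per-run constants are attached
# afterwards as literal columns, so they repeat on every row.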
result = spark.createDataFrame(
    zip(exp_type, expectations, results, this_exp_success), df_schema)\
    .withColumn('is_overall_success', lit('False'))\
    .withColumn('data_source', lit('stores'))\
    .withColumn('run_time', lit('2022-02-24T05:43:16.678220 00:00'))

result.show(truncate=False)
And the result shows:
----------------------------------- -------------------------------------------------------------- ------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------- ------------------ ----------- --------------------------------
|exp_type |expectations |results |this_exp_success|is_overall_success|data_source|run_time |
----------------------------------- -------------------------------------------------------------- ------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------- ------------------ ----------- --------------------------------
|expect_column_to_exist |[column -> country] |{} |true |False |stores |2022-02-24T05:43:16.678220 00:00|
|expect_column_values_to_not_be_null|[column -> country] |{partial_unexpected_list=[], partial_unexpected_index_list=null, unexpected_count=0, element_count=102, unexpected_percent=0.0, partial_unexpected_counts=[]}|true |False |stores |2022-02-24T05:43:16.678220 00:00|
|expect_column_values_to_be_of_type |[column -> country, type_ -> StringType] |{observed_value=StringType} |true |False |stores |2022-02-24T05:43:16.678220 00:00|
|expect_column_to_exist |[column -> countray] |{} |false |False |stores |2022-02-24T05:43:16.678220 00:00|
|expect_table_row_count_to_equal |[value -> 10] |{observed_value=102} |false |False |stores |2022-02-24T05:43:16.678220 00:00|
|expect_column_sum_to_be_between |[column -> active_stores, min_value -> 100, max_value -> 1000]|{observed_value=22075.0} |false |False |stores |2022-02-24T05:43:16.678220 00:00|
----------------------------------- -------------------------------------------------------------- ------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------- ------------------ ----------- --------------------------------
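As a variation (not part of the answer above; results_json and result_json_df are just illustrative names): if you need to query the results column later, you could serialize each dict to JSON before building the DataFrame, so the string column holds parseable JSON rather than a map's toString() rendering. A minimal sketch, reusing the same df_schema:

import json

from pyspark.sql.functions import lit

# Serialize each result dict to a JSON string, e.g. '{"observed_value": 102}'.
results_json = [json.dumps(r) for r in results]

# Same zip-plus-literals approach as above; only the results column changes.
result_json_df = spark.createDataFrame(
    zip(exp_type, expectations, results_json, this_exp_success), df_schema)\
    .withColumn('is_overall_success', lit('False'))\
    .withColumn('data_source', lit('stores'))\
    .withColumn('run_time', lit('2022-02-24T05:43:16.678220 00:00'))

result_json_df.show(truncate=False)

The JSON string can later be parsed back into a struct with from_json once you settle on a fixed schema for the results field.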