Home > Software engineering >  Creating a dataframe from Lists and string values in pyspark
Creating a dataframe from Lists and string values in pyspark

Time:03-03

I have 3 string values and 4 lists that Im trying to create a dataframe.

Lists -

>>> exp_type
[u'expect_column_to_exist', u'expect_column_values_to_not_be_null', u'expect_column_values_to_be_of_type', u'expect_column_to_exist', u'expect_table_row_count_to_equal', u'expect_column_sum_to_be_between']

>>> expectations
[{u'column': u'country'}, {u'column': u'country'}, {u'column': u'country', u'type_': u'StringType'}, {u'column': u'countray'}, {u'value': 10}, {u'column': u'active_stores', u'max_value': 1000, u'min_value': 100}]

>>> results
[{}, {u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0, u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102}, {u'observed_value': u'StringType'}, {}, {u'observed_value': 102}, {u'observed_value': 22075.0}]

>>> this_exp_success
[True, True, True, False, False, False]

Strings -

>>> is_overall_success
'False'

>>> data_source
'stores'
>>> 
>>> run_time
'2022-02-24T05:43:16.678220 00:00'

Trying to create a dataframe as below.

columns = ['data_source', 'run_time', 'exp_type', 'expectations', 'results', 'this_exp_success', 'is_overall_success']

dataframe = spark.createDataFrame(zip(data_source, run_time, exp_type, expectations, results, this_exp_success, is_overall_success), columns)

Error -

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 748, in createDataFrame
    rdd, schema = self._createFromLocal(map(prepare, data), schema)
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 416, in _createFromLocal
    struct = self._inferSchemaFromList(data, names=schema)
  File "/usr/lib/spark/python/pyspark/sql/session.py", line 348, in _inferSchemaFromList
    schema = reduce(_merge_type, (_infer_schema(row, names) for row in data))
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 1101, in _merge_type
    for f in a.fields]
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 1114, in _merge_type
    _merge_type(a.valueType, b.valueType, name='value of map %s' % name),
  File "/usr/lib/spark/python/pyspark/sql/types.py", line 1094, in _merge_type
    raise TypeError(new_msg("Can not merge type %s and %s" % (type(a), type(b))))
TypeError: value of map field results: Can not merge type <class 'pyspark.sql.types.LongType'> and <class 'pyspark.sql.types.StringType'>

Expected Output


 --------------- -------------------------------- -------------------------------------- --------------------------------------------------------------------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------- ------------------                                                                     
| data_source   |  run_time                      |exp_type                              |expectations                                                               |results                                                                                                                                                                                |this_exp_success|is_overall_success|
 --------------- -------------------------------- -------------------------------------- --------------------------------------------------------------------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------- ------------------ 
|stores         |2022-02-24T05:43:16.678220 00:00|expect_column_to_exist                |{u'column': u'country'}                                                    |{}                                                                                                                                                                                     |True            |False             |
|stores         |2022-02-24T05:43:16.678220 00:00|expect_column_values_to_not_be_null   |{u'column': u'country'}                                                    |{u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0, u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102}  |True            |False             |
|stores         |2022-02-24T05:43:16.678220 00:00|expect_column_values_to_be_of_type    |{u'column': u'country', u'type_': u'StringType'}                           |{u'observed_value': u'StringType'}                                                                                                                                                     |True            |False             |
|stores         |2022-02-24T05:43:16.678220 00:00|expect_column_to_exist                |{u'column': u'countray'}                                                   |{}                                                                                                                                                                                     |False           |False             |
|stores         |2022-02-24T05:43:16.678220 00:00|expect_table_row_count_to_equal       |{u'value': 10}                                                             |{u'observed_value': 102}                                                                                                                                                               |False           |False             |
|stores         |2022-02-24T05:43:16.678220 00:00|expect_column_sum_to_be_between       |{u'column': u'active_stores', u'max_value': 1000, u'min_value': 100}       |{u'observed_value': 22075.0}                                                                                                                                                           |False           |False             |
 --------------- -------------------------------- -------------------------------------- --------------------------------------------------------------------------- --------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------- ------------------ 

CodePudding user response:

A solution to your problem would be the following:

Zip the lists and append the strings as spark literals.

exp_type = [u'expect_column_to_exist', u'expect_column_values_to_not_be_null',
            u'expect_column_values_to_be_of_type', u'expect_column_to_exist', u'expect_table_row_count_to_equal',
            u'expect_column_sum_to_be_between']

expectations = [{u'column': u'country'}, {u'column': u'country'}, {u'column': u'country', u'type_': u'StringType'},
                {u'column': u'countray'}, {u'value': 10},
                {u'column': u'active_stores', u'max_value': 1000, u'min_value': 100}]

results = [{}, {u'partial_unexpected_index_list': None, u'unexpected_count': 0, u'unexpected_percent': 0.0,
                u'partial_unexpected_list': [], u'partial_unexpected_counts': [], u'element_count': 102},
           {u'observed_value': u'StringType'}, {}, {u'observed_value': 102}, {u'observed_value': 22075.0}]

df_schema = StructType([StructField("exp_type", StringType(), True),
                        StructField("expectations", MapType(StringType(), StringType()), True),
                        StructField("results", StringType(), True),
                        StructField("this_exp_success", StringType(), True)
                        ])

this_exp_success = [True, True, True, False, False, False]

result = spark.createDataFrame(
    zip(exp_type, expectations, results, this_exp_success), df_schema)\
    .withColumn('is_overall_success', lit('False'))\
    .withColumn('data_source', lit('stores'))\
    .withColumn('run_time', lit('2022-02-24T05:43:16.678220 00:00'))

result.show(truncate=False)

And the result is showing:

 ----------------------------------- -------------------------------------------------------------- ------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------- ------------------ ----------- -------------------------------- 
|exp_type                           |expectations                                                  |results                                                                                                                                                      |this_exp_success|is_overall_success|data_source|run_time                        |
 ----------------------------------- -------------------------------------------------------------- ------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------- ------------------ ----------- -------------------------------- 
|expect_column_to_exist             |[column -> country]                                           |{}                                                                                                                                                           |true            |False             |stores     |2022-02-24T05:43:16.678220 00:00|
|expect_column_values_to_not_be_null|[column -> country]                                           |{partial_unexpected_list=[], partial_unexpected_index_list=null, unexpected_count=0, element_count=102, unexpected_percent=0.0, partial_unexpected_counts=[]}|true            |False             |stores     |2022-02-24T05:43:16.678220 00:00|
|expect_column_values_to_be_of_type |[column -> country, type_ -> StringType]                      |{observed_value=StringType}                                                                                                                                  |true            |False             |stores     |2022-02-24T05:43:16.678220 00:00|
|expect_column_to_exist             |[column -> countray]                                          |{}                                                                                                                                                           |false           |False             |stores     |2022-02-24T05:43:16.678220 00:00|
|expect_table_row_count_to_equal    |[value -> 10]                                                 |{observed_value=102}                                                                                                                                         |false           |False             |stores     |2022-02-24T05:43:16.678220 00:00|
|expect_column_sum_to_be_between    |[column -> active_stores, min_value -> 100, max_value -> 1000]|{observed_value=22075.0}                                                                                                                                     |false           |False             |stores     |2022-02-24T05:43:16.678220 00:00|
 ----------------------------------- -------------------------------------------------------------- ------------------------------------------------------------------------------------------------------------------------------------------------------------- ---------------- ------------------ ----------- -------------------------------- 

  • Related