Home > other >  Convert PySpark data types into dictionary
Convert PySpark data types into dictionary

Time:06-11

I have around 20 PySpark tables that I need to convert to Pandas (to apply some variable statistics functions that are built using Pandas).

Now I have the problem that all the decimal(x,y) data types are converted to object data type in Pandas. That's why I wanted to automatically convert them to the correct Pandas dtype.

Basically I have the translation PySpark<=>Pandas dtype that I want. And I know how to retrieve PySpark dtypes. But I don't know how I can convert the PySpark dtypes into usable dictionary.

This is my code that would work if I had PySpark dtypes in dictionary format - with example "dict" of what I want to achieve:

dict = {'Var1': 'TimestampType', 'Var2': 'TimestampType', 'Var3': 'LongType', 'Var4': 'StringType', 'Var5': 'DecimalType(38,18)'}
dict2 = {'ByteType':'np.int8','ShortType':'np.int16','IntegerType':'np.int32','LongType':'np.int64','FloatType':'np.float32','DoubleType':'np.float64','DecimalType':'np.float64','BooleanType':'np.bool','TimestampType':'np.datetime64','TimestampNTZType':'np.datetime64','DayTimeIntervalType':'np.timedelta64','StringType':'object'}

for col in dict:
    dict[col] = eval(dict2[(''.join(dict[col].partition('Type')[:2]))]) # Explanation of dict2 value: Removes everything after Type => needed for DecimalType(x,y) where x & y can vary.
    
print(dict)

I would then convert using "asType" the Pandas formats of each column of each table:

pd_df = df.toPandas()
pd_df = pd_df.astype(dict)

How do I get the result from print(df.schema.fields) into the structure of "dict" (see above)? This is an example of print(df.schema.fields):

[StructField(Var1,TimestampType,true),
StructField(Var2,TimestampType,true),
StructField(Var3,LongType,true),
StructField(Var4,StringType,true),
StructField(Var5,DecimalType(38,18),true)]

Thanks :) And of course, I'm always open to generally better solutions for the task I want to achieve.

CodePudding user response:

This doesn't include () after Type:

import re
matches = re.findall(r'(?:Field\()(. ?),(. ?Type)', str(df.schema.fields))
dict = {x[0]: x[1] for x in matches}

This includes () after Type:

import re
matches = re.findall(r'(?:Field\()(. ?),(. ?)(?:,t|f)', str(df.schema.fields))
dict = {x[0]: x[1] for x in matches}
  • Related