How to assign array in a dataframe to a variable


I need to fetch the array field from my dataframe and assign it to a variable for further processing. I am using the collect() function, but it's not working properly.

Input dataframe:

Department   Language
[A, B, C]    English
[]           Spanish

How can I fetch those values and assign them to variables like below:

English = [A,B,C]

Spanish = []

CodePudding user response:

The simplest solution I came up with is extracting the data with collect and explicitly assigning it to predefined variables, like so:

from pyspark.sql.types import StringType, ArrayType, StructType, StructField

schema = StructType([
    StructField("Department", ArrayType(StringType()), True),
    StructField("Language", StringType(), True)
  ])

df = spark.createDataFrame([(["A", "B", "C"], "English"), ([], "Spanish")], schema)

rows = df.collect()  # collect once so Spark only runs one job
English = rows[0]["Department"]
Spanish = rows[1]["Department"]
print(f"English: {English}, Spanish: {Spanish}")

# English: ['A', 'B', 'C'], Spanish: []
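Hard-coding the row indices breaks if the row order changes. Since each Row returned by collect() supports item access by column name, you can instead build a lookup dict keyed by Language. A minimal sketch, using plain dicts to stand in for the collected Row objects:

```python
# Stand-ins for the Row objects returned by df.collect();
# pyspark.sql.Row supports the same row["column"] access.
rows = [
    {"Department": ["A", "B", "C"], "Language": "English"},
    {"Department": [], "Language": "Spanish"},
]

# Build a Language -> Department lookup instead of relying on row order
departments = {row["Language"]: row["Department"] for row in rows}

print(departments["English"])  # ['A', 'B', 'C']
print(departments["Spanish"])  # []
```

With real PySpark you would write `departments = {row["Language"]: row["Department"] for row in df.collect()}`; the dict approach avoids silently picking up the wrong row if the dataframe's ordering ever changes.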

CodePudding user response:

EDIT: I completely brain-farted and missed that this was a PySpark question.

The code below might still be helpful if you convert your PySpark DataFrame to pandas, which for your situation might not be as ridiculous as it sounds: if the table is too big to fit in a pandas DataFrame, then it's too big to store all of the arrays in variables anyway. You can probably use .filter() and .select() to shrink it first.

Old Answer:


The best way to approach this really depends on the complexity of your dataframe. Here are two ways:

# To recreate your dataframe
import pandas as pd

df = pd.DataFrame({
    'Department': [['A', 'B', 'C'], []],
    'Language': ['English', 'Spanish']
})

df.loc[df.Language == 'English']
# Will return all rows where Language is English.  If you only want Department then:

df.loc[df.Language == 'English'].Department
# This will return a Series containing your list. If you are always expecting a single match, add .iloc[0] as in:

df.loc[df.Language == 'English'].Department.iloc[0]
# Which will return only your list
# The alternate method below isn't great but might be preferable in some circumstances, and it also only works if you expect a single match for any query.

department_lookup = df[['Language', 'Department']].set_index('Language').to_dict()['Department']

department_lookup['English']
#returns your list

# This builds a dictionary where 'Language' is the key and 'Department' is the value. It is more work to set up and only works for a two-column relationship, but you might prefer working with dictionaries depending on the use case.
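One caveat with the dictionary approach: to_dict(), like Python's own dict constructor, keeps only the last value for a repeated key, so duplicate Language rows are silently collapsed. A plain-Python sketch of that behaviour:

```python
# dict() keeps the last value when a key repeats, which is what
# set_index('Language').to_dict()['Department'] does with duplicate languages.
pairs = [
    ("English", ["A", "B", "C"]),
    ("Spanish", []),
    ("English", ["X"]),  # duplicate key: overwrites the first English entry
]
lookup = dict(pairs)
print(lookup["English"])  # ['X']
print(lookup["Spanish"])  # []
```

If duplicates are possible in your data, deduplicate or aggregate (e.g. with groupby) before building the lookup.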

If you're having datatype issues, it may relate to how the DataFrame is being loaded rather than how you're accessing it. Pandas loves to convert lists to strings.


# If I save and reload the df like so:
df.to_csv("the_df.csv", index=False)
df = pd.read_csv("the_df.csv")

# Then the Department values have become strings, i.e. "['A', 'B', 'C']" rather than ['A', 'B', 'C']

# We can typically correct this by giving pandas a method for converting the incoming string to a list. This is done with the 'converters' argument, which takes a dictionary where the keys are column names and the values are functions, as such:

df = pd.read_csv("the_df.csv", converters={"Department": lambda x: x.strip("[]").split(", ")})

# df['Department'] now holds lists again (the column dtype itself remains object)

It's important to note that the lambda function is only reliable if pandas converted a Python list into a string when storing the DataFrame. Converting a list string into a list has been addressed here
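One wrinkle worth knowing: when the stored string is the repr of a Python list, the strip/split lambda leaves the inner quote characters in place. A more robust converter, assuming each cell is a valid Python literal, is ast.literal_eval from the standard library:

```python
import ast

# A CSV round-trip stores the list column as its repr string
cell = "['A', 'B', 'C']"

# The strip/split approach keeps the quote characters:
print(cell.strip("[]").split(", "))  # ["'A'", "'B'", "'C'"]

# ast.literal_eval parses the string back into a real list:
print(ast.literal_eval(cell))  # ['A', 'B', 'C']
```

As a read_csv converter this would be `pd.read_csv("the_df.csv", converters={"Department": ast.literal_eval})`; unlike eval, literal_eval only accepts Python literals, so it won't execute arbitrary code from the file.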
