How to sort a list of unbound PySpark columns by name?


This seems like it should be pretty simple, but I'm stumped for some reason. I have a list of PySpark columns that I would like to sort by name (including aliasing, as that is how they will be displayed/written to disk). Here are some example tests and things I've tried:

def test_col_sorting():
    from pyspark.sql import SparkSession
    import pyspark.sql.functions as f

    # Active spark context needed
    spark = SparkSession.builder.getOrCreate()

    # Data to sort
    cols = [f.col('c'), f.col('a'), f.col('b').alias('z')]

    # Attempt 1
    result = sorted(cols)
    # This fails with ValueError: Cannot convert column into bool: please use '&' for 'and', '|' for 'or', '~' for 'not' when building DataFrame boolean expressions.

    # Attempt 2
    result = sorted(cols, key=lambda x: x.name())
    # Fails for the same reason: `name()` is an alias for `alias()` and returns a Column object, not a string

    # Assertion I want to hold true:
    assert result == [f.col('a'), f.col('c'), f.col('b').alias('z')]

Is there any reasonable way to get back the string that was used to initialize the Column object (while also respecting aliasing)? If I could get this from the object, I could use it as a sort key.

Note that I am NOT looking to sort the columns of a DataFrame, as answered in this question: Python/pyspark data frame rearrange columns. These Column objects are not bound to any DataFrame. I also do not want to sort the columns based on their values.

CodePudding user response:

Answering my own question: it seems that you can't do this without some amount of parsing from the column string representation. You also don't need regex to handle this. These two methods should take care of it:

from typing import List

from pyspark.sql import Column


def get_column_name(col: Column) -> str:
    """
    PySpark doesn't allow you to directly access the column name with respect to aliases
    from an unbound column. We have to parse this out from the string representation.

    This works on columns with one or more aliases as well as unaliased columns.

    Returns:
        Col name as str, with respect to aliasing
    """
    # Strip the "Column<'" prefix and "'>" suffix from the string representation.
    # Note: use removeprefix/removesuffix rather than lstrip/rstrip, which strip
    # *characters* and would also eat leading letters of the column name itself.
    c = str(col).removeprefix("Column<'").removesuffix("'>")
    return c.split(' AS ')[-1]


def sorted_columns(cols: List[Column]) -> List[Column]:
    """
    Returns sorted list of columns, with respect to aliases
    Args:
        cols: List of PySpark Columns (e.g. [f.col('a'), f.col('b').alias('c'), ...])

    Returns:
        Sorted list of PySpark Columns by name, with respect to aliasing
    """
    return sorted(cols, key=lambda x: get_column_name(x))
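
For example, applied to the columns from the question (a quick sketch, assuming an active Spark session and the two helpers above):

import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Column objects need a running session

cols = [f.col('c'), f.col('a'), f.col('b').alias('z')]

print([get_column_name(c) for c in cols])
# ['c', 'a', 'z']

print([str(c) for c in sorted_columns(cols)])
# ["Column<'a'>", "Column<'c'>", "Column<'b AS z'>"]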

Some tests to validate behavior:

import pytest
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

@pytest.fixture(scope="session")
def spark() -> SparkSession:
    # Provide a session-scoped SparkSession fixture for all tests
    yield SparkSession.builder.getOrCreate()

def test_get_col_name(spark: SparkSession):
    col = f.col('a')
    actual = get_column_name(col)
    assert actual == 'a'


def test_get_col_name_alias(spark: SparkSession):
    col = f.col('a').alias('b')
    actual = get_column_name(col)
    assert actual == 'b'


def test_get_col_name_multiple_alias(spark: SparkSession):
    col = f.col('a').alias('b').alias('c')
    actual = get_column_name(col)
    assert actual == 'c'


def test_sorted_columns(spark: SparkSession):
    cols = [f.col('z').alias('c'), f.col('a'), f.col('d').alias('e').alias('f'), f.col('b')]
    actual = sorted_columns(cols)
    expected = [f.col('a'), f.col('b'), f.col('z').alias('c'), f.col('d').alias('e').alias('f')]

    # We can't directly compare lists of cols, so we zip and check the repr of each element
    for a, b in zip(actual, expected):
        assert str(a) == str(b)

I think it's fair to say that being unable to access this information directly is a shortcoming of the PySpark API. There are plenty of valid reasons to want to know what name an unbound Column would resolve to, and you shouldn't have to parse it out of the string representation in such a hacky way.

CodePudding user response:

If you're only interested in grabbing the column names and sorting those (without any relation to any data), you can use the column object's __repr__ method and a regex to extract the actual name of your column.

So for these columns

import pyspark.sql.functions as f
cols = [f.col('c'), f.col('a'), f.col('b').alias('z')]

You could do this:

import re

# Making a list of the string representations of our columns
col_repr = [x.__repr__() for x in cols]
# ["Column<'c'>", "Column<'a'>", "Column<'b AS z'>"]

# Using regex to extract the interesting part of the column name
# while making sure we're properly grabbing the alias name. Notice
# that we're grabbing the right part of the column name in `b AS z`
col_names = [re.search(r"([a-zA-Z]+)'>", x).group(1) for x in col_repr]
# ['c', 'a', 'z']

# Sorting this array
sorted_col_names = sorted(col_names)
# ['a', 'c', 'z']

NOTE: This example is simple (it only accepts lowercase and uppercase letters in column names), but as your column names get more complex, it's just a matter of adapting the regex pattern.
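
If you also want to sort the Column objects themselves (rather than just their names), the same regex can serve as a sort key. A rough sketch, assuming simple alphabetic column names as above and an active Spark session:

import re
import pyspark.sql.functions as f
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()  # Column objects need a running session

cols = [f.col('c'), f.col('a'), f.col('b').alias('z')]

# Sort the Column objects by the (aliased) name extracted from their repr
sorted_cols = sorted(cols, key=lambda c: re.search(r"([a-zA-Z]+)'>", repr(c)).group(1))

print([str(c) for c in sorted_cols])
# ["Column<'a'>", "Column<'c'>", "Column<'b AS z'>"]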
