How to order columns in pyspark in a specific sequence based on a list?

Time:02-26

I have a dataframe in Spark that looks like this (but with more rows), where each city column holds the number of visitors to my website.

| date        | New York | Los Angeles | Tokyo | London | Berlin | Paris |
|:----------- |:--------:| -----------:|------:|-------:|-------:|------:|
| 2022-01-01  | 150000   | 1589200     | 500120| 120330 |95058331|980000 |

I wanted to order the columns based on this list of cities (they are ordered according to their importance to me):

order = ["Paris", "Berlin", "London", "New York", "Los Angeles", "Tokyo"]

In the end, I need a dataframe like this. Is there any way to create a function that performs this ordering every time I need it? Expected result below:

| date        | Paris    | Berlin  | London | New York | Los Angeles | Tokyo |
|:----------- |:--------:| -------:|-------:|---------:|------------:|------:|
| 2022-01-01  | 980000   | 95058331| 120330 | 150000   | 1589200     | 500120| 

Thank you!

CodePudding user response:

Your example:

    df_example = spark.createDataFrame(
        [
            ('2022-01-01', '150000', '1589200', '500120', '120330', '95058331', '980000')
        ],
        ['date', 'New York', 'Los Angeles', 'Tokyo', 'London', 'Berlin', 'Paris']
    )

    order = ['Paris', 'Berlin', 'London', 'New York', 'Los Angeles', 'Tokyo']

Now, a simple function to reorder:

    def order_func(df, order_list):
        # keep 'date' first, then the cities in the requested order
        return df.select('date', *order_list)

    result_df = order_func(df_example, order)
    result_df.show()
    +----------+------+--------+------+--------+-----------+------+
    |      date| Paris|  Berlin|London|New York|Los Angeles| Tokyo|
    +----------+------+--------+------+--------+-----------+------+
    |2022-01-01|980000|95058331|120330|  150000|    1589200|500120|
    +----------+------+--------+------+--------+-----------+------+
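If the requested cities might not all exist in the dataframe (or if new city columns can appear later), a small helper can build the final column list defensively before calling `select`. This is a sketch, not part of the original answer: the helper name `ordered_columns` and the choice to append unlisted columns at the end are my assumptions. It is plain Python over `df.columns`, so it works for any dataframe:

```python
def ordered_columns(columns, order_list, keep_first=('date',)):
    """Build a column ordering: fixed leading columns, then the
    cities from order_list that actually exist, then anything else."""
    # cities from the wish list that are really present in the dataframe
    present = [c for c in order_list if c in columns]
    # remaining columns, excluding the fixed leading ones
    rest = [c for c in columns if c not in order_list and c not in keep_first]
    return list(keep_first) + present + rest

# usage with a Spark dataframe:
# result_df = df_example.select(*ordered_columns(df_example.columns, order))
```

Because missing cities are silently dropped, `select` never fails with an "unresolved column" error when the wish list and the dataframe drift apart.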