I'm trying to find a column based on a value (I don't know the name of the column in advance). For example, in the dataframe below, I'd like to know which column contains yellow for the row where Category = 'A'. Since I don't know the column name (colour) ahead of time, I can't simply do select * where Category = 'A' and colour = 'yellow'.
How can I scan the columns and achieve this? Many thanks for your help.
+--------+------+----+
|Category|colour|name|
+--------+------+----+
|       A|  blue|Elmo|
|       A|yellow|Alex|
|       B|  desc|Erin|
+--------+------+----+
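For reference, a minimal sketch to recreate this sample dataframe (assuming an active SparkSession named spark; the variable names are illustrative):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Recreation of the sample data shown above (names are assumptions)
df = spark.createDataFrame(
    [('A', 'blue', 'Elmo'), ('A', 'yellow', 'Alex'), ('B', 'desc', 'Erin')],
    ['Category', 'colour', 'name'],
)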
CodePudding user response:
You can loop through the list of column names and run the check for each one. You can also wrap the loop in a function for readability. Note that the checks run in sequence, so each column's count() triggers its own Spark job.
from pyspark.sql import functions as F

# For each column, count the rows where Category is 'A' and the column's
# value is 'yellow'; print the column name if any such row exists.
for c in df.columns:
    cnt = df.where((F.col('Category') == 'A') & (F.col(c) == 'yellow')).count()
    if cnt > 0:
        print(c)
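Since one count() per column means one Spark job per column, a wide dataframe can make this slow. A single-pass sketch that checks every column in one aggregation (assuming all columns can be cast to string for the comparison; variable names are illustrative):

from pyspark.sql import functions as F

# Flag each column with 1 if any Category = 'A' row holds 'yellow', else 0,
# then collect all the flags with a single action.
flags = (
    df.where(F.col('Category') == 'A')
      .select([
          F.max(F.when(F.col(c).cast('string') == 'yellow', 1).otherwise(0)).alias(c)
          for c in df.columns
      ])
      .collect()[0]
)
matching = [c for c in df.columns if flags[c] == 1]
print(matching)  # e.g. ['colour']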