Home > database >  Pyspark Dataframe Iterate Array Columns
Pyspark Dataframe Iterate Array Columns

Time:06-29

In PySpark, I have a dataframe I'm trying to parse multiple columns with arrays. The last two rows in the dataframe contains multiple values I would like to parse into separate rows.

 ------------- --------------- ------------- -------------------- -------------- ------------- ---------------------- -------------- 
| WB-API-CNTY | WB-API-UNIQUE | WB-OIL-CODE | WB-OIL-LSE-NBR     | WB-OIL-DIST  | WB-GAS-CODE | WB-GAS-RRC-ID        | WB-GAS-DIS   |
 ------------- --------------- ------------- -------------------- -------------- ------------- ---------------------- -------------- 
| 449         | 80212         | []          | []                 | []           | []          | []                   | []           |
 ------------- --------------- ------------- -------------------- -------------- ------------- ---------------------- -------------- 
| 449         | 80214         | ["O"]       | ["05361"]          | ["06"]       | ["O"]       | ["060536"]           | ["00"]       |
 ------------- --------------- ------------- -------------------- -------------- ------------- ---------------------- -------------- 
| 449         | 80222         | ["O", "O"]  | ["01718", "05492"] | ["06", "06"] | ["O", "O"]  | ["060171", "060549"] | ["00", "00"] |
 ------------- --------------- ------------- -------------------- -------------- ------------- ---------------------- -------------- 
| 451         | 00005         | ["G", "O"]  | ["5568", "04351"]  | ["10", "09"] | ["G", "O"]  | ["105568", "090435"] | ["09", "00"] |
 ------------- --------------- ------------- -------------------- -------------- ------------- ---------------------- -------------- 

Results:

 ------------- --------------- ------------- ---------------- ------------- ------------- --------------- ------------ 
| WB-API-CNTY | WB-API-UNIQUE | WB-OIL-CODE | WB-OIL-LSE-NBR | WB-OIL-DIST | WB-GAS-CODE | WB-GAS-RRC-ID | WB-GAS-DIS |
 ------------- --------------- ------------- ---------------- ------------- ------------- --------------- ------------ 
| 449         | 80212         |             |                |             |             |               |            |
 ------------- --------------- ------------- ---------------- ------------- ------------- --------------- ------------ 
| 449         | 80214         | O           | 05361          | 06          | O           | 060536        | 00         |
 ------------- --------------- ------------- ---------------- ------------- ------------- --------------- ------------ 
| 449         | 80222         | O           | 01718          | 06          | O           | 060171        | 00         |
 ------------- --------------- ------------- ---------------- ------------- ------------- --------------- ------------ 
| 449         | 80222         | O           | 05492          | 06          | O           | 060549        | 00         |
 ------------- --------------- ------------- ---------------- ------------- ------------- --------------- ------------ 
| 451         | 00005         | G           | 5568           | 10          | G           | 105568        | 09         |
 ------------- --------------- ------------- ---------------- ------------- ------------- --------------- ------------ 
| 451         | 00005         | O           | 04351          | 09          | O           | 090435        | 00         |
 ------------- --------------- ------------- ---------------- ------------- ------------- --------------- ------------ 

CodePudding user response:

array_cols = ['WB-OIL-CODE', 'WB-OIL-LSE-NBR', 'WB-OIL-DIST', 'WB-GAS-CODE', 'WB-GAS-RRC-ID', 'WB-GAS-DIS']
other_cols = [c for c in df.columns if c not in array_cols]
df = df.select(
    *other_cols,
    F.expr(f"inline(arrays_zip({'`'   '`,`'.join(array_cols)   '`'}))")
)
  • Related