Home > Net >  create links dynamically with pyspark from nodes and reference dataframe
create links dynamically with pyspark from nodes and reference dataframe

Time:12-07

I want to make a new dataframe based on these dataframes:

DF1:

| Name  | ID  | Group    |
| ----- | --- | -------- |
| A     | 0   | mgr      |
| B     | 1   | mgr      |
| C     | 2   | mgr      |
| D     | 3   | hr       |
| E     | 4   | hr       |
| F     | 5   | hr       |
| G     | 6   | adm      |
| H     | 7   | adm      |
| I     | 8   | adm      |

DF2:

| Mgrs  | HR    | Admin    | Value |
| ------| ----- | -------- | ----- |
| A     | D     | G        | .0010 |
| B     | E     | H        | .0002 |
| C     | F     | I        | .0035 |

I want to iterate over DF2 by DF1 and produce a new dataframe:

DF3:

| From  | To  | Value|
| ----- | --- | -----|
| 0     | 3   | .0010|
| 1     | 4   | .0002|
| 2     | 5   | .0035|
| 3     | 6   | .0010|
| 4     | 7   | .0002| 
| 5     | 8   | .0035|

I’m wanting to start at the top of DF1 (A) to retrieve ID (0), place that in the “From” column in DF3, and then the “To” column key would be the right adjacent column in DF2 (some of DF2’s column headers correspond to the values in the Group column in DF1). I also want to preserve the Value column into DF3, corresponding to DF2’s rows. Any suggestions on how to accomplish this with PySpark? I was thinking perhaps a join of some kind might work too.

CodePudding user response:

As you said, a couple of joins would work here

Starting with getting IDs
temp = (df2
    .join(df1, on=df1['Name'] == df2['Mgrs'], how='left')
    .select(F.col('ID').alias('Mgrs'), 'HR', 'Admin', 'Value')
 
    .join(df1, on=df1['Name'] == df2['HR'], how='left')
    .select('Mgrs', F.col('ID').alias('HR'), 'Admin', 'Value')
 
    .join(df1, on=df1['Name'] == df2['Admin'], how='left')
    .select('Mgrs', 'HR', F.col('ID').alias('Admin'), 'Value')
)

 ---- --- ----- ------ 
|Mgrs| HR|Admin| Value|
 ---- --- ----- ------ 
|   0|  3|    6| 0.001|
|   1|  4|    7|2.0E-4|
|   2|  5|    8|0.0035|
 ---- --- ----- ------ 
Then process your mapping logic
one = temp.select(F.col('Mgrs').alias('from'), F.col('HR').alias('to'), 'Value')
two = temp.select(F.col('HR').alias('from'), F.col('Admin').alias('to'), 'Value')
one.union(two).show()

 ---- --- ------ 
|from| to| Value|
 ---- --- ------ 
|   0|  3| 0.001|
|   1|  4|2.0E-4|
|   2|  5|0.0035|
|   3|  6| 0.001|
|   4|  7|2.0E-4|
|   5|  8|0.0035|
 ---- --- ------ 
  • Related