I have 2 dataframes:
dataframe A:
column1  column2  column3
a        25       ast
b                 phone
c        32       dlp
d        45
e        60       phq
dataframe B:
column1  column2  column3
a        25       ast
b        15       phone
c        32       dlp
d        45       rare
e        60       phq
I want to compare both dataframes and write all the missing values to another dataframe:
output should be like dataframe C:
column1  column2  column3
a
b        15
c
d                 rare
e
Missing values can be in either dataframe A or dataframe B. How can I achieve this using PySpark DataFrames?
CodePudding user response:
Assume your first dataset is called first and your second dataset is called second.
import org.apache.spark.sql.functions._

// we first have to rename the columns so we can distinguish which one belongs to which
first = first
  .withColumnRenamed("column2", "fir_column2")
  .withColumnRenamed("column3", "fir_column3")
second = second
  .withColumnRenamed("column2", "sec_column2")
  .withColumnRenamed("column3", "sec_column3")
Then we join the two and collect each pair of columns into an array:
val pairs = first
  // join the tables on the unique key, in our case, `column1`
  .join(second, Seq("column1"), "left")
  // collect the pairs into arrays, so we can check for nulls
  .withColumn("column2_pairs", array("fir_column2", "sec_column2"))
  .withColumn("column3_pairs", array("fir_column3", "sec_column3"))
  // we then select just the arrays, for the sake of explanation
  .select("column1", "column2_pairs", "column3_pairs")
The output so far:
+-------+-------------+--------------+
|column1|column2_pairs|column3_pairs |
+-------+-------------+--------------+
|a      |[25, 25]     |[ast, ast]    |
|b      |[null, 15]   |[phone, phone]|
|c      |[32, 32]     |[dlp, dlp]    |
|d      |[45, 45]     |[null, rare]  |
|e      |[60, 60]     |[phq, phq]    |
+-------+-------------+--------------+
What we want to do now is operate on the columns column2_pairs and column3_pairs: if exactly one value in a pair is null, select the other one (the value that is missing on one side); if both values are present, or both are missing, leave the result null. This can be done with:
val newTable = pairs
  // in the collected array, if one of them is null, select the other value (for column2)
  .withColumn("missing_2", expr("filter(column2_pairs, x -> x is not null)"))
  .withColumn("missing_2_value", when(size(col("missing_2")).equalTo(1), col("missing_2").getItem(0)))
  // in the collected array, if one of them is null, select the other value (for column3)
  .withColumn("missing_3", expr("filter(column3_pairs, x -> x is not null)"))
  .withColumn("missing_3_value", when(size(col("missing_3")).equalTo(1), col("missing_3").getItem(0)))
  // do the final selection
  .select("column1", "missing_2_value", "missing_3_value")
  // renamings to get to your desired format
  .withColumnRenamed("missing_2_value", "column2")
  .withColumnRenamed("missing_3_value", "column3")
Final output:
+-------+-------+-------+
|column1|column2|column3|
+-------+-------+-------+
|a      |null   |null   |
|b      |15     |null   |
|c      |null   |null   |
|d      |null   |rare   |
|e      |null   |null   |
+-------+-------+-------+
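Since the question asks for PySpark, the same approach translates roughly like this. This is only a sketch, assuming the two DataFrames are named first and second, that missing cells are nulls, and that the intermediate names (first_r, second_r, pairs, result) are placeholders of my choosing:

from pyspark.sql import functions as F

# rename so we can tell which side each value came from
first_r = (first
    .withColumnRenamed("column2", "fir_column2")
    .withColumnRenamed("column3", "fir_column3"))
second_r = (second
    .withColumnRenamed("column2", "sec_column2")
    .withColumnRenamed("column3", "sec_column3"))

pairs = (first_r
    .join(second_r, ["column1"], "left")
    .withColumn("column2_pairs", F.array("fir_column2", "sec_column2"))
    .withColumn("column3_pairs", F.array("fir_column3", "sec_column3")))

result = (pairs
    # drop the nulls from each pair
    .withColumn("missing_2", F.expr("filter(column2_pairs, x -> x is not null)"))
    .withColumn("missing_3", F.expr("filter(column3_pairs, x -> x is not null)"))
    # a value survives only if exactly one side had it, i.e. it was missing on the other side
    .withColumn("column2", F.when(F.size("missing_2") == 1, F.col("missing_2")[0]))
    .withColumn("column3", F.when(F.size("missing_3") == 1, F.col("missing_3")[0]))
    .select("column1", "column2", "column3"))

result.show()

As with the Scala version, the left join assumes every key in second also appears in first; if rows can be missing from either side, a "full" join would be the safer choice.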
Good luck!
CodePudding user response:
You can also do it with pandas like this:
import pandas as pd

# empty result frame with the same columns and row count as the inputs
heywtu = pd.DataFrame(columns=['column1', 'column2', 'column3'], index=range(first.shape[0]))
# wherever a cell is empty in first (missing values are empty strings), copy it from second
for i in first.columns:
    for j in range(first.shape[0]):
        if first.loc[j, i] == '':
            heywtu.loc[j, i] = second.loc[j, i]
# carry over the key column and show blanks instead of NaN
heywtu['column1'] = first['column1'].values
print(heywtu.fillna(' '))
Output:
  column1 column2 column3
0       a
1       b      15
2       c
3       d            rare
4       e
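For reference, a more vectorized sketch of the same idea. Like the loop above, it only fills cells that are missing in first, and it assumes missing cells are stored as empty strings and that both frames share the same index and columns:

import pandas as pd

# keep second's value only where the corresponding cell in first is empty,
# and blank out everything else
heywtu = second.where(first == '', '')
# the key column was blanked too, so restore it from first
heywtu['column1'] = first['column1']
print(heywtu)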