I'm trying to drop rows with null values in certain columns of a dataframe, but I'm getting a different number of rows in Python and Scala, even though I did the same thing in both: in Python I get 2127178 rows, and in Scala I get 8723 rows.
For example, in Python I did:
dfplaneairport.dropna(subset=["model"], inplace= True)
dfplaneairport.dropna(subset=["engine_type"], inplace= True)
dfplaneairport.dropna(subset=["aircraft_type"], inplace= True)
dfplaneairport.dropna(subset=["status"], inplace= True)
dfplaneairport.dropna(subset=["ArrDelay"], inplace= True)
dfplaneairport.dropna(subset=["issue_date"], inplace= True)
dfplaneairport.dropna(subset=["manufacturer"], inplace= True)
dfplaneairport.dropna(subset=["type"], inplace= True)
dfplaneairport.dropna(subset=["tailnum"], inplace= True)
dfplaneairport.dropna(subset=["DepDelay"], inplace= True)
dfplaneairport.dropna(subset=["TaxiOut"], inplace= True)
dfplaneairport.shape
(2127178, 32)
and in Spark Scala I did:
dfairports = dfairports.na.drop(Seq("engine_type", "aircraft_type", "status", "model", "issue_date", "manufacturer", "type", "ArrDelay", "DepDelay", "TaxiOut", "tailnum"))
dfairports.count()
8723
I was expecting the same number of rows and I don't know what I'm doing wrong.
I would appreciate any help.
Answer:
Welcome to Stack Overflow!
You don't seem to be using the PySpark dropna function, but the pandas one. Notice that you're passing the inplace argument, which does not exist in the PySpark function: pandas can mutate a DataFrame in place, whereas a Spark DataFrame is immutable, so PySpark's dropna always returns a new DataFrame that you have to assign back.
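For reference, here is a minimal sketch of the PySpark equivalent of your pandas chain, assuming dfplaneairport is a Spark DataFrame (the column names are copied from your question). Since there is no inplace argument, the result must be assigned back:

# PySpark: dropna returns a new DataFrame instead of mutating in place,
# so the result has to be assigned back to the variable
dfplaneairport = dfplaneairport.dropna(subset=[
    "model", "engine_type", "aircraft_type", "status", "ArrDelay",
    "issue_date", "manufacturer", "type", "tailnum", "DepDelay", "TaxiOut",
])
dfplaneairport.count()  # number of remaining rows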
Here are two bits of code (in Scala and in PySpark) that behave exactly the same way: by default, a row is dropped as soon as any of the listed columns is null.
Scala:
import spark.implicits._

val df = Seq(
  ("James", null, "Smith", "36636", "M", 3000),
  ("Michael", "Rose", null, "40288", "M", 4000),
  ("Robert", null, "Williams", "42114", "M", 4000),
  ("Maria", "Anne", "Jones", "39192", "F", 4000),
  ("Jen", "Mary", "Brown", null, "F", -1)
).toDF("firstname", "middlename", "lastname", "id", "gender", "salary")
df.show
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|   id|gender|salary|
+---------+----------+--------+-----+------+------+
|    James|      null|   Smith|36636|     M|  3000|
|  Michael|      Rose|    null|40288|     M|  4000|
|   Robert|      null|Williams|42114|     M|  4000|
|    Maria|      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown| null|     F|    -1|
+---------+----------+--------+-----+------+------+
df.na.drop(Seq("middlename", "lastname")).show
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|   id|gender|salary|
+---------+----------+--------+-----+------+------+
|    Maria|      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown| null|     F|    -1|
+---------+----------+--------+-----+------+------+
PySpark:
data = [("James",None,"Smith","36636","M",3000), ("Michael","Rose",None,"40288","M",4000),
("Robert",None,"Williams","42114","M",4000),
("Maria","Anne","Jones","39192","F",4000),
("Jen","Mary","Brown",None,"F",-1)
]
df = spark.createDataFrame(data, ["firstname", "middlename", "lastname", "id", "gender", "salary"])
df.show()
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|   id|gender|salary|
+---------+----------+--------+-----+------+------+
|    James|      null|   Smith|36636|     M|  3000|
|  Michael|      Rose|    null|40288|     M|  4000|
|   Robert|      null|Williams|42114|     M|  4000|
|    Maria|      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown| null|     F|    -1|
+---------+----------+--------+-----+------+------+
df.dropna(subset=["middlename", "lastname"]).show()
+---------+----------+--------+-----+------+------+
|firstname|middlename|lastname|   id|gender|salary|
+---------+----------+--------+-----+------+------+
|    Maria|      Anne|   Jones|39192|     F|  4000|
|      Jen|      Mary|   Brown| null|     F|    -1|
+---------+----------+--------+-----+------+------+
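Applied to your own data, a quick sanity check is to run the same drop in both engines and compare the counts. A hypothetical sketch, assuming dfplaneairport is your pandas DataFrame, dfairports is your Spark DataFrame, both loaded from the same source data, and cols is a helper list introduced here:

# Same subset of columns in both engines; the counts should match
# as long as both DataFrames hold the same underlying data
cols = ["model", "engine_type", "aircraft_type", "status", "ArrDelay",
        "issue_date", "manufacturer", "type", "tailnum", "DepDelay", "TaxiOut"]

pandas_rows = dfplaneairport.dropna(subset=cols).shape[0]  # pandas
spark_rows = dfairports.dropna(subset=cols).count()        # PySpark
print(pandas_rows, spark_rows)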
Hope this helps! :)