Delete rows from PySpark DataFrame that match the header


I have a huge dataframe similar to this:

l = [('20190503', 'par1', 'feat2', '0x0'),
     ('20190503', 'par1', 'feat3', '0x01'),
     ('date', 'part', 'feature', 'value'),
     ('20190501', 'par5', 'feat9', '0x00'),
     ('20190506', 'par8', 'feat2', '0x00f45'),
     ('date', 'part', 'feature', 'value'),
     ('20190501', 'par11', 'feat3', '0x000000000'),
     ('date', 'part', 'feature', 'value'),
     ('20190501', 'par3', 'feat9', '0x000'),
     ('20190501', 'par6', 'feat5', '0x000000'),
     ('date', 'part', 'feature', 'value'),
     ('20190506', 'par8', 'feat1', '0x00000'),
     ('20190508', 'par3', 'feat6', '0x00000000'),
     ('20190503', 'par4', 'feat3', '0x0c0deffe21'),
     ('20190503', 'par6', 'feat4', '0x0000000000'),
     ('20190501', 'par3', 'feat6', '0x0123fe'),
     ('20190501', 'par7', 'feat4', '0x00000d0')]

columns = ['date', 'part', 'feature', 'value']

df = spark.createDataFrame(l, columns)


+--------+-----+-------+------------+
|    date| part|feature|       value|
+--------+-----+-------+------------+
|20190503| par1|  feat2|         0x0|
|20190503| par1|  feat3|        0x01|
|    date| part|feature|       value|
|20190501| par5|  feat9|        0x00|
|20190506| par8|  feat2|     0x00f45|
|    date| part|feature|       value|
|20190501|par11|  feat3| 0x000000000|
|    date| part|feature|       value|
|20190501| par3|  feat9|       0x000|
|20190501| par6|  feat5|    0x000000|
|    date| part|feature|       value|
|20190506| par8|  feat1|     0x00000|
|20190508| par3|  feat6|  0x00000000|
|20190503| par4|  feat3|0x0c0deffe21|
|20190503| par6|  feat4|0x0000000000|
|20190501| par3|  feat6|    0x0123fe|
|20190501| par7|  feat4|   0x00000d0|
+--------+-----+-------+------------+

It contains rows that match the header, and I want to drop all of them so that the result is:

+--------+-----+-------+------------+
|    date| part|feature|       value|
+--------+-----+-------+------------+
|20190503| par1|  feat2|         0x0|
|20190503| par1|  feat3|        0x01|
|20190501| par5|  feat9|        0x00|
|20190506| par8|  feat2|     0x00f45|
|20190501|par11|  feat3| 0x000000000|
|20190501| par3|  feat9|       0x000|
|20190501| par6|  feat5|    0x000000|
|20190506| par8|  feat1|     0x00000|
|20190508| par3|  feat6|  0x00000000|
|20190503| par4|  feat3|0x0c0deffe21|
|20190503| par6|  feat4|0x0000000000|
|20190501| par3|  feat6|    0x0123fe|
|20190501| par7|  feat4|   0x00000d0|
+--------+-----+-------+------------+

I tried to get rid of them with .distinct(), but one copy is always left.
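For reference, a minimal sketch of that attempt (assuming df is the DataFrame built above); .distinct() deduplicates the repeated header rows, but the one surviving ('date', 'part', 'feature', 'value') row still matches the header:

# distinct() collapses identical rows, so the duplicated header
# rows are reduced to a single copy -- which is still present
df.distinct().show()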

How can I do it?

CodePudding user response:

This would work: it chains one filter per column, and Spark takes care of merging them into a single predicate when it creates the physical plan.

from pyspark.sql import functions as F

# Drop every row where a column's value equals that column's own name
for col in df.schema.names:
    df = df.filter(F.col(col) != col)

df.show()
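If you prefer a single filter call, the same per-column conditions can be combined with functools.reduce; a minimal sketch of that equivalent form (condition and df_clean are illustrative names, not from the original answer):

from functools import reduce
from pyspark.sql import functions as F

# Keep a row only if no column holds its own column name
condition = reduce(lambda a, b: a & b,
                   [F.col(c) != c for c in df.schema.names])
df_clean = df.filter(condition)
df_clean.show()

Either way, df.explain() should show the conditions merged into one combined filter in the physical plan.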

Input: the DataFrame from the question, with the embedded header rows.

Output: the same DataFrame with every header-matching row removed, identical to the expected result shown above.
