I'm trying to remove duplicate numbers from a text file using PySpark (Python), but the deduplication applies only within each row. E.g. my text file is:
3
66
4
9
3
23
Below is the code that I tried:
import pyspark
from pyspark import SparkContext, SparkConf
from collections import OrderedDict

sc = SparkContext.getOrCreate()
data = sc.textFile('file.txt')
new_data = data.map(lambda x: list(OrderedDict.fromkeys(x)))
new_data.collect()
I get the output as: [['3'], ['6'], ['4'], ['9'], ['3'], ['2', '3']]
But I want: [3, 66, 4, 9, 23]
CodePudding user response:
You're mapping OrderedDict.fromkeys over every entry, and because each entry is a string, it deduplicates the characters within that single line, so you get back an RDD whose entries are lists of characters.
To simply get the unique rows of a DataFrame, use distinct():
from pyspark.sql import SparkSession

# Build (or reuse) a local Spark session
spark = SparkSession.builder \
    .master("local") \
    .appName("Unique Example") \
    .getOrCreate()

# Each line of the file becomes one row in a single 'value' column
df = spark.read.text("file.txt")
df.distinct().show()
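On the sample file above, this should show the five unique lines (3, 66, 4, 9, 23, as strings in the value column), although show() makes no guarantee about their order.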
Note that this uses the Spark SQL DataFrame API, which is the preferred way of working for most operations. Your code uses RDDs, which also have a distinct() function.
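If you prefer to stay with RDDs, a minimal sketch along the same lines (assuming the same single-column file.txt) could look like this; note that distinct() involves a shuffle, so the original order of the values is not guaranteed:

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# Parse each line into an int, then drop duplicate values across the whole RDD
numbers = sc.textFile('file.txt').map(int).distinct()
print(numbers.collect())  # e.g. [3, 66, 4, 9, 23], order may vary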
CodePudding user response:
I assume you are reading a text file with a single column of data containing only numbers, as you have shown. Here are a few possible solutions.
1. Removing duplicates
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# drop_duplicates() is an alias of distinct() for DataFrames
df = spark.read.text("file.txt").drop_duplicates()
df.show()
2. If you would like to target a specific character position within each line of the text file, create a new column for that substring and apply the same process (a concrete example follows the snippet):
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.getOrCreate()

df = spark.read.text("file.txt")
# starting_position and length are placeholders for the slice you want
df = df.withColumn("col1", col('value').substr(starting_position, length))
df.select("col1").drop_duplicates().show()
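For instance, filling in the placeholders to keep only the first character of each line (Spark's substr() uses 1-based indexing):

# Keep the one-character substring starting at position 1, then deduplicate
df = df.withColumn("col1", col('value').substr(1, 1))
df.select("col1").drop_duplicates().show()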