How to remove duplicate numbers from text file using pyspark python


I'm trying to remove duplicate numbers from a text file using PySpark (Python), but the deduplication is applied within each row instead of across the whole file. For example, my text file is:

3  
66  
4  
9  
3  
23 

Below is the code that I tried:

import pyspark
from pyspark import SparkContext, SparkConf
from collections import OrderedDict
sc = SparkContext.getOrCreate()
data = sc.textFile('file.txt')
new_data = data.map(lambda x: list(OrderedDict.fromkeys(x)))
new_data.collect()

I get the output as: [['3'], ['6'], ['4'], ['9'], ['3'], ['2', '3']]

But I want: [3, 66, 4, 9, 23]

CodePudding user response:

You're mapping OrderedDict.fromkeys over each line, which treats the line as a string of characters and returns an RDD whose entries are lists of characters rather than numbers.

To simply get the unique rows of a DataFrame, use distinct():

import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder\
       .master("local")\
       .appName("Unique Example")\
       .getOrCreate()

df = spark.read.text("file.txt")
df.distinct().show()

Note that this uses the Spark SQL DataFrame API, which is the preferred way of working for most operations. Your code uses RDDs, which also have a distinct() function.
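For completeness, here is a minimal sketch of that RDD route, assuming the same file.txt with one number per line; textFile, map and distinct are all standard RDD methods.

from pyspark import SparkContext

sc = SparkContext.getOrCreate()

# one string per line, e.g. "66"; parse to int, then drop duplicate values
numbers = sc.textFile("file.txt") \
            .map(lambda line: int(line.strip())) \
            .distinct()

print(numbers.collect())  # e.g. [3, 66, 4, 9, 23] (order is not guaranteed)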

CodePudding user response:

I assume you are reading a text file with a single column of data containing only numbers, as you have shown. Here are a few possible solutions.

1. Removing duplicates

from pyspark.sql import SparkSession
spark = SparkSession.builder.getOrCreate()

df = spark.read.text("file.txt").drop_duplicates()
df.show()
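If you need the result as a plain Python list of ints, like the [3, 66, 4, 9, 23] in the question, a sketch along these lines should work (cast and collect are standard DataFrame operations; the order of a distinct result is not guaranteed):

from pyspark.sql.functions import col

# cast the text column to int, deduplicate, then pull the values back to the driver
values = [row[0] for row in
          df.select(col("value").cast("int")).drop_duplicates().collect()]
print(values)  # e.g. [3, 66, 4, 9, 23]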

2. If you would like to target a specific character position within each line of the text file, create a new column for that substring and apply the same process.

from pyspark.sql import SparkSession
from pyspark.sql.functions import col
spark = SparkSession.builder.getOrCreate()

df = spark.read.text("file.txt")
# replace starting_position and length with the character range you want to keep
df = df.withColumn("col1", col('value').substr(starting_position, length))
df.select("col1").drop_duplicates().show()
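Note that Column.substr uses 1-based positions, so substr(1, 2), for example, would take the first two characters of each line.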