I have a text file that uses "|~" as the delimiter. How can I remove it while loading the text file as a DataFrame in PySpark?
CodePudding user response:
One way is to parse the text file in plain Python, build a pandas DataFrame from it, and then convert that to a Spark DataFrame.
import pandas as pd
from pyspark.sql import SparkSession

# reuse the active session if there is one, otherwise start a new one
spark = SparkSession.builder.getOrCreate()

name = []
age = []
gender = []
with open('test.txt', 'r') as file:
    for i, line in enumerate(file):
        if i == 0:
            continue  # skip the header row
        # split on '|' and strip the leftover '~' from each field
        fields = [f.replace('~', '').strip() for f in line.split('|')]
        name.append(fields[0])
        age.append(fields[1])
        gender.append(fields[2])

df = pd.DataFrame({'name': name, 'age': age, 'gender': gender})
spark_df = spark.createDataFrame(df)
spark_df.show()
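If the file has more columns than these three, the same parsing step can be written without hard-coding each column. This is only a minimal sketch, assuming the first line is the header and every row uses |~ between fields; parse_delimited is a made-up helper name, not something from pandas or PySpark:

```python
def parse_delimited(lines, sep="|~"):
    """Split each non-empty line on `sep` and return (header, rows)."""
    # str.split accepts a multi-character separator, so there is no need
    # to split on '|' first and strip the '~' afterwards.
    records = [
        [field.strip() for field in line.split(sep)]
        for line in lines
        if line.strip()  # ignore blank lines
    ]
    if not records:
        return [], []
    return records[0], records[1:]


# Sample lines in the same shape as the file from the question.
sample = ["name|~age|~gender\n", "rakhi|~24|~M\n", "Sujal|~23|~F\n"]
header, rows = parse_delimited(sample)
print(header)
print(rows)
```

With a real file you would pass the open file object instead of the sample list. The resulting rows and header should then be usable as spark.createDataFrame(rows, schema=header), which skips the pandas step. Note that every value stays a string here, just as in the code above.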
CodePudding user response:
Can't you just load it and use |~ as the delimiter? The CSV reader accepts multi-character delimiters from Spark 3.0 onwards.
Like this:
df = (
    spark.read
    .option("delimiter", "|~")
    .option("header", True)
    .format("csv")
    .load("file.txt")
)
df.show()
# output
+-----+---+------+
| name|age|gender|
+-----+---+------+
|rakhi| 24|     M|
|Sujal| 23|     F|
+-----+---+------+