I'm trying to read a text file into a PySpark dataframe. The text file uses a varying number of spaces as delimiters. So a row could be something like:
Ryan A. Smith>>>Welder>>>>>>3200 Smith Street>>>>>99999
With spaces instead of arrows.
I need to split this into columns, but I don't know which command to use. The fields are always separated by at least 2 spaces, so regex seems perfect. However, I can't find a way to do this in PySpark.
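For context, I'm loading the file roughly like this with spark.read.text, which gives a single string column (the path and column name here are just placeholders):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
# spark.read.text reads each line into a single string column named 'value'
df = spark.read.text('people.txt').withColumnRenamed('value', 'col')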
CodePudding user response:
We can try using split here to generate the columns you want. Its second argument is treated as a regex, so we can split on runs of two or more spaces:
from pyspark.sql.functions import split

# the pattern is a regex: ' {2,}' matches two or more consecutive spaces
df_new = (df.withColumn('name', split(df['col'], ' {2,}').getItem(0))
            .withColumn('occupation', split(df['col'], ' {2,}').getItem(1))
            .withColumn('address', split(df['col'], ' {2,}').getItem(2))
            .withColumn('number', split(df['col'], ' {2,}').getItem(3)))
This assumes that the text you showed above is in a column named col.
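For a quick sanity check, here is a one-row dataframe built from your sample with the arrows replaced by spaces (the column name col is assumed):

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()
row = 'Ryan A. Smith   Welder      3200 Smith Street     99999'
df = spark.createDataFrame([(row,)], ['col'])
# after the transformations above:
# name = 'Ryan A. Smith', occupation = 'Welder',
# address = '3200 Smith Street', number = '99999'

Note that single spaces, like those inside 'Ryan A. Smith', are not matched by ' {2,}', so multi-word fields stay intact.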
CodePudding user response:
You could try first creating a list with column names and then applying split, since it accepts a regex pattern as the delimiter.
from pyspark.sql import functions as F

cols = ['name', 'job', 'address', 'id']
# ' {2,}' is a regex matching two or more spaces; 'col_name' is the input column
df = df.select(
    [F.split('col_name', ' {2,}')[i].alias(c) for i, c in enumerate(cols)]
)
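On your sample row this produces one column per entry in cols. A minimal runnable check (the input column name col_name is assumed):

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()
row = 'Ryan A. Smith   Welder      3200 Smith Street     99999'
df = spark.createDataFrame([(row,)], ['col_name'])
cols = ['name', 'job', 'address', 'id']
df = df.select([F.split('col_name', ' {2,}')[i].alias(c) for i, c in enumerate(cols)])
df.show(truncate=False)
# +-------------+------+-----------------+-----+
# |name         |job   |address          |id   |
# +-------------+------+-----------------+-----+
# |Ryan A. Smith|Welder|3200 Smith Street|99999|
# +-------------+------+-----------------+-----+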