Delimiting pyspark .read.text() with regex


I'm trying to read a text file into a PySpark dataframe. The text file uses a varying number of spaces between fields, so a row could look something like:

Ryan A. Smith>>>Welder>>>>>>3200 Smith Street>>>>>99999

With spaces instead of arrows.

I need to split this into columns, but I don't know the right command for it. The fields are always separated by at least 2 spaces, so a regex delimiter seems perfect. However, I can't find a way to do this in PySpark.
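
For reference, this is roughly how the file is being read (the path here is hypothetical). spark.read.text() loads each line into a single string column named value, so all of the splitting has to happen afterwards:

from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Each line of the file becomes one row in a single string column named "value".
df = spark.read.text('people.txt')  # hypothetical path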

CodePudding user response:

We can try using split here to generate the columns you want:

from pyspark.sql.functions import split

# The fields are separated by runs of at least two spaces, so split on ' {2,}'.
df_new = (df
    .withColumn('name', split(df['col'], ' {2,}').getItem(0))
    .withColumn('occupation', split(df['col'], ' {2,}').getItem(1))
    .withColumn('address', split(df['col'], ' {2,}').getItem(2))
    .withColumn('number', split(df['col'], ' {2,}').getItem(3)))

This assumes that the text you showed above is in a column named col.
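
One possible refinement, as a sketch: split the line once into an array expression and index into it, rather than writing out the split for every field. This reads a bit cleaner and assumes the same col column as above:

from pyspark.sql.functions import split

parts = split(df['col'], ' {2,}')  # one split expression on runs of 2+ spaces
df_new = (df
    .withColumn('name', parts.getItem(0))
    .withColumn('occupation', parts.getItem(1))
    .withColumn('address', parts.getItem(2))
    .withColumn('number', parts.getItem(3)))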

CodePudding user response:

You could try first creating a list with column names and then applying split, as it accepts a regex pattern as a delimiter.

from pyspark.sql import functions as F

cols = ['name', 'job', 'address', 'id']
# ' {2,}' is a regex matching two or more consecutive spaces
df = df.select(
    [F.split('col_name', ' {2,}')[i].alias(c) for i, c in enumerate(cols)]
)
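
Putting it together with the file read (the file path below is an assumption for illustration): spark.read.text() names its single column value, so that is the column to split:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

cols = ['name', 'job', 'address', 'id']
# hypothetical path; read.text yields one string column named "value"
df = spark.read.text('people.txt')
df = df.select(
    [F.split('value', ' {2,}')[i].alias(c) for i, c in enumerate(cols)]
)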