split a dataframe by length of characters-CodePudding

I have a table like

--------------------|
Val
--------------------|
1, M A ,HELLO,WORLD |
2, M 1A,HELLO WORLD |
---------------------

I want to split the above dataframe so it contains 3 cols below.

----------------------
a | b   | c          |
----------------------
1 | M A | HELLO,WORLD|
1 | M 1A| HELLO WORLD|
----------------------

I have used the below code but does not work as expected. Is there a way to contain all characters after 5 charcters in col c etc and character 2-5 in col b.

df = df.withColumn('Splitted', F.split(hadf37dr_df['Val'], ',')).withColumn('a', F.col('Splitted')[0]).withColumn('b', F.col('Splitted')[1]).withColumn('c', F.col('Splitted')[2])

CodePudding user response：

You can use df.Val.str.extract(...) to split a string column into multiple columns. https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html

If you are having this issue because the file you are reading is actually formatted in fixed-width instead of comma-separated for example you might want to use read_fwf(...) from the pandas library to avoid this problem in the first place. https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html

CodePudding user response：

why split a dataframe by length of characters to achieve goal for splitting based on fisrt occurance of coma? if so

df = pd.DataFrame([i.split(',', 1) for i in df.Val], columns=['b', 'c'])

Gives #

   b             c
0  M A   HELLO,WORLD 
1  M 1A  HELLO WORLD

CodePudding user response：

if you want to this in pyspark you need F.concat_ws which concat elements of a list and return it as string and you need F.slice which slice elements of a list from 'head' with specified 'length'. because you need to set length you need size of array which you can have with F.size.

(
    df
    .withColumn('Splitted', F.split(df['val'], ','))
    .withColumn('a', F.col('Splitted')[0])
    .withColumn('b', F.col('Splitted')[1])
    .withColumn(
        'c', 
        F.concat_ws(
            ',',
            F.slice('Splitted', 3,  F.size('Splitted') - 2)))
).show()

 ------------------- -------------------- --- ----- ----------- 
|                val|            Splitted|  a|    b|          c|
 ------------------- -------------------- --- ----- ----------- 
|1, M A ,HELLO,WORLD|[1,  M A , HELLO,...|  1| M A |HELLO,WORLD|
|2, M 1A,HELLO WORLD|[2,  M 1A, HELLO ...|  2| M 1A|HELLO WORLD|
 ------------------- -------------------- --- ----- -----------

CodePudding user response：

If splitting the strings from your original table df with column 'Val' is just based on positions (as you write), you can slice them as follows to obtain df1

df1 = pd.DataFrame(columns = ['a', 'b', 'c'], 
                   data = [[row[0], row[2:7], row[8:]] for row in df.Val]])