Home > OS >  split a dataframe by length of characters
split a dataframe by length of characters

Time:12-13

I have a table like

--------------------|
Val
--------------------|
1, M A ,HELLO,WORLD |
2, M 1A,HELLO WORLD |
---------------------

I want to split the above dataframe so it contains 3 cols below.

----------------------
a | b   | c          |
----------------------
1 | M A | HELLO,WORLD|
1 | M 1A| HELLO WORLD|
----------------------

I have used the below code but does not work as expected. Is there a way to contain all characters after 5 charcters in col c etc and character 2-5 in col b.

df = df.withColumn('Splitted', F.split(hadf37dr_df['Val'], ',')).withColumn('a', F.col('Splitted')[0]).withColumn('b', F.col('Splitted')[1]).withColumn('c', F.col('Splitted')[2])

CodePudding user response:

You can use df.Val.str.extract(...) to split a string column into multiple columns. https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html

If you are having this issue because the file you are reading is actually formatted in fixed-width instead of comma-separated for example you might want to use read_fwf(...) from the pandas library to avoid this problem in the first place. https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html

CodePudding user response:

why split a dataframe by length of characters to achieve goal for splitting based on fisrt occurance of coma? if so

df = pd.DataFrame([i.split(',', 1) for i in df.Val], columns=['b', 'c'])

Gives #

   b             c
0  M A   HELLO,WORLD 
1  M 1A  HELLO WORLD 

CodePudding user response:

if you want to this in pyspark you need F.concat_ws which concat elements of a list and return it as string and you need F.slice which slice elements of a list from 'head' with specified 'length'. because you need to set length you need size of array which you can have with F.size.

(
    df
    .withColumn('Splitted', F.split(df['val'], ','))
    .withColumn('a', F.col('Splitted')[0])
    .withColumn('b', F.col('Splitted')[1])
    .withColumn(
        'c', 
        F.concat_ws(
            ',',
            F.slice('Splitted', 3,  F.size('Splitted') - 2)))
).show()

 ------------------- -------------------- --- ----- ----------- 
|                val|            Splitted|  a|    b|          c|
 ------------------- -------------------- --- ----- ----------- 
|1, M A ,HELLO,WORLD|[1,  M A , HELLO,...|  1| M A |HELLO,WORLD|
|2, M 1A,HELLO WORLD|[2,  M 1A, HELLO ...|  2| M 1A|HELLO WORLD|
 ------------------- -------------------- --- ----- ----------- 

CodePudding user response:

If splitting the strings from your original table df with column 'Val' is just based on positions (as you write), you can slice them as follows to obtain df1

df1 = pd.DataFrame(columns = ['a', 'b', 'c'], 
                   data = [[row[0], row[2:7], row[8:]] for row in df.Val]])
  • Related