I have a table like
--------------------|
Val
--------------------|
1, M A ,HELLO,WORLD |
2, M 1A,HELLO WORLD |
---------------------
I want to split the above dataframe so it contains 3 cols below.
----------------------
a | b | c |
----------------------
1 | M A | HELLO,WORLD|
1 | M 1A| HELLO WORLD|
----------------------
I have used the below code but does not work as expected. Is there a way to contain all characters after 5 charcters in col c etc and character 2-5 in col b.
df = df.withColumn('Splitted', F.split(hadf37dr_df['Val'], ',')).withColumn('a', F.col('Splitted')[0]).withColumn('b', F.col('Splitted')[1]).withColumn('c', F.col('Splitted')[2])
CodePudding user response:
You can use df.Val.str.extract(...)
to split a string column into multiple columns.
https://pandas.pydata.org/docs/reference/api/pandas.Series.str.extract.html
If you are having this issue because the file you are reading is actually formatted in fixed-width instead of comma-separated for example you might want to use read_fwf(...)
from the pandas library to avoid this problem in the first place.
https://pandas.pydata.org/docs/reference/api/pandas.read_fwf.html
CodePudding user response:
why split a dataframe by length of characters
to achieve goal for splitting based on fisrt occurance of coma? if so
df = pd.DataFrame([i.split(',', 1) for i in df.Val], columns=['b', 'c'])
Gives #
b c
0 M A HELLO,WORLD
1 M 1A HELLO WORLD
CodePudding user response:
if you want to this in pyspark you need F.concat_ws which concat elements of a list and return it as string and you need F.slice which slice elements of a list from 'head' with specified 'length'. because you need to set length you need size of array which you can have with F.size.
(
df
.withColumn('Splitted', F.split(df['val'], ','))
.withColumn('a', F.col('Splitted')[0])
.withColumn('b', F.col('Splitted')[1])
.withColumn(
'c',
F.concat_ws(
',',
F.slice('Splitted', 3, F.size('Splitted') - 2)))
).show()
------------------- -------------------- --- ----- -----------
| val| Splitted| a| b| c|
------------------- -------------------- --- ----- -----------
|1, M A ,HELLO,WORLD|[1, M A , HELLO,...| 1| M A |HELLO,WORLD|
|2, M 1A,HELLO WORLD|[2, M 1A, HELLO ...| 2| M 1A|HELLO WORLD|
------------------- -------------------- --- ----- -----------
CodePudding user response:
If splitting the strings from your original table df with column 'Val' is just based on positions (as you write), you can slice them as follows to obtain df1
df1 = pd.DataFrame(columns = ['a', 'b', 'c'],
data = [[row[0], row[2:7], row[8:]] for row in df.Val]])