I am trying to work with lists in pandas cells, which may not be the best idea. Still, I find that the pandas.Series.str.len method does not behave as shown in the documentation.
Here is my code:
import pandas as pd

df3 = pd.DataFrame({"comas": [
    "1,2,3",
    "4,5,6,7,8,9",
    "7,8",
    "9,10,11,12",
]})
# Split each string into a list and store it in a new column
df3["split"] = df3["comas"].str.split()
From there I obtain the following dataframe, as expected:
         comas          split
0        1,2,3        [1,2,3]
1  4,5,6,7,8,9  [4,5,6,7,8,9]
2          7,8          [7,8]
3   9,10,11,12   [9,10,11,12]
Now I want to know the length of each list.
df3["split"].str.len()
and I get
0 1
1 1
2 1
3 1
Name: split, dtype: int64
The documentation, however, shows this example:
s = pd.Series(['dog',
'',
5,
{'foo' : 'bar'},
[2, 3, 5, 7],
('one', 'two', 'three')])
s
0 dog
1
2 5
3 {'foo': 'bar'}
4 [2, 3, 5, 7]
5 (one, two, three)
dtype: object
s.str.len()
0 3.0
1 0.0
2 NaN
3 1.0
4 4.0
5 3.0
dtype: float64
Can somebody explain the difference between the list at index 4 of the example series and the lists in my series? I am using pandas version 1.3.3.
Thank you in advance!
Edit:
RoseGod is right, I did not pass the separator to split. I was confused because Jupyter displays the elements in the dataframe almost identically whether or not the string was actually split:
df3.loc[0]
comas 1,2,3
split [1,2,3]
split_separator [1, 2, 3]
Name: 0, dtype: object
It does show the element as a single string if I get a single cell, though:
df3.loc[0,"split"]
['1,2,3']
df3.loc[0,"split_separator"]
['1', '2', '3']
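A minimal sketch of one way to make the difference visible for a whole column at once, rather than cell by cell. The names s, unsplit and split_on_comma are only for illustration; converting to a plain Python list, or mapping repr over the series, keeps the quotes around each element:

```python
import pandas as pd

# Single-row series used only for illustration
s = pd.Series(["1,2,3"])

unsplit = s.str.split()            # no separator given
split_on_comma = s.str.split(",")  # explicit separator

# tolist() returns plain Python objects, so the quotes are preserved
print(unsplit.tolist())         # [['1,2,3']]
print(split_on_comma.tolist())  # [['1', '2', '3']]

# map(repr) turns each cell into its repr string, quotes included
print(unsplit.map(repr).iloc[0])  # ['1,2,3']
```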
Answer:
Called without arguments, str.split() splits on whitespace, and your strings contain none. You need to pass the separator explicitly:
df3["split"]=df3["comas"].str.split(',')
Then the lengths come out as you expect:
df3["split"].str.len()
0 3
1 6
2 2
3 4
Name: split, dtype: int64
If you print the value in the first row after calling split without a separator, this is the output:
['1,2,3']
You can see that it did not actually split the string; it just created a one-element list containing the whole string, which is why str.len() returns 1.
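To see both behaviours side by side, here is a short sketch using the data from the question. The variable names no_sep and with_sep are made up for this example:

```python
import pandas as pd

df3 = pd.DataFrame({"comas": ["1,2,3", "4,5,6,7,8,9", "7,8", "9,10,11,12"]})

# No separator: splits on whitespace, so each string stays in one piece
no_sep = df3["comas"].str.split()

# Explicit separator: splits on commas
with_sep = df3["comas"].str.split(",")

print(no_sep.str.len().tolist())    # [1, 1, 1, 1]
print(with_sep.str.len().tolist())  # [3, 6, 2, 4]
```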