Extracting text from a set and storing it in a dataframe column

Time:12-16

I would like to extract the strings from a pandas dataframe column that contains sets. The column looks like this:

0      {s}
1      {B}
2      {m}
3      {H}
4      {b}
      ... 
295    {G}
296    {N}
297    {s}
298    {v}
299    {p}
Name: letters, Length: 300, dtype: object

When I use the str function to extract the text and store it in another column, the output looks like this:

0      0      {s}\n1      {B}\n2      {m}\n3      {H}...
1      0      {s}\n1      {B}\n2      {m}\n3      {H}...
2      0      {s}\n1      {B}\n2      {m}\n3      {H}...
3      0      {s}\n1      {B}\n2      {m}\n3      {H}...
4      0      {s}\n1      {B}\n2      {m}\n3      {H}...
                             ...                        
295    0      {s}\n1      {B}\n2      {m}\n3      {H}...
296    0      {s}\n1      {B}\n2      {m}\n3      {H}...
297    0      {s}\n1      {B}\n2      {m}\n3      {H}...
298    0      {s}\n1      {B}\n2      {m}\n3      {H}...
299    0      {s}\n1      {B}\n2      {m}\n3      {H}...
Name: str_val, Length: 300, dtype: object

Can anyone kindly explain why it gets converted like this?

letters is the name of the column that holds the sets. I would like to create another column, 'comm', which should look like this:

0      s
1      B
2      m
3      H
4      b

The datatype should be string. Any help is much appreciated.

CodePudding user response:

Use a list comprehension (faster than apply) with iter and next, passing None (or any value you prefer) as the default in case you have empty sets:

df['letter'] = [next(iter(s), None) for s in df['set']]

Example:

   set letter
0  {s}      s
1  {B}      B
2  {m}      m
3  {H}      H
4  {b}      b
5   {}   None

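Used input (note that the last entry, {}, is an empty dict rather than an empty set; iterating it also yields nothing, so the default is returned):

import pandas as pd
df = pd.DataFrame({'set': [{'s'}, {'B'}, {'m'}, {'H'}, {'b'}, {}]})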
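The same approach can be applied to the column names from the question and followed by a conversion to a string dtype. This is a minimal sketch: the sample data is made up for illustration, and using astype("string") is an assumption about what "the datatype should be string" means (the list comprehension alone produces an object column).

```python
import pandas as pd

# Hypothetical sample data mirroring the question's 'letters' column.
df = pd.DataFrame({"letters": [{"s"}, {"B"}, {"m"}, {"H"}, {"b"}, set()]})

# Take the first (here: only) element of each set; None for empty sets.
df["comm"] = [next(iter(s), None) for s in df["letters"]]

# Convert to pandas' dedicated string dtype; None becomes a missing value.
df["comm"] = df["comm"].astype("string")

print(df)
print(df["comm"].dtype)
```

Because next(iter(s), None) only reads from each set, the letters column is left unchanged.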

CodePudding user response:

df["comm"] = df["letters"].apply(lambda x: x.pop())

Explanation: apply iterates through each row in the letters column, runs the given lambda function, and returns a Series composed of the values that function returns. Here the lambda pops an element out of the set found in each row. Since each row holds a set of exactly one element, .pop() will work for your use case. Be aware that .pop() removes the element from the set itself, so the sets in letters are empty afterwards, and it raises a KeyError if a set is already empty.
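One thing to keep in mind with this answer is that set.pop() modifies the set it is called on. The sketch below, on made-up sample data, compares it with a read-only alternative (next(iter(x)), borrowed from the other answer) and records the set sizes before and after.

```python
import pandas as pd

# Hypothetical sample data; each row holds a one-element set.
df = pd.DataFrame({"letters": [{"s"}, {"B"}, {"m"}]})

# Non-mutating: read the element without removing it from the set.
df["comm"] = df["letters"].apply(lambda x: next(iter(x)))
sizes_before = [len(s) for s in df["letters"]]

# Mutating: pop() removes the element from each set stored in the column.
popped = df["letters"].apply(lambda x: x.pop())
sizes_after = [len(s) for s in df["letters"]]

print(sizes_before, sizes_after)
```

If the letters column is still needed afterwards, the read-only form is the safer choice.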

CodePudding user response:

It seems you are converting the whole column into a single string (for example by calling str on the entire column) and then assigning that one string to every row. You can copy the whole column instead using:

str_val["LettersColumn"] = letters["LettersColumn"]

Of course, you should change "LettersColumn" to the names of your own columns.
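The question does not show the code that produced the repeated output, so the following is only a guess at the cause: calling the built-in str on a whole Series returns its printed form as one string, and assigning a single string to a column repeats it in every row. The sample data and the str_val and per_row column names are illustrative.

```python
import pandas as pd

# Hypothetical sample data mirroring the question's 'letters' column.
df = pd.DataFrame({"letters": [{"s"}, {"B"}, {"m"}]})

# str() on a Series returns one string: the printed form of the whole column.
whole = str(df["letters"])

# Assigning a single string broadcasts it to every row.
df["str_val"] = whole

# Element-wise alternative: convert each set to its own string.
df["per_row"] = df["letters"].astype(str)

print(df["str_val"].iloc[0])
print(df["per_row"].tolist())
```

Note that the element-wise conversion keeps the braces and quotes of each set, so extracting the element itself is still needed to get a bare letter.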
