Home > other >  Splitting a pandas dataframe values in a columns?
Splitting a pandas dataframe values in a columns?

Time:03-21

I have a df that a columns with some ID for companies. How can I split this ID in columns?

In this column the values can be 0(NaN) to more than 5 IDs, how to divide each one of them in separate columns?

Here is an example of the column:

0                                                  4773300
1                                                      NaN
2                                  6201501,6319400,6202300
3                                                  8230001
4                                                      NaN
5           4742300,4744004,4744003,7319002,4729699,475470

The division would be at each comma, I imagine an output like this:

columnA columnB columnC
4773300 Nan Nan
NaN Nan Nan
6201501 6319400 6202300
8230001 Nan Nan

And so on depending on the number of IDs

CodePudding user response:

You can use the .str.split method to perform this type of transformation quite readily. The trick is to pass the expand=True parameter so your results are put into a DataFrame instead of a Series containing list objects.

>>> df
                                               ID
0                                         4773300
1                                             NaN
2                         6201501,6319400,6202300
3                                         8230001
4                                             NaN
5  4742300,4744004,4744003,7319002,4729699,475470


>>> df['ID'].str.split(',', expand=True)
         0        1        2        3        4       5
0  4773300     None     None     None     None    None
1      NaN      NaN      NaN      NaN      NaN     NaN
2  6201501  6319400  6202300     None     None    None
3  8230001     None     None     None     None    None
4      NaN      NaN      NaN      NaN      NaN     NaN
5  4742300  4744004  4744003  7319002  4729699  475470

You can also clean up the output a little for better aesthetics

  • replace None for NaN
  • alphabetic column names (though I would opt to not do this as you'll hit errors if a given entry in the ID column has > 26 ids in it.)
  • join back to original DataFrame
>>> import pandas as pd
>>> from string import ascii_uppercase
>>> (
    df['ID'].str.split(',', expand=True)
    .replace({None: float('nan')})
    .pipe(lambda d: 
        d.set_axis(
            pd.Series(list(ascii_uppercase))[d.columns], 
            axis=1
        )
    )
    .add_prefix("column")
    .join(df)
)

   columnA  columnB  columnC  columnD  columnE columnF                                              ID
0  4773300      NaN      NaN      NaN      NaN     NaN                                         4773300
1      NaN      NaN      NaN      NaN      NaN     NaN                                             NaN
2  6201501  6319400  6202300      NaN      NaN     NaN                         6201501,6319400,6202300
3  8230001      NaN      NaN      NaN      NaN     NaN                                         8230001
4      NaN      NaN      NaN      NaN      NaN     NaN                                             NaN
5  4742300  4744004  4744003  7319002  4729699  475470  4742300,4744004,4744003,7319002,4729699,475470

CodePudding user response:

Consider each entry as a string, and parse the string to get to individual values.

from ast import literal_eval
df = pd.read_csv('sample.csv', converters={'company': literal_eval})

words = []
for items in df['company']:
    for word in items:
        words.append(word)

FYI, This is a good starting point. I do not know what is intended output format needed as of now, since your question is kind of incomplete.

  • Related