Splitting a pandas dataframe values in a columns?-CodePudding

I have a df that a columns with some ID for companies. How can I split this ID in columns?

In this column the values can be 0(NaN) to more than 5 IDs, how to divide each one of them in separate columns?

Here is an example of the column:

0                                                  4773300
1                                                      NaN
2                                  6201501,6319400,6202300
3                                                  8230001
4                                                      NaN
5           4742300,4744004,4744003,7319002,4729699,475470

The division would be at each comma, I imagine an output like this:

columnA	columnB	columnC
4773300	Nan	Nan
NaN	Nan	Nan
6201501	6319400	6202300
8230001	Nan	Nan

And so on depending on the number of IDs

CodePudding user response：

You can use the .str.split method to perform this type of transformation quite readily. The trick is to pass the expand=True parameter so your results are put into a DataFrame instead of a Series containing list objects.

>>> df
                                               ID
0                                         4773300
1                                             NaN
2                         6201501,6319400,6202300
3                                         8230001
4                                             NaN
5  4742300,4744004,4744003,7319002,4729699,475470


>>> df['ID'].str.split(',', expand=True)
         0        1        2        3        4       5
0  4773300     None     None     None     None    None
1      NaN      NaN      NaN      NaN      NaN     NaN
2  6201501  6319400  6202300     None     None    None
3  8230001     None     None     None     None    None
4      NaN      NaN      NaN      NaN      NaN     NaN
5  4742300  4744004  4744003  7319002  4729699  475470

You can also clean up the output a little for better aesthetics

replace None for NaN
alphabetic column names (though I would opt to not do this as you'll hit errors if a given entry in the ID column has > 26 ids in it.)
join back to original DataFrame

>>> import pandas as pd
>>> from string import ascii_uppercase
>>> (
    df['ID'].str.split(',', expand=True)
    .replace({None: float('nan')})
    .pipe(lambda d: 
        d.set_axis(
            pd.Series(list(ascii_uppercase))[d.columns], 
            axis=1
        )
    )
    .add_prefix("column")
    .join(df)
)

   columnA  columnB  columnC  columnD  columnE columnF                                              ID
0  4773300      NaN      NaN      NaN      NaN     NaN                                         4773300
1      NaN      NaN      NaN      NaN      NaN     NaN                                             NaN
2  6201501  6319400  6202300      NaN      NaN     NaN                         6201501,6319400,6202300
3  8230001      NaN      NaN      NaN      NaN     NaN                                         8230001
4      NaN      NaN      NaN      NaN      NaN     NaN                                             NaN
5  4742300  4744004  4744003  7319002  4729699  475470  4742300,4744004,4744003,7319002,4729699,475470

CodePudding user response：

Consider each entry as a string, and parse the string to get to individual values.

from ast import literal_eval
df = pd.read_csv('sample.csv', converters={'company': literal_eval})

words = []
for items in df['company']:
    for word in items:
        words.append(word)

FYI, This is a good starting point. I do not know what is intended output format needed as of now, since your question is kind of incomplete.