I have a df that a columns with some ID for companies. How can I split this ID in columns?
In this column the values can be 0(NaN) to more than 5 IDs, how to divide each one of them in separate columns?
Here is an example of the column:
0 4773300
1 NaN
2 6201501,6319400,6202300
3 8230001
4 NaN
5 4742300,4744004,4744003,7319002,4729699,475470
The division would be at each comma, I imagine an output like this:
columnA | columnB | columnC |
---|---|---|
4773300 | Nan | Nan |
NaN | Nan | Nan |
6201501 | 6319400 | 6202300 |
8230001 | Nan | Nan |
And so on depending on the number of IDs
CodePudding user response:
You can use the .str.split
method to perform this type of transformation quite readily. The trick is to pass the expand=True
parameter so your results are put into a DataFrame instead of a Series containing list objects.
>>> df
ID
0 4773300
1 NaN
2 6201501,6319400,6202300
3 8230001
4 NaN
5 4742300,4744004,4744003,7319002,4729699,475470
>>> df['ID'].str.split(',', expand=True)
0 1 2 3 4 5
0 4773300 None None None None None
1 NaN NaN NaN NaN NaN NaN
2 6201501 6319400 6202300 None None None
3 8230001 None None None None None
4 NaN NaN NaN NaN NaN NaN
5 4742300 4744004 4744003 7319002 4729699 475470
You can also clean up the output a little for better aesthetics
- replace None for NaN
- alphabetic column names (though I would opt to not do this as you'll hit errors if a given entry in the ID column has > 26 ids in it.)
- join back to original DataFrame
>>> import pandas as pd
>>> from string import ascii_uppercase
>>> (
df['ID'].str.split(',', expand=True)
.replace({None: float('nan')})
.pipe(lambda d:
d.set_axis(
pd.Series(list(ascii_uppercase))[d.columns],
axis=1
)
)
.add_prefix("column")
.join(df)
)
columnA columnB columnC columnD columnE columnF ID
0 4773300 NaN NaN NaN NaN NaN 4773300
1 NaN NaN NaN NaN NaN NaN NaN
2 6201501 6319400 6202300 NaN NaN NaN 6201501,6319400,6202300
3 8230001 NaN NaN NaN NaN NaN 8230001
4 NaN NaN NaN NaN NaN NaN NaN
5 4742300 4744004 4744003 7319002 4729699 475470 4742300,4744004,4744003,7319002,4729699,475470
CodePudding user response:
Consider each entry as a string, and parse the string to get to individual values.
from ast import literal_eval
df = pd.read_csv('sample.csv', converters={'company': literal_eval})
words = []
for items in df['company']:
for word in items:
words.append(word)
FYI, This is a good starting point. I do not know what is intended output format needed as of now, since your question is kind of incomplete.