I have this CSV file (in spreadsheet view)
Jobs starts after the first *
Job Group * Job1 * Job 2 * Job3
Job Group |
---|
Administrative law judges, adjudicators, and hearing officers * Appeals Examiner* Appeals Referee* Appellate Conferee* Hearing Examiner* Hearing Officer* Justice of The Peace* Traffic Court Referee |
Agricultural engineers * Agricultural Engineer* Agricultural Production Engineer* Agricultural Research Engineer* Farm Equipment Engineer* Research Agricultural Engineer |
Advertising sales agents * Advertising Account Executive* Advertising Agent* Advertising Solicitor* Display Advertising Sales Representative* Inside Sales Advertising Executive* Outside Sales Advertising Executive* Radio Time Salesperson* Signs and Displays Salesperson* Yellow Pages Space Salesperson |
I want to separate them into columns, but the number of job varies per group.
Administrative: 7 jobs
Agricultural: 5 jobs
Advertising: 9 jobs
Here are what I have tried:
df['Job Group'].str.split('*')
# This works but there is a leading whitespace
df['Job Group'].str.split('* ')
# Error: nothing to repeat at position 0
This is the problem, this is what I want where the text would be separated into columns, but:
df[['col', 'row', 'be']] = df['Job Group'].str.split('*')
# Error: Columns must be same length as key
How can I separate the text where I would use star and space as a delimiter '* '
, and create a column for those separated text?
CodePudding user response:
Looks like this may help: Link I'm not the most experienced in this area of python, and personally think arrays are annoying because of this. This is the splitting process, and collecting it though. To open and read the file is a simple process with many easy YT tutorials.
Hope this helped, again I don't know the most about this stuff but hopefully I hit some issues for you!
CodePudding user response:
You can use DataFrame.explode
after you use str.split
:
import pandas as pd
df = pd.DataFrame([
['Administrative law judges, adjudicators, and hearing officers * Appeals Examiner* Appeals Referee* Appellate Conferee* Hearing Examiner* Hearing Officer* Justice of The Peace* Traffic Court Referee'],
['Agricultural engineers * Agricultural Engineer* Agricultural Production Engineer* Agricultural Research Engineer* Farm Equipment Engineer* Research Agricultural Engineer'],
['Advertising sales agents * Advertising Account Executive* Advertising Agent* Advertising Solicitor* Display Advertising Sales Representative* Inside Sales Advertising Executive* Outside Sales Advertising Executive* Radio Time Salesperson* Signs and Displays Salesperson* Yellow Pages Space Salesperson']
],
index=['administrative', 'agricultural','advertising'],
columns=['job_group'])
df['job_group'] = df['job_group'].str.split('*')
df2 = df.explode('job_group')
print(df2)
job_group | |
---|---|
administrative | Administrative law judges, adjudicators, and h... |
administrative | Appeals Examiner |
administrative | Appeals Referee |
administrative | Appellate Conferee |
administrative | Hearing Examiner |
administrative | Hearing Officer |
administrative | Justice of The Peace |
administrative | Traffic Court Referee |
agricultural | Agricultural engineers |
agricultural | Agricultural Engineer |
agricultural | Agricultural Production Engineer |
agricultural | Agricultural Research Engineer |
agricultural | Farm Equipment Engineer |
agricultural | Research Agricultural Engineer |
advertising | Advertising sales agents |
advertising | Advertising Account Executive |
advertising | Advertising Agent |
advertising | Advertising Solicitor |
advertising | Display Advertising Sales Representative |
advertising | Inside Sales Advertising Executive |
advertising | Outside Sales Advertising Executive |
advertising | Radio Time Salesperson |
advertising | Signs and Displays Salesperson |
advertising | Yellow Pages Space Salesperson |
Then you can use str.split
to remove leading spaces:
df2['job_group'] = df2['job_goup'].str.strip()