How to split text in various columns-CodePudding

I have this CSV file (in spreadsheet view)

Jobs starts after the first *

Job Group * Job1 * Job 2 * Job3

Job Group
Administrative law judges, adjudicators, and hearing officers * Appeals Examiner* Appeals Referee* Appellate Conferee* Hearing Examiner* Hearing Officer* Justice of The Peace* Traffic Court Referee
Agricultural engineers * Agricultural Engineer* Agricultural Production Engineer* Agricultural Research Engineer* Farm Equipment Engineer* Research Agricultural Engineer
Advertising sales agents * Advertising Account Executive* Advertising Agent* Advertising Solicitor* Display Advertising Sales Representative* Inside Sales Advertising Executive* Outside Sales Advertising Executive* Radio Time Salesperson* Signs and Displays Salesperson* Yellow Pages Space Salesperson

I want to separate them into columns, but the number of job varies per group.

Administrative: 7 jobs
Agricultural: 5 jobs
Advertising: 9 jobs

Here are what I have tried:

df['Job Group'].str.split('*') 
   # This works but there is a leading whitespace


df['Job Group'].str.split('* ') 
   # Error: nothing to repeat at position 0

This is the problem, this is what I want where the text would be separated into columns, but:

df[['col', 'row', 'be']] = df['Job Group'].str.split('*')
   # Error: Columns must be same length as key

How can I separate the text where I would use star and space as a delimiter '* ', and create a column for those separated text?

CodePudding user response：

Looks like this may help: Link I'm not the most experienced in this area of python, and personally think arrays are annoying because of this. This is the splitting process, and collecting it though. To open and read the file is a simple process with many easy YT tutorials.

Hope this helped, again I don't know the most about this stuff but hopefully I hit some issues for you!

CodePudding user response：

You can use DataFrame.explode after you use str.split:

import pandas as pd

df = pd.DataFrame([
    ['Administrative law judges, adjudicators, and hearing officers * Appeals Examiner* Appeals Referee* Appellate Conferee* Hearing Examiner* Hearing Officer* Justice of The Peace* Traffic Court Referee'],
    ['Agricultural engineers * Agricultural Engineer* Agricultural Production Engineer* Agricultural Research Engineer* Farm Equipment Engineer* Research Agricultural Engineer'],
    ['Advertising sales agents * Advertising Account Executive* Advertising Agent* Advertising Solicitor* Display Advertising Sales Representative* Inside Sales Advertising Executive* Outside Sales Advertising Executive* Radio Time Salesperson* Signs and Displays Salesperson* Yellow Pages Space Salesperson']
],
    index=['administrative', 'agricultural','advertising'],
    columns=['job_group'])

df['job_group'] = df['job_group'].str.split('*')
df2 = df.explode('job_group')

print(df2)

	job_group
administrative	Administrative law judges, adjudicators, and h...
administrative	Appeals Examiner
administrative	Appeals Referee
administrative	Appellate Conferee
administrative	Hearing Examiner
administrative	Hearing Officer
administrative	Justice of The Peace
administrative	Traffic Court Referee
agricultural	Agricultural engineers
agricultural	Agricultural Engineer
agricultural	Agricultural Production Engineer
agricultural	Agricultural Research Engineer
agricultural	Farm Equipment Engineer
agricultural	Research Agricultural Engineer
advertising	Advertising sales agents
advertising	Advertising Account Executive
advertising	Advertising Agent
advertising	Advertising Solicitor
advertising	Display Advertising Sales Representative
advertising	Inside Sales Advertising Executive
advertising	Outside Sales Advertising Executive
advertising	Radio Time Salesperson
advertising	Signs and Displays Salesperson
advertising	Yellow Pages Space Salesperson

Then you can use str.split to remove leading spaces:

df2['job_group'] = df2['job_goup'].str.strip()