How to split text in various columns

Time:03-08

I have this CSV file (in spreadsheet view)

The jobs start after the first *

Job Group * Job1 * Job 2 * Job3

Job Group
Administrative law judges, adjudicators, and hearing officers * Appeals Examiner* Appeals Referee* Appellate Conferee* Hearing Examiner* Hearing Officer* Justice of The Peace* Traffic Court Referee
Agricultural engineers * Agricultural Engineer* Agricultural Production Engineer* Agricultural Research Engineer* Farm Equipment Engineer* Research Agricultural Engineer
Advertising sales agents * Advertising Account Executive* Advertising Agent* Advertising Solicitor* Display Advertising Sales Representative* Inside Sales Advertising Executive* Outside Sales Advertising Executive* Radio Time Salesperson* Signs and Displays Salesperson* Yellow Pages Space Salesperson

I want to separate them into columns, but the number of jobs varies per group.

Administrative: 7 jobs
Agricultural: 5 jobs
Advertising: 9 jobs

Here is what I have tried:

df['Job Group'].str.split('*') 
   # This works but there is a leading whitespace


df['Job Group'].str.split('* ') 
   # Error: nothing to repeat at position 0

This is the actual problem. What I want is for the text to be separated into columns, but:

df[['col', 'row', 'be']] = df['Job Group'].str.split('*')
   # Error: Columns must be same length as key

How can I split the text using a star followed by a space ('* ') as the delimiter, and create a column for each of the separated pieces?

CodePudding user response:

Looks like this may help: Link. I'm not the most experienced in this area of Python, and personally I think arrays are annoying because of this. That link covers the splitting and collecting process, though. Opening and reading the file is a simple process with many easy YouTube tutorials.

Hope this helped. Again, I don't know the most about this stuff, but hopefully I covered some of your issues!

CodePudding user response:

You can use DataFrame.explode after you use str.split:

import pandas as pd

df = pd.DataFrame([
    ['Administrative law judges, adjudicators, and hearing officers * Appeals Examiner* Appeals Referee* Appellate Conferee* Hearing Examiner* Hearing Officer* Justice of The Peace* Traffic Court Referee'],
    ['Agricultural engineers * Agricultural Engineer* Agricultural Production Engineer* Agricultural Research Engineer* Farm Equipment Engineer* Research Agricultural Engineer'],
    ['Advertising sales agents * Advertising Account Executive* Advertising Agent* Advertising Solicitor* Display Advertising Sales Representative* Inside Sales Advertising Executive* Outside Sales Advertising Executive* Radio Time Salesperson* Signs and Displays Salesperson* Yellow Pages Space Salesperson']
],
    index=['administrative', 'agricultural','advertising'],
    columns=['job_group'])

df['job_group'] = df['job_group'].str.split('*')
df2 = df.explode('job_group')

print(df2)
job_group
administrative Administrative law judges, adjudicators, and h...
administrative Appeals Examiner
administrative Appeals Referee
administrative Appellate Conferee
administrative Hearing Examiner
administrative Hearing Officer
administrative Justice of The Peace
administrative Traffic Court Referee
agricultural Agricultural engineers
agricultural Agricultural Engineer
agricultural Agricultural Production Engineer
agricultural Agricultural Research Engineer
agricultural Farm Equipment Engineer
agricultural Research Agricultural Engineer
advertising Advertising sales agents
advertising Advertising Account Executive
advertising Advertising Agent
advertising Advertising Solicitor
advertising Display Advertising Sales Representative
advertising Inside Sales Advertising Executive
advertising Outside Sales Advertising Executive
advertising Radio Time Salesperson
advertising Signs and Displays Salesperson
advertising Yellow Pages Space Salesperson

Then you can use str.strip to remove the leading and trailing spaces:

df2['job_group'] = df2['job_group'].str.strip()
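
If you want one column per job (the wide layout the question asks for) rather than one row per job, here is a minimal sketch. It assumes the column is named 'Job Group' as in the question and uses two shortened rows from the sample data. The 'nothing to repeat' error most likely comes from pandas treating a multi-character pattern such as '* ' as a regular expression, where a bare * is a quantifier. Escaping the star avoids that. The output names Job1, Job2, ... are just illustrative.

```python
import pandas as pd

# Two shortened rows taken from the question's sample data (illustration only).
df = pd.DataFrame({
    "Job Group": [
        "Agricultural engineers * Agricultural Engineer* Farm Equipment Engineer",
        "Advertising sales agents * Advertising Agent* Advertising Solicitor* Radio Time Salesperson",
    ]
})

# \* matches a literal star; \s* swallows any spaces on either side of it,
# so no separate strip step is needed for the pieces in between.
# expand=True returns a DataFrame with one column per piece; rows with
# fewer jobs are padded with missing values.
parts = df["Job Group"].str.split(r"\s*\*\s*", expand=True)

# The first piece is the group name, the rest are the jobs.
parts.columns = ["Job Group"] + [f"Job{i}" for i in range(1, parts.shape[1])]

print(parts)
```

Because expand=True pads the shorter rows, this avoids the "Columns must be same length as key" error, which appears when a column of lists is assigned to a fixed set of column names. In pandas 1.4 and later, passing regex=False to str.split should also make '* ' be treated literally, though the group name would then keep its trailing space.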