Split column/string after specific words in a list (Python)

Is it possible to split a string in a DataFrame column based on a word from a list? For example, I have a DataFrame with a column 'Company' that contains a company name, a legal form, and sometimes something after the legal form, such as "electronics".

Company
XYZ ltd electronics
ABC AB inc iron
AB XY Z inc
CD EF GHI JK llc chicago

I also have a list of 1500 worldwide legal forms of companies (inc, ltd, ...). Is it possible to split the strings in the column like this, based on the strings in the list? In other words, separate the legal form and everything after it into new columns.

Company	Legal form	Addition
XYZ	ltd	electronics
ABC AB	inc	iron
AB XY Z	inc
CD EF GHI JK	llc	chicago

or at least

Company	Addition
XYZ	ltd electronics
ABC AB	inc iron
AB XY Z	inc
CD EF GHI JK	llc chicago

I look forward to your help!

CodePudding user response：

Assuming you just want to split the string on spaces, you could try something like this:

df['Addition'] = df['Company'].str.split()
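
Note that str.split() on its own only yields a list of tokens per row and never consults the list of legal forms. As a rough sketch of how the two could be combined, the tokens can be checked against a set of legal forms, cutting at the first hit. The set legal_forms and the helper split_at_legal_form below are hypothetical names, the three forms are a stand-in for the full list, and the approach assumes every legal form is a single word:

```python
import pandas as pd

# Hypothetical stand-in for the full list of legal forms (single words only).
legal_forms = {'ltd', 'inc', 'llc'}

def split_at_legal_form(name):
    """Return (company, legal form, addition), cut at the first known legal form."""
    tokens = name.split()
    for i, token in enumerate(tokens):
        if token.lower() in legal_forms:
            return ' '.join(tokens[:i]), token, ' '.join(tokens[i + 1:])
    # No legal form found: keep the whole name and leave the rest empty.
    return name, '', ''

df = pd.DataFrame({'Company': ['XYZ ltd electronics', 'ABC AB inc iron',
                               'AB XY Z inc', 'CD EF GHI JK llc chicago']})
parts = pd.DataFrame(df['Company'].map(split_at_legal_form).tolist(),
                     columns=['Company', 'Legal form', 'Addition'],
                     index=df.index)
print(parts)
```

A set lookup stays fast with 1500 entries, but multi-word legal forms would need a different approach, such as a regular expression.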

CodePudding user response：

I think you can use the jieba package.

CodePudding user response：

You could use a regular expression to split on the legal forms. Try this:

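import re

import pandas as pd

df = pd.DataFrame({'Company': ['XYZ ltd electronics', 'ABC AB inc iron', 'AB XY Z inc', 'CD EF GHI JK llc chicago']}, columns=['Company'])
# Split once, at the first whole-word legal form; the capture group keeps the legal form in the result.
# \b stops 'inc' from matching inside a longer word, and maxsplit=1 guarantees at most three parts.
df['Coy'] = df['Company'].apply(lambda x: re.split(r'\b(ltd|inc|llc)\b', x, maxsplit=1))
print(df)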

This creates a list for each company name, split at the legal form:

                    Company                             Coy
0       XYZ ltd electronics       [XYZ , ltd,  electronics]
1           ABC AB inc iron           [ABC AB , inc,  iron]
2               AB XY Z inc               [AB XY Z , inc, ]
3  CD EF GHI JK llc chicago  [CD EF GHI JK , llc,  chicago]
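
The pattern in this answer hard-codes three legal forms. With a list of 1500 entries, the alternation could instead be generated from the list. The following is a minimal sketch under some assumptions: the list legal_forms is a hypothetical sample, matching is made case-insensitive, and a legal form is only accepted when it is surrounded by whitespace or the ends of the string.

```python
import re

# Hypothetical sample; in practice this would be the full list of legal forms.
legal_forms = ['inc', 'ltd', 'llc', 'l.l.c.', 'co ltd']

# re.escape protects characters such as '.', and sorting longest first is a
# precaution so a longer form is tried before a shorter one that is its prefix.
alternation = '|'.join(re.escape(form)
                       for form in sorted(legal_forms, key=len, reverse=True))

# (?<!\S) and (?!\S) require whitespace or a string boundary on both sides.
pattern = re.compile(rf'(?<!\S)({alternation})(?!\S)', flags=re.IGNORECASE)

parts = pattern.split('ACME Co Ltd electronics', maxsplit=1)
print(parts)
```

The compiled pattern can then be used in place of the literal one, for example df['Company'].apply(lambda x: pattern.split(x, maxsplit=1)).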

After that you can split them into 3 separate columns:

df1 = pd.DataFrame(df['Coy'].tolist(), columns=['Company', 'Legal form', 'Addition'])
print(df1)

Output:

         Company Legal form      Addition
0           XYZ         ltd   electronics
1        ABC AB         inc          iron
2       AB XY Z         inc              
3  CD EF GHI JK         llc       chicago
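
As the output shows, the pieces still carry the spaces that surrounded the legal form. One possible alternative, sketched here rather than taken from the answer above, is pandas' str.extract with named groups, which returns the three columns in one step and leaves the surrounding whitespace out of the groups. The group name Legal_form is an assumption made because group names cannot contain spaces, so it is renamed afterwards.

```python
import pandas as pd

df = pd.DataFrame({'Company': ['XYZ ltd electronics', 'ABC AB inc iron',
                               'AB XY Z inc', 'CD EF GHI JK llc chicago']})

# Lazy name, then whitespace, then one whole-word legal form, then the rest.
# The same three forms as above are used; a longer list would extend the alternation.
pattern = r'^(?P<Company>.*?)\s+(?P<Legal_form>ltd|inc|llc)\b\s*(?P<Addition>.*)$'

result = df['Company'].str.extract(pattern)
result = result.rename(columns={'Legal_form': 'Legal form'})
print(result)
```

Rows in which no legal form is found come back as NaN in all three columns, so they are easy to spot and handle separately.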