Is it possible to split a string in a DataFrame column based on a word in a list? For example, I have a DataFrame with a column 'Company' that contains a company name, a legal form, and sometimes something after the legal form, such as "electronics".
Company |
---|
XYZ ltd electronics |
ABC AB inc iron |
AB XY Z inc |
CD EF GHI JK llc chicago |
I also have a list of 1500 legal forms of companies from around the world (inc, ltd, ...). Is it possible to split the strings in the column like this, based on the strings in the list? In other words, separate the legal form and everything after it into new columns.
Company | Legal form | Addition |
---|---|---|
XYZ | ltd | electronics |
ABC AB | inc | iron |
AB XY Z | inc | |
CD EF GHI JK | llc | chicago |
or at least
Company | Addition |
---|---|
XYZ | ltd electronics |
ABC AB | inc iron |
AB XY Z | inc |
CD EF GHI JK | llc chicago |
I look forward to your help!
CodePudding user response:
Assuming you just want to split the string on spaces, you could try something like this:
df['Addition'] = df['Company'].str.split()
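Note that `str.split()` with no arguments splits on every run of whitespace, so each row becomes a list of all its words rather than a name/addition pair. Here is a minimal sketch of how those words could be regrouped around the first one found in the legal-form list. The `legal_forms` set and the `split_at_legal_form` helper are made up for illustration, and matching is assumed to be case-insensitive on single words:

```python
import pandas as pd

# Hypothetical stand-in for the full list of ~1500 legal forms.
legal_forms = {"ltd", "inc", "llc"}

def split_at_legal_form(name):
    """Split a company string at the first word found in legal_forms."""
    tokens = name.split()
    for i, token in enumerate(tokens):
        if token.lower() in legal_forms:
            return pd.Series({
                "Company": " ".join(tokens[:i]),
                "Legal form": token,
                "Addition": " ".join(tokens[i + 1:]),
            })
    # No legal form found: keep the whole string as the company name.
    return pd.Series({"Company": name, "Legal form": "", "Addition": ""})

df = pd.DataFrame({"Company": ["XYZ ltd electronics", "ABC AB inc iron",
                               "AB XY Z inc", "CD EF GHI JK llc chicago"]})

# Returning a Series from the helper makes apply() build a DataFrame.
result = df["Company"].apply(split_at_legal_form)
print(result)
```

A set lookup like this only handles single-word legal forms; forms made of several words would need a different approach, such as a regular expression.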
CodePudding user response:
I think you can use the jieba package.
CodePudding user response:
You could use a regular expression to pick out the legal forms. Try this:
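import pandas as pd
import re
df = pd.DataFrame({'Company': ['XYZ ltd electronics', 'ABC AB inc iron', 'AB XY Z inc', 'CD EF GHI JK llc chicago']}, columns=['Company'])
# \b stops "inc" from matching inside a longer word; maxsplit=1 splits only at the first legal form.
# The capturing group keeps the legal form itself in the resulting list.
df['Coy'] = df['Company'].apply(lambda x: re.split(r'\b(ltd|inc|llc)\b', x, maxsplit=1))
print(df)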
This creates a list for each company name, split at the legal form:
Company Coy
0 XYZ ltd electronics [XYZ , ltd, electronics]
1 ABC AB inc iron [ABC AB , inc, iron]
2 AB XY Z inc [AB XY Z , inc, ]
3 CD EF GHI JK llc chicago [CD EF GHI JK , llc, chicago]
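The legal form appears in each list because the pattern is wrapped in a capturing group: `re.split` keeps the text matched by a group and drops the separator otherwise. A small sketch of the difference, with the `text` sample chosen only for illustration:

```python
import re

text = "ABC AB inc iron"

# Without a group, the matched separator is dropped from the result.
without_group = re.split(r"ltd|inc|llc", text)

# With a capturing group, the matched legal form is kept as its own item.
with_group = re.split(r"(ltd|inc|llc)", text)

print(without_group)  # ['ABC AB ', ' iron']
print(with_group)     # ['ABC AB ', 'inc', ' iron']
```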
After that you can split them into 3 separate columns:
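# Each list holds three items: name, legal form, addition.
df1 = pd.DataFrame(df['Coy'].tolist(), columns=['Company', 'Legal form', 'Addition'])
# Strip the spaces left over around the legal form.
df1 = df1.apply(lambda col: col.str.strip())
print(df1)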
Output:
Company Legal form Addition
0 XYZ ltd electronics
1 ABC AB inc iron
2 AB XY Z inc
3 CD EF GHI JK llc chicago
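The pattern above hard-codes three legal forms. One way to extend it to the full list is sketched below: build the alternation from the list and let `Series.str.extract` fill the three columns in one step. The `legal_forms` list, the named groups, and the extra "Muster" row are assumptions made for illustration, as is the choice to match case-insensitively:

```python
import re

import pandas as pd

# Hypothetical stand-in for the full list of ~1500 legal forms.
legal_forms = ["ltd", "inc", "llc", "gmbh", "gmbh & co kg"]

# Escape each form (some contain dots or other regex characters) and put
# longer forms first so "gmbh & co kg" is tried before its prefix "gmbh".
alternation = "|".join(
    re.escape(form) for form in sorted(legal_forms, key=len, reverse=True)
)
pattern = (
    r"^(?P<Company>.*?)\s*\b(?P<Legal_form>" + alternation + r")\b\s*(?P<Addition>.*)$"
)

df = pd.DataFrame({"Company": ["XYZ ltd electronics", "ABC AB inc iron",
                               "AB XY Z inc", "CD EF GHI JK llc chicago",
                               "Muster gmbh & co kg berlin"]})

# str.extract returns one column per named group; rows without a match get NaN.
parts = df["Company"].str.extract(pattern, flags=re.IGNORECASE)
parts = parts.rename(columns={"Legal_form": "Legal form"})
print(parts)
```

Because the first group is lazy, the split happens at the first legal form in the string, and the `\s*` parts keep surrounding spaces out of the columns.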