I have a dataframe like as shown below
ID,Name,year,output
1,Test Level,2021,1
2,Test Lvele,2022,1
2,dummy Inc,2022,1
2,dummy Pvt Inc,2022,1
3,dasho Ltd,2022,1
4,dasho PVT Ltd,2021,0
5,delphi Ltd,2021,1
6,delphi pvt ltd,2021,1
df = pd.read_clipboard(sep=',')
My objective is
a) To replace near duplicate strings using a common string.
For example - let's pick couple of strings from Name
column. We have dummy Inc
and dummy Pvt Inc
. These both have to be replaced as dummy
I manually prepared a mapping df map_df
like as below (but can't do this for big data)
Name,correct_name
Test Level,Test
Test Lvele,Test
dummy Inc,dummy
dummy Pvt Inc,dummy
dasho Ltd,dasho
dasho PVT Ltd,dasho
delphi Ltd,delphi
delphi pvt ltd,delphi
So, I tried the below
map_df = map_df.set_index(Name)
df['Name'] = df['Name'].map(map_df) # but this doesn't work and throws error
Is creating mapping table the only way or is there any NLP based approach?
I expect my output to be like as below
ID,Name,year,output
1,Test,2021,1
2,Test,2022,1
2,dummy,2022,1
2,dummy,2022,1
3,dasho,2022,1
4,dasho,2021,0
5,delphi,2021,1
6,delphi,2021,1
CodePudding user response:
Hope this helps!
df['Name'] = [name.split()[0] for name in df['Name']]
CodePudding user response:
I suggest using a dict
instead of a pandas.DataFrame
for map_df
.
ID,Name,year,output
1,Test Level,2021,1
2,Test Lvele,2022,1
2,dummy Inc,2022,1
2,dummy Pvt Inc,2022,1
3,dasho Ltd,2022,1
4,dasho PVT Ltd,2021,0
5,delphi Ltd,2021,1
6,delphi pvt ltd,2021,1
df = pd.read_clipboard(sep=',')
map_dict = dict(s.strip().split(',') for s in ''' Test Level,Test
Test Lvele,Test
dummy Inc,dummy
dummy Pvt Inc,dummy
dasho Ltd,dasho
dasho PVT Ltd,dasho
delphi Ltd,delphi
delphi pvt ltd,delphi'''.split('\n'))
df['Name'] = df['Name'].map(map_dict.get)
Results:
df.to_clipboard(sep=',')
,ID,Name,year,output
0,1,Test,2021,1
1,2,Test,2022,1
2,2,dummy,2022,1
3,2,dummy,2022,1
4,3,dasho,2022,1
5,4,dasho,2021,0
6,5,delphi,2021,1
7,6,delphi,2021,1