Convert Multiple Python Lines to a Concurrent DataFrame and Merge with Source Data

I apologize if this is a rudimentary question. I feel like it should be easy, but I cannot figure it out. The code below looks at two columns in a CSV file and matches up job titles that have a similarity of at least 0.7, using difflib.get_close_matches. However, the output is a series of separate single-line lists, and whenever I try to convert it to a DataFrame, every line becomes its own DataFrame and I cannot figure out how to merge/concat them. All code, as well as the current and desired outputs, is below. Any help would be much appreciated.

Current Code is:

import pandas as pd
import difflib

df = pd.read_csv('name.csv')

aLists = list(df['JTs'])

bLists = list(df['JT'])

n=3
cutoff = 0.7

for aList in aLists:
    best = difflib.get_close_matches(aList, bLists, n, cutoff)
    print(best)

Current Output is:

['SW Engineer']
['Manu Engineer']
[]
['IT Help']

Desired Output is:

          Output
0    SW Engineer
1  Manu Engineer
2        (blank)
3        IT Help

The table I am attempting to do this on is:

Any help would be greatly appreciated!

CodePudding user response:

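Here is a simple way to achieve this. Each list of matches is first converted to a string, then the leading and trailing brackets are stripped from that string, and the result is appended to a list defined outside the loop. Note that the quotes around each match remain, as the output shows.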

import pandas as pd
import difflib

df = pd.read_csv('name.csv')

aLists = list(df['JTs'])

bLists = list(df['JT'])

n = 3
cutoff = 0.7

best = []

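for aList in aLists:
    temp = difflib.get_close_matches(aList, bLists, n, cutoff)
    # str() turns the list into text such as "['SW Engineer']"
    temp = str(temp)
    # Strip the surrounding brackets; the quotes around each match remain
    strippedString = temp.lstrip("[").rstrip("]")
    best.append(strippedString)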

print(best)

Output

[ 
  "'SW Engineer'", 
  "'Manu Engineer'", 
  '', 
  "'IT Help'"
]
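
If the quotes in that output are unwanted, one option is to join each row's matches into a single string instead of calling str() on the list. The following is a minimal sketch, not the answerer's code: the inline job titles stand in for name.csv (they are copied from the table shown later in this thread), and the helper name matches_as_text is made up for illustration.

```python
import difflib

import pandas as pd

# Assumption: sample values standing in for the 'JTs' and 'JT' columns of name.csv
source_titles = ["Software Engineer", "Manufacturing Engineer",
                 "Human Resource Manager", "IT Help Desk"]
candidate_titles = ["Manu Engineer", "SW Engineer", "IT Help", "f"]


def matches_as_text(title, candidates, n=3, cutoff=0.7):
    """Hypothetical helper: join the close matches for one title into a string."""
    matches = difflib.get_close_matches(title, candidates, n, cutoff)
    # An empty match list joins to '', which keeps one entry per source row
    return ", ".join(matches)


# Build the whole column first, then create a single DataFrame from it
best = [matches_as_text(title, candidate_titles) for title in source_titles]
df_output = pd.DataFrame({"Output": best})
print(df_output)
```

Because every source row contributes exactly one string, the resulting DataFrame has the same number of rows as the input, and a row with several matches shows them comma-separated.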

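Here is another, better way to achieve this. You can use numpy to concatenate the per-row match lists into a single array, and then convert it to a plain Python list if you want. Note that rows with no match contribute nothing, so the result has three entries rather than four and no longer lines up with the source rows.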

import pandas as pd
import difflib
import numpy as np

df = pd.read_csv('name.csv')
aLists = list(df['JTs'])
bLists = list(df['JT'])
n = 3
cutoff = 0.7
best = []
for aList in aLists:
    temp = difflib.get_close_matches(aList, bLists, n, cutoff)
    best.append(temp)

# print(best)
# Use concatenate() to join the per-row lists into one array
combinedNumpyArray = np.concatenate(best)
# Convert the numpy array to a plain Python list
normalArray = combinedNumpyArray.tolist()
print(normalArray)

Output

['SW Engineer', 'Manu Engineer', 'IT Help']
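
Because concatenating drops the empty lists, the flattened result cannot be lined up with the source rows any more. As a rough sketch of one alternative (an assumption on my part, not part of the answer above), pandas' Series.explode() flattens a Series of lists while keeping the original row index, and turns an empty list into NaN. The match lists below are hard-coded from the question's current output so the snippet runs without name.csv.

```python
import pandas as pd

# Per-row results of difflib.get_close_matches, copied from the question's output
best = [["SW Engineer"], ["Manu Engineer"], [], ["IT Help"]]

# explode() emits one row per list element and repeats the source index;
# an empty list becomes a single NaN row instead of disappearing
flat = pd.Series(best, name="Output").explode()

# A row with several matches would appear several times under the same index
df_output = flat.to_frame()
print(df_output)
```

Since the index still refers to the source rows, the result could be joined back onto the original DataFrame if needed.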

Thanks

CodePudding user response:

You could use pandas' .apply() to run your function on each entry. The result can then either be added as a new column or used to create a new DataFrame.

For example:

import pandas as pd
import difflib


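def get_best_match(word):
    # JT, n and cutoff are module-level names defined below, before the function is called
    matches = difflib.get_close_matches(word, JT, n, cutoff)
    # Keep only the closest match; None marks rows with no match above the cutoff
    return matches[0] if matches else None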

df = pd.read_csv('name.csv')
JT = df['JT']

n = 3
cutoff = 0.7

df['Output'] = df['JTs'].apply(get_best_match)

Or for a new dataframe:

df_output = pd.DataFrame({'Output' : df['JTs'].apply(get_best_match)})

Giving you:

                      JTs             JT         Output
0       Software Engineer  Manu Engineer    SW Engineer
1  Manufacturing Engineer    SW Engineer  Manu Engineer
2  Human Resource Manager        IT Help           None
3            IT Help Desk              f        IT Help

Or:

          Output
0    SW Engineer
1  Manu Engineer
2           None
3        IT Help
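
To tie this back to the question's desired output, here is a self-contained sketch of the same idea. It is only an illustration under a few assumptions: the DataFrame is built inline from the table above instead of reading name.csv, the settings are bound with functools.partial rather than read from module-level names, and fillna('') is used to show a blank instead of None for unmatched rows.

```python
import difflib
from functools import partial

import pandas as pd

# Assumption: inline copy of the table above, standing in for name.csv
df = pd.DataFrame({
    "JTs": ["Software Engineer", "Manufacturing Engineer",
            "Human Resource Manager", "IT Help Desk"],
    "JT": ["Manu Engineer", "SW Engineer", "IT Help", "f"],
})


def get_best_match(word, candidates, n=3, cutoff=0.7):
    """Return the closest candidate for word, or None if none reaches cutoff."""
    matches = difflib.get_close_matches(word, candidates, n, cutoff)
    return matches[0] if matches else None


# Bind the candidate list once so apply() only has to supply each word
matcher = partial(get_best_match, candidates=df["JT"].tolist())

# The result shares df's index, so assigning it merges it with the source rows
df["Output"] = df["JTs"].apply(matcher).fillna("")
print(df)
```

Passing the candidates explicitly keeps the helper independent of the order in which names are defined in the script.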