Combining values in a tuple that contain both string and integer values and storing as a dataframe

Time:11-07

I'm building a function that takes in a list of company descriptions and outputs the most common 3-word phrases found in the list. So far it outputs a dictionary that maps each phrase (a tuple of words) to its count, like this:

{('technology', 'company', 'provides'): 2,
 ('various', 'industries.', 'company'): 2,
 ('provides', 'software', 'solutions'): 2,
 ('life', 'health', 'insurance'): 2,...}

I'd like to convert this to a table/dataframe that concatenates the strings into a single value and then creates a separate column that would store the number of instances of the phrase.

The ideal output would be:

Phrase Occurrence
technology company provides 2
various industries company 2
provides software solutions 2
life health insurance 2

I've tried the following, which combines each tuple into a string but drops the number of occurrences:

# function that converts tuple to string
def join_tuple_string(descriptions) -> str:
   return ' '.join(descriptions)

# joining all the tuples
result = map(join_tuple_string, descriptions)

# converting and printing the result
print(list(result))

Here is the output:

['technology company provides', 
'provides software solutions', 
'product suite includes', 'life health insurance',...]

How can I concatenate these values without losing the number of occurrences? I'd like to be able to export this to a CSV to review the full list.

CodePudding user response:

Given:

din = {('technology', 'company', 'provides'): 2,
 ('various', 'industries.', 'company'): 2,
 ('provides', 'software', 'solutions'): 2,
 ('life', 'health', 'insurance'): 2}  

I would proceed as follows:

def reportValues(d):
    result = []
    for ky, v in d.items():
        result.append([' '.join(ky), v])
    return result 

result = reportValues(din)
for r in result:
    print(f'{r[0]:25}\t{r[1]}')   

which produces:

technology company provides 2
various industries. company 2
provides software solutions 2
life health insurance       2
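
Since the question also asks for a CSV export, here is a minimal sketch of writing the same phrase/count pairs with the standard-library csv module. The function name write_phrases_csv and the file name phrase_counts.csv are made up for illustration; the header names follow the "ideal output" in the question.

```python
import csv

# Same sample dictionary as in the answer above.
din = {('technology', 'company', 'provides'): 2,
       ('various', 'industries.', 'company'): 2,
       ('provides', 'software', 'solutions'): 2,
       ('life', 'health', 'insurance'): 2}

def write_phrases_csv(d, path):
    """Write one row per phrase: the joined words and their count.

    Hypothetical helper, not part of the original answer.
    """
    # newline='' is what the csv module expects for files it writes to.
    with open(path, 'w', newline='', encoding='utf-8') as f:
        writer = csv.writer(f)
        writer.writerow(['Phrase', 'Occurrence'])  # header row
        for words, count in d.items():
            writer.writerow([' '.join(words), count])

# File name is an assumption; use whatever path suits you.
write_phrases_csv(din, 'phrase_counts.csv')
```

Iterating over d.items() rather than over the dictionary itself is what keeps the counts: iterating a dict directly yields only its keys, which is likely why the map() attempt in the question lost them.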

CodePudding user response:

import pandas as pd

result = {('technology', 'company', 'provides'): 2,
 ('various', 'industries.', 'company'): 2,
 ('provides', 'software', 'solutions'): 2,
 ('life', 'health', 'insurance'): 2}

df = pd.DataFrame(result.items(), columns=['phrase', 'occurrence'])
df.phrase = df.phrase.str.join(' ')
print(df)
df.to_csv('phrases.csv', index=False)

The df output:

                        phrase  occurrence
0  technology company provides           2
1  various industries. company           2
2  provides software solutions           2
3        life health insurance           2

The CSV file:

phrase,occurrence
technology company provides,2
various industries. company,2
provides software solutions,2
life health insurance,2
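
Note that the output still contains "industries." with its trailing period, while the ideal output in the question shows "various industries company". If that matters, one possible approach is to strip punctuation from each word before joining and to merge counts for phrases that become identical afterwards. This is only a sketch: the helper name clean_phrase is hypothetical, and stripping only leading and trailing punctuation is an assumption about what "clean" should mean here.

```python
import string
from collections import Counter

result = {('technology', 'company', 'provides'): 2,
          ('various', 'industries.', 'company'): 2,
          ('provides', 'software', 'solutions'): 2,
          ('life', 'health', 'insurance'): 2}

def clean_phrase(words):
    """Join words after stripping leading/trailing punctuation from each.

    Hypothetical helper; punctuation inside a word is left alone.
    """
    return ' '.join(w.strip(string.punctuation) for w in words)

# Merge counts, since e.g. 'industries.' and 'industries' would
# collapse to the same phrase after cleaning.
merged = Counter()
for words, count in result.items():
    merged[clean_phrase(words)] += count

# most_common() returns (phrase, count) pairs, highest count first.
rows = merged.most_common()
for phrase, count in rows:
    print(f'{phrase}\t{count}')
```

The resulting list of (phrase, count) pairs can be passed to pd.DataFrame(rows, columns=['phrase', 'occurrence']) and exported with to_csv as in the answer above.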