Home > database >  Find string in a dataframe from a list in another dataframe
Find string in a dataframe from a list in another dataframe

Time:10-20

I have 2 pandas dataframes in python which are set up as folows:

Dataframe 1:
ID      Paragraph
1      'A B C D E'
2      'A F G H L'
3      'B J P Q W'
4      'G F D S A'

Where Paragraph is a string of multiple words.

Dataframe 2: 
ID      Name        Words
1      First      ['A', 'F']
2      Second     ['B', 'Z']
3      Thrird     ['P', 'Q']
4      Fourth     ['H', 'J']

Name is just a string identifying the Words. And Words is a list of strings.

So what I want to do is have an expression that will identify which Paragraphs in Dataframe 1 contain Words from Dataframe 2. And I want to store the Name of the Words in a new column in Dataframe 1. The new column will contain a list of all the Names where a Word from Words occurred in the paragraph. The order does not matter and there must be no duplicates in the list.

For example:

New Dataframe 1:
ID      Paragraph             Names
1      'A B C D E'       [First, Second]
2      'A F G H L'       [First, Fourth]
3      'B J P Q W'   [Second, Third, Fourth]
4      'G F D S A'           [First]

I can only make a solution that has deeply nested for loops and takes a very long time to execute. Is there a solution that has a shorter computation time my thinking is maybe using loc and/or lambda functions.

Any help would be greatly appreciated!

Let me know if there is anything I need to clarify.

English is not my first language so I can try explain more if I need to.

Thank you

Here is the code for the dummy dataframes:

data_1 = {'Paragraph': ['A B C D E', 'A F G H L', 'B J P Q W', 'G F D S A']}
df_1 = pd.DataFrame(data_1)

data_2 = {'Name': ['First', 'Second', 'Third', 'Fourth'],
          'Words': [['A', 'F'], ['B', 'Z'], ['P', 'Q'], ['H', 'J']]}
df_2 = pd.DataFrame(data_2)

CodePudding user response:

You can split and explode Paragraph. Then map the names for each word of the exploded df_2. Finally, aggregate as set to have unique values:

s = df_2.explode('Words').set_index('Words')['Name']
df_1['Names'] = (df_1['Paragraph'].str.split()
                 .explode().map(s).dropna()
                 .groupby(level=0).agg(set)
                )

output:

   Paragraph                    Names
0  A B C D E          {Second, First}
1  A F G H L          {Fourth, First}
2  B J P Q W  {Third, Second, Fourth}
3  G F D S A                  {First}
  • Related