I have two pandas DataFrames in Python which are set up as follows:
Dataframe 1:
ID Paragraph
1 'A B C D E'
2 'A F G H L'
3 'B J P Q W'
4 'G F D S A'
Where Paragraph is a string of multiple words.
Dataframe 2:
ID Name Words
1 First ['A', 'F']
2 Second ['B', 'Z']
3 Third ['P', 'Q']
4 Fourth ['H', 'J']
Name is just a string identifying the Words, and Words is a list of strings.
What I want is an expression that identifies which Paragraphs in Dataframe 1 contain Words from Dataframe 2, and stores the matching Names in a new column in Dataframe 1. The new column should contain a list of all the Names whose Words occur in the paragraph. The order does not matter and there must be no duplicates in the list.
For example:
New Dataframe 1:
ID Paragraph Names
1 'A B C D E' [First, Second]
2 'A F G H L' [First, Fourth]
3 'B J P Q W' [Second, Third, Fourth]
4 'G F D S A' [First]
I can only come up with a solution that uses deeply nested for loops and takes a very long time to execute (a rough sketch is included below, after the dummy dataframes). Is there a solution with a shorter computation time? My thinking is that it might involve loc and/or lambda functions.
Any help would be greatly appreciated!
Let me know if there is anything I need to clarify.
English is not my first language, so I can try to explain more if needed.
Thank you
Here is the code for the dummy dataframes:
import pandas as pd

data_1 = {'Paragraph': ['A B C D E', 'A F G H L', 'B J P Q W', 'G F D S A']}
df_1 = pd.DataFrame(data_1)

data_2 = {'Name': ['First', 'Second', 'Third', 'Fourth'],
          'Words': [['A', 'F'], ['B', 'Z'], ['P', 'Q'], ['H', 'J']]}
df_2 = pd.DataFrame(data_2)
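For reference, my current slow approach looks roughly like this (simplified, not my exact code):

names_column = []
for paragraph in df_1['Paragraph']:
    words_in_paragraph = paragraph.split()
    matched_names = []
    # check every word of every df_2 row against the current paragraph
    for _, row in df_2.iterrows():
        for word in row['Words']:
            if word in words_in_paragraph and row['Name'] not in matched_names:
                matched_names.append(row['Name'])
    names_column.append(matched_names)
df_1['Names'] = names_column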
CodePudding user response:
You can split and explode Paragraph, then map each word to its Name using the exploded df_2, and finally aggregate as set to keep unique values:
# build a Series mapping each word to its Name
s = df_2.explode('Words').set_index('Words')['Name']

# split each paragraph into words, map them to Names,
# and collect the unique Names per row
df_1['Names'] = (df_1['Paragraph'].str.split()
                 .explode().map(s).dropna()
                 .groupby(level=0).agg(set)
                )
output:
   Paragraph                    Names
0  A B C D E          {Second, First}
1  A F G H L          {Fourth, First}
2  B J P Q W  {Third, Second, Fourth}
3  G F D S A                  {First}
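If you prefer lists instead of sets in the Names column, you can convert afterwards. Note that in this example every paragraph matches at least one Name; a paragraph with no matches would end up as NaN after the dropna/groupby step, so the conversion below also turns NaN into an empty list:

# turn each set into a list; rows without any match (NaN) become an empty list
df_1['Names'] = df_1['Names'].apply(lambda x: list(x) if isinstance(x, set) else [])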