Given a DataFrame, I would like to inspect a column against a list of values and tell whether or not that word exists in the cell, and add the result of the index (character number of where it is located) in a new column.
Example:
df_Words
Words |
---|
Dog, Cat, Duck |
Cow, Horse, Pig |
Monkey, Snake, Rat |
xList = [Dog, Rat, Horse]
df_Words
Words | Index |
---|---|
Dog, Cat, Duck | 1 |
Cow, Horse, Pig | 6 |
Monkey, Snake, Rat | 16 |
I tried with functions/loops but can't get it to work appropriately. I can only do it for a single value like:
xList = [Dog, Rat, Horse]
for x in xList:
index = df_Words["Words"].str.find(x)
df_Words["Indexes"]= index
CodePudding user response:
Create a regex from the list of words and split the string into 2 parts. The length of the first part is the index.
WORDS = fr"\b({'|'.join(xList)})\b"
df['Index'] = df['Words'].str.split(WORDS, n=1, expand=True)[0].apply(len) 1
Output:
>>> df
Words Index
0 Dog, Cat, Duck 1
1 Cow, Horse, Pig 6
2 Monkey, Snake, Rat 16
>>> print(WORDS)
\b(Dog|Rat|Horse)\b
CodePudding user response:
Not exactly the expected ouput, but you could try it like this (first seperate the words into a list):
Sample input:
import pandas as pd
df_raw = pd.DataFrame({
'Words': {0: ['Dog', 'Cat', 'Duck'],
1: ['Horse', 'Cow', 'Pig'],
2: ['Monkey', 'Snake', 'Rat']}})
xList = ['Dog', 'Rat', 'Horse']
Code:
df = df_raw.reset_index().copy()
df = df.explode('Words')
df['pos'] = df.groupby(['index']).cumcount()
df = df[df['Words'].isin(xList)]
df = df.groupby(['index'])['pos'].apply(list)
df_raw.join(df)
Output:
Words pos
0 [Dog, Cat, Duck] [0]
1 [Horse, Cow, Pig] [0]
2 [Monkey, Snake, Rat] [2]
The pos
gives you the index of the list. E.g. [0] means the element with index 0 of the list in column Words.
CodePudding user response:
Your loop-based approach is on the right track.
It's just that you always overwrite the single value of df_Words["Indexes"]
with the last index found, and you said in your description that you need a collection of indexes.
So, if df_Words["Indexes"]
is a list instead, whether provided to you from somewhere else or initialized by you (df_Words["Indexes"] = []
), you can use df_Words["Indexes"].append(index)
, so you add the last value to the list, rather than overwriting the entire list with a single number.
You can also do it without am explicit loop, although effectively, you can not avoid iteration. I assume words are separated by whitespace, and there are no quoted fields which should be considered a "word", which e.g. some CSV dialects would allow.
xList = [Dog, Rat, Horse]
df_Words["Indexes"] = [index for (index, word) in enumerate(df_Words["Words"].str.split()) if word in xList]