Loop through column to find words' indexes against a list of words-CodePudding

Given a DataFrame, I would like to inspect a column against a list of values and tell whether or not that word exists in the cell, and add the result of the index (character number of where it is located) in a new column.

Example:

df_Words

Words
Dog, Cat, Duck
Cow, Horse, Pig
Monkey, Snake, Rat

xList = [Dog, Rat, Horse]

df_Words

Words	Index
Dog, Cat, Duck	1
Cow, Horse, Pig	6
Monkey, Snake, Rat	16

I tried with functions/loops but can't get it to work appropriately. I can only do it for a single value like:

xList = [Dog, Rat, Horse]
  
for x in xList:

    index = df_Words["Words"].str.find(x)
    df_Words["Indexes"]= index

CodePudding user response：

Create a regex from the list of words and split the string into 2 parts. The length of the first part is the index.

WORDS = fr"\b({'|'.join(xList)})\b"

df['Index'] = df['Words'].str.split(WORDS, n=1, expand=True)[0].apply(len)   1

Output:

>>> df
                Words  Index
0      Dog, Cat, Duck      1
1     Cow, Horse, Pig      6
2  Monkey, Snake, Rat     16

>>> print(WORDS)
\b(Dog|Rat|Horse)\b

CodePudding user response：

Not exactly the expected ouput, but you could try it like this (first seperate the words into a list):

Sample input:

import pandas as pd
df_raw = pd.DataFrame({
 'Words': {0: ['Dog', 'Cat', 'Duck'],
  1: ['Horse', 'Cow', 'Pig'],
  2: ['Monkey', 'Snake', 'Rat']}})
xList = ['Dog', 'Rat', 'Horse']

Code:

df = df_raw.reset_index().copy()
df = df.explode('Words')
df['pos'] = df.groupby(['index']).cumcount()
df = df[df['Words'].isin(xList)]
df = df.groupby(['index'])['pos'].apply(list)

df_raw.join(df)

Output:

                  Words  pos
0      [Dog, Cat, Duck]  [0]
1     [Horse, Cow, Pig]  [0]
2  [Monkey, Snake, Rat]  [2]

The pos gives you the index of the list. E.g. [0] means the element with index 0 of the list in column Words.

CodePudding user response：

Your loop-based approach is on the right track. It's just that you always overwrite the single value of df_Words["Indexes"] with the last index found, and you said in your description that you need a collection of indexes.

So, if df_Words["Indexes"] is a list instead, whether provided to you from somewhere else or initialized by you (df_Words["Indexes"] = []), you can use df_Words["Indexes"].append(index), so you add the last value to the list, rather than overwriting the entire list with a single number.

You can also do it without am explicit loop, although effectively, you can not avoid iteration. I assume words are separated by whitespace, and there are no quoted fields which should be considered a "word", which e.g. some CSV dialects would allow.

xList = [Dog, Rat, Horse]
df_Words["Indexes"] = [index for (index, word) in enumerate(df_Words["Words"].str.split()) if word in xList]