Pandas: efficient way to replace entire string with a substring

I have a dataframe that looks like this:

import pandas as pd

df = pd.DataFrame({
    'name': ['John','William', 'Nancy', 'Susan', 'Robert', 'Lucy', 'Blake', 'Sally', 'Bruce'],
    'injury': ['right hand broken', 'lacerated left foot', 'foot broken', 'right foot fractured', '', 'sprained finger', 'chest pain', 'swelling in arm', 'laceration to arms, hands, and foot']
    })


    name      injury
0   John      right hand broken
1   William   lacerated left foot
2   Nancy     foot broken
3   Susan     right foot fractured
4   Robert  
5   Lucy      sprained finger
6   Blake     chest pain
7   Sally     swelling in arm
8   Bruce     laceration to arms, hands, and foot      <-- this is a weird case, since there are multiple body parts

Notably, some of the values in the injury column are blank.

I want to replace each value in the injury column with only the affected body part. In my case, that would be hand, foot, finger, chest, and arm. There are dozens more; this is just a small example.

The desired dataframe would look like this:

    name      injury
0   John      hand
1   William   foot
2   Nancy     foot
3   Susan     foot
4   Robert  
5   Lucy      finger
6   Blake     chest
7   Sally     arm
8   Bruce     arm, hand, foot

I could do something like this:

df.loc[df['injury'].str.contains('hand'), 'injury'] = 'hand'
df.loc[df['injury'].str.contains('foot'), 'injury'] = 'foot'
df.loc[df['injury'].str.contains('finger'), 'injury'] = 'finger'
df.loc[df['injury'].str.contains('chest'), 'injury'] = 'chest'
df.loc[df['injury'].str.contains('arm'), 'injury'] = 'arm'

But this is repetitive and won't scale to dozens of body parts.

Is there a more elegant way to do this? (e.g. using a dictionary)

(any advice on that last case with multiple body parts would be appreciated)

Thank you!

CodePudding user response：

The standard way to get the first match of a regex on a string column is to use .str.extract(). See the pandas quickstart "10 minutes to pandas" and the user guide section "Working with text data".

df['injury'].str.extract('(arm|chest|finger|foot|hand)', expand=False)

0      hand
1      foot
2      foot
3      foot
4       NaN
5    finger
6     chest
7       arm
8       arm
Name: injury, dtype: object

Note that row 4 returns NaN rather than '' (though it's trivial to apply .fillna('') to the result). More importantly, for row 8 this returns only the first match, not all matches. You need to decide how you want to handle that case; see .str.extractall().
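
For the multi-match case in row 8, one option besides .str.extractall() is .str.findall(), which returns every match per row as a list. The following is a minimal sketch, not a definitive solution: the names body_parts, pattern, and all_parts are illustrative, and it assumes that joining the distinct matches with ', ' is the format you want. Building the pattern from a list keeps it manageable when there are dozens of body parts.

```python
import re
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'William', 'Nancy', 'Susan', 'Robert', 'Lucy', 'Blake', 'Sally', 'Bruce'],
    'injury': ['right hand broken', 'lacerated left foot', 'foot broken',
               'right foot fractured', '', 'sprained finger', 'chest pain',
               'swelling in arm', 'laceration to arms, hands, and foot'],
})

# Illustrative list; extend it with the remaining body parts.
body_parts = ['arm', 'chest', 'finger', 'foot', 'hand']

# Build the alternation from the list, escaping any regex metacharacters.
pattern = '|'.join(map(re.escape, body_parts))

# findall returns a list of all matches per row (an empty list if none).
matches = df['injury'].str.findall(pattern)

# dict.fromkeys drops repeated matches while keeping first-seen order.
all_parts = matches.apply(lambda parts: ', '.join(dict.fromkeys(parts)))

print(all_parts)
```

Because an empty list joins to '', blank rows stay blank and no .fillna('') is needed. The pattern has no word boundaries, so 'arm' also matches inside longer words such as 'forearm'; wrapping the pattern as r'\b(?:...)\b' would prevent that, but then plurals such as 'arms' would have to be listed explicitly.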

CodePudding user response：

I think you should maintain a list of body parts and use the apply function:

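body_parts = ['hand', 'foot', 'finger', 'chest', 'arm']

def test(value):
    # Collect every body part that occurs as a substring of the description.
    body_text = []
    for body_part in body_parts:
        if body_part in value:
            body_text.append(body_part)
    if body_text:
        return ', '.join(body_text)
    # No body part found: leave the value (e.g. an empty string) unchanged.
    return value

df['injury'] = df['injury'].apply(test)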

Result:

name    injury
0   John    hand
1   William foot
2   Nancy   foot
3   Susan   foot
4   Robert  
5   Lucy    finger
6   Blake   chest
7   Sally   arm
8   Bruce   hand, foot, arm

CodePudding user response：

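# Plural forms are listed separately because whole words are compared.
selected_words = ["hand", "foot", "finger", "chest", "arms", "arm", "hands"]

df["injury"] = (
    df["injury"]
    .str.replace(",", "")            # drop commas so "arms," becomes "arms"
    .str.split(" ", expand=False)    # split each description into words
    # Keep only the listed words; set() removes duplicates, but its
    # iteration order is arbitrary, so the order within a row may vary.
    .apply(lambda x: ", ".join(set([i for i in x if i in selected_words])))
)
print(df)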
print(df)

      name             injury
0     John               hand
1  William               foot
2    Nancy               foot
3    Susan               foot
4   Robert                   
5     Lucy             finger
6    Blake              chest
7    Sally                arm
8    Bruce  arms, foot, hands
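
The output above keeps the plural forms (arms, hands), and because a set is used, the order within row 8 is not guaranteed. If the singular names from the desired output are wanted, the dictionary idea from the question is one possibility. This is only a sketch: the mapping canonical and the helper to_body_parts are hypothetical names, and the word list is an assumption that would need extending for real data.

```python
import pandas as pd

df = pd.DataFrame({
    'name': ['John', 'William', 'Nancy', 'Susan', 'Robert', 'Lucy', 'Blake', 'Sally', 'Bruce'],
    'injury': ['right hand broken', 'lacerated left foot', 'foot broken',
               'right foot fractured', '', 'sprained finger', 'chest pain',
               'swelling in arm', 'laceration to arms, hands, and foot'],
})

# Hypothetical mapping from each accepted word to the name to report.
canonical = {
    'hand': 'hand', 'hands': 'hand',
    'foot': 'foot', 'feet': 'foot',
    'finger': 'finger', 'fingers': 'finger',
    'chest': 'chest',
    'arm': 'arm', 'arms': 'arm',
}

def to_body_parts(text):
    # Lowercase, treat commas as separators, and split into whole words.
    words = text.lower().replace(',', ' ').split()
    # Translate known words to their canonical name; ignore everything else.
    parts = [canonical[w] for w in words if w in canonical]
    # dict.fromkeys removes duplicates while preserving first-seen order.
    return ', '.join(dict.fromkeys(parts))

result = df['injury'].apply(to_body_parts)
print(result)
```

Since whole words are compared, 'arm' is not picked up inside an unrelated longer word, at the cost of having to list each variant in the dictionary.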