Replace value in a pandas data frame column based on a condition-CodePudding

Given a data frame:

   Text        Name
0  aa bb cc    Paul
1  ee ff gg hh NA
2  xx yy       NA
3  zz zz zz    Anton

I want to replace only the cells in column "name" where values are "NA" with the first 3 words from the corresponding row in column "text"

Desired output:

   Text        Name
0  aa bb cc    Paul
1  ee ff gg hh ee ff gg
2  xx yy       xx yy
3  zz zz zz    Anton

My attempt failed:

[' '.join(x.split()[:3]) for x in df['Text'] if df.loc[df['Name'] == 'NA']]

CodePudding user response：

You can split the Text column by then use .str[:3] to access the first three elements

text = df['Text'].str.split(' ').str[:3].str.join(' ')

df['Name'] = df['Name'].mask(df['Name'].isna(), text)
# or
df.loc[df['Name'].isna(), 'Name'] = text
# or
df['Name'] = np.where(df['Name'].isna(), text, df['Name'])

print(df)

          Text      Name
0     aa bb cc      Paul
1  ee ff gg hh  ee ff gg
2        xx yy     xx yy
3     zz zz zz     Anton

CodePudding user response：

df.Name.mask(df.Name.isna(), df.Text.str.split(' ').str[:3].str.join(' '), inplace=True)

Output:

          Text      Name
0     aa bb cc      Paul
1  ee ff gg hh  ee ff gg
2        xx yy     xx yy
3     zz zz zz     Anton

CodePudding user response：

Let us fix your code

df['new'] = [' '.join(x.split()[:3]) if y !=y else y for x, y  in zip(df['Text'],df['Name']) ]
Out[599]: ['Paul', 'ee ff gg', 'xx yy', 'Anton']

CodePudding user response：

You should split text processing via list comprehension and updating of the dataframe into separate steps:

df = pd.DataFrame({"Text": ["aa bb cc", "ee ff gg hh", "xx yy", "zz zz zz"], 
                   "Name": ["Paul", np.nan, np.nan, "Anton"]})

first_3_words = [" ".join(s.split(" ")[:3]) for s in df[df["Name"].isnull().values]["Text"].values]


df.loc[df["Name"].isnull().values, "Name"] = first_3_words

CodePudding user response：

Splitting and joining is costly, you can use a regex for efficiency:

df['Name'] = df['Name'].fillna(df['Text'].str.extract('((?:\w  ){,2}\w )', expand=False))

Output:

          Text      Name
0     aa bb cc      Paul
1  ee ff gg hh  ee ff gg
2        xx yy     xx yy
3     zz zz zz     Anton

Regex:

(            # start capturing
(?:\w  ){,2} # up to 2 words followed by space
\w           # one word
)            # end capturing