Given a data frame:
Text Name
0 aa bb cc Paul
1 ee ff gg hh NA
2 xx yy NA
3 zz zz zz Anton
I want to replace only the cells in column "name" where values are "NA" with the first 3 words from the corresponding row in column "text"
Desired output:
Text Name
0 aa bb cc Paul
1 ee ff gg hh ee ff gg
2 xx yy xx yy
3 zz zz zz Anton
My attempt failed:
[' '.join(x.split()[:3]) for x in df['Text'] if df.loc[df['Name'] == 'NA']]
CodePudding user response:
You can split the Text
column by
then use .str[:3]
to access the first three elements
text = df['Text'].str.split(' ').str[:3].str.join(' ')
df['Name'] = df['Name'].mask(df['Name'].isna(), text)
# or
df.loc[df['Name'].isna(), 'Name'] = text
# or
df['Name'] = np.where(df['Name'].isna(), text, df['Name'])
print(df)
Text Name
0 aa bb cc Paul
1 ee ff gg hh ee ff gg
2 xx yy xx yy
3 zz zz zz Anton
CodePudding user response:
df.Name.mask(df.Name.isna(), df.Text.str.split(' ').str[:3].str.join(' '), inplace=True)
Output:
Text Name
0 aa bb cc Paul
1 ee ff gg hh ee ff gg
2 xx yy xx yy
3 zz zz zz Anton
CodePudding user response:
Let us fix your code
df['new'] = [' '.join(x.split()[:3]) if y !=y else y for x, y in zip(df['Text'],df['Name']) ]
Out[599]: ['Paul', 'ee ff gg', 'xx yy', 'Anton']
CodePudding user response:
You should split text processing via list comprehension and updating of the dataframe into separate steps:
df = pd.DataFrame({"Text": ["aa bb cc", "ee ff gg hh", "xx yy", "zz zz zz"],
"Name": ["Paul", np.nan, np.nan, "Anton"]})
first_3_words = [" ".join(s.split(" ")[:3]) for s in df[df["Name"].isnull().values]["Text"].values]
df.loc[df["Name"].isnull().values, "Name"] = first_3_words
CodePudding user response:
Splitting and joining is costly, you can use a regex for efficiency:
df['Name'] = df['Name'].fillna(df['Text'].str.extract('((?:\w ){,2}\w )', expand=False))
Output:
Text Name
0 aa bb cc Paul
1 ee ff gg hh ee ff gg
2 xx yy xx yy
3 zz zz zz Anton
Regex:
( # start capturing
(?:\w ){,2} # up to 2 words followed by space
\w # one word
) # end capturing