I have a df
that looks like
df
:
A | B | C | D |
---|---|---|---|
2.0 | NaN | Text String 1 | 3.0 |
Text String 2 | NaN | 3.0 | 2.0 |
2.0 | Text String 3 | 3.0 | NaN |
I want create a new column new_column
that look for all the row values in the column, and as long as any column contains string like [^A-Za-z\s]
, my new_column
get that exact value
for example, my desired output is something like
A | B | C | D | new_column |
---|---|---|---|---|
2.0 | NaN | Text String 1 | 3.0 | Text String 1 |
Text String 2 | NaN | 3.0 | 2.0 | Text String 2 |
2.0 | Text String 3 | 3.0 | NaN | Text String 3 |
I've tried str.contains
or other df methods but the problem is to get the value for any column that has a text value, not from any specific column.
Thanks!
CodePudding user response:
You can reshape by DataFrame.stack
and select strings columns by regex:
s = df.stack()
df['new_column'] = s[s.astype('str').str.contains('(^[A-Za-z\s])')].droplevel(1)
Similar idea is comvert values to numeric if possible and get non numeric strings values:
s = df.stack()
df['new_column'] = s[pd.to_numeric(s, errors='coerce').isna()].droplevel(1)
--
Or join values by ,
and select strings with number pattern:
df['new_column'] = df.astype('str').agg(','.join, 1).str.extract('([A-Za-z\s] \d )')
print (df)
A B C D new_column
0 2.0 NaN Text String 1 3.0 Text String 1
1 Text String 2 NaN 3.0 2.0 Text String 2
2 2.0 Text String 3 3.0 NaN Text String 3
CodePudding user response:
You can write a function to select the string value from a row and then apply that to the dataFrame:
def get_str(row):
return [v for v in row if type(v) is str][0]
df['new_column'] = df.apply(get_str, axis=1)
Output:
A B C D new_column
0 2.0 NaN Text String 1 3.0 Text String 1
1 Text String 2 NaN 3.0 2.0 Text String 2
2 2.0 Text String 3 3.0 NaN Text String 3
If a row might not contain a string value, you will need to check for that situation in get_str
and return a default value e.g.
def get_str(row):
try:
return [v for v in row if type(v) is str][0]
except:
return np.nan
df = pd.DataFrame({ 'A' : [2.0, 'Text String 2', 2.0, 3.0], 'B': [np.nan, np.nan, 'Text String 3', 2.0], 'C': ['Text String 1', 3.0, 3.0, np.nan], 'D': [3.0, 2.0, np.nan, 1.0] })
df['new_column'] = df.apply(get_str, axis=1)
Output:
A B C D new_column
0 2.0 NaN Text String 1 3.0 Text String 1
1 Text String 2 NaN 3.0 2.0 Text String 2
2 2.0 Text String 3 3.0 NaN Text String 3
3 3.0 2.0 NaN 1.0 NaN