I have a dataframe where the first column, lets call it: df['Name'], looks like the "actual" column, and Id like to change it to look the "desired" column in order to do operations on following columns. Here are the actual and desired outputs:
Name (actual) | Name (desired) |
---|---|
string1 | string1 |
Number | string1 |
Number | string1 |
Number | string1 |
string2 | string2 |
Number | string2 |
Number | string2 |
Number | string2 |
Number | string2 |
string3 | string3 |
Number | string3 |
Number | string3 |
string4 | string4 |
Number | string4 |
etc | etc |
There is no fixed number of 'numbers', between the names. Could be 3, could be 300.
I have the following code to forward fill the names as far as the next name:
df['Name'].fillna(method = 'ffill', inplace = True)
but it only works when the cells with numbers are empty.
So, I need to remove all the numbers from the ['Name'] series first, leaving empty cells:
Name |
---|
String1 |
blank |
blank |
blank |
String2 |
blank |
etc... |
I cant find a way to remove the numbers. Ive tried some suggestions I found in other similar posts:
1)
df[df['Name'].apply(lambda x: isinstance(x, str))]
but it seems to do nothing.
2)
df['Name'] = df['Name'].apply(lambda x: isinstance(x, str))
turns the whole ['Name'] series to True, both strings and numbers.
3)
df['Name'] = df[df['Name'].apply(lambda x: isinstance(x, str))]
which gives a value error.
I found the result to 2) strange, but discovered df['Name'].dtype gave me dtype('O'), which Id never seen before, but suggests the names (strings) and numbers (integers/floats) in the ['Name'] series are the same type (numpy objects). Not sure if/how its relevant, but I understood it to mean that Python sees both the text and numbers as being the same type.
Im stuck. Any suggestions on how to remove the numbers and fill the way I explained?
Thanks!
CodePudding user response:
You're close. Try this :
import pandas as pd
import numpy as np
df = pd.DataFrame({'Name (actual)': ['string1', 334, 34, 124, 'string2', 23, 11, 89, 76, 'string3', 53, 4]})
df['Name (desired)'] = df['Name (actual)'].apply(lambda x: x if isinstance(x, str) else np.nan).ffill()
>>> print(df)
CodePudding user response:
Using apply
is not efficient, prefer a vectorial method:
# identify numbers:
m = pd.to_numeric(df['Name'], errors='coerce').notna()
# mask and ffill:
df['Name'] = df['Name'].mask(m).ffill()
Example (assigning to new column "Name 2" for clarity);
Name Name2
0 string1 string1
1 123 string1
2 123 string1
3 123 string1
4 string2 string2
5 123 string2
6 123 string2
7 123 string2
8 123 string2
9 string3 string3
10 123 string3
11 123 string3