I want to change the DataFrame cell values as shown below:
index | col1 (before) | col1 (after)
---|---|---
0 | 10.0 | 10.0
1 | 20 (15) | 20.0
2 | ND | None
3 | 30.0 | 30.0
4 | 40.0 | 40.0
import pandas as pd

df = pd.DataFrame([10.0, '20 (15)', 'ND', 30.0, 40.0], columns=['col1'])
for data in df['col1']:
    if type(data) is str:
        temp = data.split(' ')[0]
        if data == 'ND':
            data = None
        else:
            data = float(temp)
This code doesn't update the DataFrame values.
Help, please.
CodePudding user response:
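Use the pandas alternative Series.str.split (or Series.str.rsplit) and select the first item. Because the .str accessor returns missing values for the non-string cells (the numbers), restore those with Series.fillna, then convert to numeric with to_numeric. Passing errors='coerce' turns anything that is not a number into a missing value: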
df['col1'] = pd.to_numeric(df['col1'].str.split().str[0]
                                     .fillna(df['col1']), errors='coerce')
print(df)
col1
0 10.0
1 20.0
2 NaN
3 30.0
4 40.0
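To see why the fillna step is needed, the chain above can be broken into intermediate steps. This is only a sketch on the same sample data; the names first_token, filled and result are illustrative and not part of the original answer.

```python
import pandas as pd

df = pd.DataFrame([10.0, '20 (15)', 'ND', 30.0, 40.0], columns=['col1'])

# .str methods return NaN for non-string cells, so the floats are lost here
first_token = df['col1'].str.split().str[0]

# Restore the original numbers wherever the string step produced NaN
filled = first_token.fillna(df['col1'])

# 'ND' cannot be parsed, so errors='coerce' turns it into NaN
result = pd.to_numeric(filled, errors='coerce')
```

Printing first_token shows NaN in rows 0, 3 and 4, which is exactly what fillna repairs before the numeric conversion.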
If you need to extract the first integer or float, use Series.str.extract:
df = pd.DataFrame(['*10.0', '20 (15)', 'ND', 30.0, 40.0], columns=['col1'])
df['col1'] = pd.to_numeric(df['col1'].str.extract(r'(\d+\.\d+|\d+)', expand=False)
                                     .fillna(df['col1']), errors='coerce')
print(df)
col1
0 10.0
1 20.0
2 NaN
3 30.0
4 40.0
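The extraction pattern can be checked outside pandas with Python's re module. The sketch below assumes the intended pattern is `(\d+\.\d+|\d+)`, that is, a decimal number if one is present and otherwise a plain integer; the helper first_number is hypothetical and exists only for illustration.

```python
import re

# Assumed pattern: try a decimal number first, otherwise a plain integer
PATTERN = re.compile(r'(\d+\.\d+|\d+)')

def first_number(text):
    """Return the first number found in text as a float, or None."""
    match = PATTERN.search(text)
    return float(match.group(1)) if match else None

samples = ['*10.0', '20 (15)', 'ND']
parsed = [first_number(s) for s in samples]
```

Here parsed holds 10.0, 20.0 and None: the leading `*` is skipped, only the first number in '20 (15)' is taken, and 'ND' has no match.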
CodePudding user response:
You shouldn't modify your data in a loop. In your case, when you assign a new value to the variable data, that name is no longer linked to the DataFrame's data, so the DataFrame is left unchanged. In addition, while there are methods to do this, looping over rows is inefficient.
You can use vectorized code instead:
df['col1'] = pd.to_numeric(df['col1'].astype(str).str.extract(r'([.\d]+)',
                           expand=False), errors='coerce')
Or, if you want to ensure valid floats as independent words:
df['col1'] = pd.to_numeric(df['col1'].astype(str).str.extract(r'\b(\d+(?:\.\d+)?\b)',
                           expand=False), errors='coerce')
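The point about the loop variable can be illustrated with a small sketch. Reassigning the name data inside the loop only rebinds a local name, so the column is left untouched. Writing through df.loc with the index label does change the column, although the vectorized versions remain preferable. The clean helper is hypothetical and added only for illustration.

```python
import pandas as pd

df = pd.DataFrame([10.0, '20 (15)', 'ND', 30.0, 40.0], columns=['col1'])

# Rebinding the loop variable does not write back to the DataFrame
for data in df['col1']:
    if isinstance(data, str):
        data = None
unchanged = df['col1'].tolist()

def clean(value):
    """Hypothetical helper: keep the first token, map 'ND' to None."""
    if isinstance(value, str):
        token = value.split(' ')[0]
        return None if token == 'ND' else float(token)
    return value

# Writing through .loc with the index label does modify the column
for idx, value in list(df['col1'].items()):
    df.loc[idx, 'col1'] = clean(value)
```

After the first loop, unchanged still contains the original strings; only the second loop alters df.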