I have a script that processes data from a CSV file, however sometimes one of the columns I am using from the CSV file randomly has some of its numbers enclosed in string characters thus making it a string. In the case it is enclosed in string characters, it is always ="x", i.e., if the number is 0.5, then it is showing as ="0.5" . There are also some files where all numbers display numerically and this is not even an issue.
A column that we will call N might look as such:
N
0.15
="0.20"
0.5
="1.25"
="1.55"
2.9
and so on. The values not enclosed by the =" " are numeric and obviously the values enclosed are not. I would like the column to display as such, where all values are stored as numeric:
N
0.15
0.20
0.5
1.25
1.55
2.9
Any assistance would be much appreciated - thank you!
CodePudding user response:
You can use str.strip
to strip any unwanted strings from the value. For example:
import csv
with open('your_file.csv', 'r') as f_in:
reader = csv.reader(f_in)
header = next(reader) # read header
next(reader) # skip the empty line
print(header[0])
print()
for row in reader:
print(row[0].strip('="'))
Prints:
N
0.15
0.20
0.5
1.25
1.55
2.9
The content of your_file.csv
was:
N
0.15
="0.20"
0.5
="1.25"
="1.55"
2.9
CodePudding user response:
You can also do this task with the help of pandas library conveniently. Though it might be too much for this task.
# import libraries
import pandas as pd
import re
# create a function to clean data of a column
def clean_data(x):
if match := re.search(r'=\"(.*?)\"', x):
return match.group(1)
return x
# read the csv file
df = pd.read_csv("testing.csv")
# apply the cleaning function to the column's data.
df['N'] = df['N'].apply(clean_data)
# convert the column to numeric
df['N'] = pd.to_numeric(df['N'])
Output:
0 0.15
1 0.20
2 0.50
3 1.25
4 1.55
5 2.90
Name: N, dtype: float64
CodePudding user response:
If you read the CSV file with the built-in csv
module, then everything you process is text, if you're using pandas.read_csv
, then you'll have a mix of string and floats.
With the csv
module:
with open("filename.csv", "r", newline="") as f:
reader = csv.DictReader(f)
for row in reader:
n_raw = row["N"]
n = n_raw[2:-2] if n_raw.startswith("=") else n_raw
print(n)
prints
0.15
0.2
0.5
1.2
1.5
2.9
With pandas (turning everything into floats).
import pandas as pd
df = pd.read_csv("filename.csv")
mask = df["N"].str.startswith("=")
df.loc[mask, "N"] = df.loc[mask, "N"].str.slice(start=2, stop=-2)
df["N"] = pd.to_numeric(df["N"])
df
contains
N
0 0.15
1 0.2
2 0.5
3 1.2
4 1.5
5 2.9
CodePudding user response:
Thanks everyone for the quick responses. In the meanwhile, the following also ended up working for me. I ended up converting everything in the column to strings, then dropping the =" " characters, then converting everything to float.
df['N'] = df['N'].astype(str)
df['N'] = df['N'].str.replace('="', '')
df['N'] = df['N'].str.replace('"', '')
df['N'] = df['N'].astype(float)