Column of mixed strings and numbers - need to conditionally convert strings to numerical format-CodePudding

I have a script that processes data from a CSV file, however sometimes one of the columns I am using from the CSV file randomly has some of its numbers enclosed in string characters thus making it a string. In the case it is enclosed in string characters, it is always ="x", i.e., if the number is 0.5, then it is showing as ="0.5" . There are also some files where all numbers display numerically and this is not even an issue.

A column that we will call N might look as such:

N

0.15
="0.20"
0.5
="1.25"
="1.55"
2.9

and so on. The values not enclosed by the =" " are numeric and obviously the values enclosed are not. I would like the column to display as such, where all values are stored as numeric:

N

0.15
0.20
0.5
1.25
1.55
2.9

Any assistance would be much appreciated - thank you!

CodePudding user response：

You can use str.strip to strip any unwanted strings from the value. For example:

import csv

with open('your_file.csv', 'r') as f_in:
    reader = csv.reader(f_in)
    header = next(reader) # read header
    next(reader) # skip the empty line

    print(header[0])
    print()

    for row in reader:
        print(row[0].strip('="'))

Prints:

N

0.15
0.20
0.5
1.25
1.55
2.9

The content of your_file.csv was:

N

0.15
="0.20"
0.5
="1.25"
="1.55"
2.9

CodePudding user response：

You can also do this task with the help of pandas library conveniently. Though it might be too much for this task.

# import libraries
import pandas as pd
import re

# create a function to clean data of a column
def clean_data(x):
    if match := re.search(r'=\"(.*?)\"', x):
        return match.group(1)
    return x

# read the csv file
df = pd.read_csv("testing.csv")

# apply the cleaning function to the column's data.
df['N'] = df['N'].apply(clean_data)

# convert the column to numeric
df['N'] = pd.to_numeric(df['N'])

Output:
0    0.15
1    0.20
2    0.50
3    1.25
4    1.55
5    2.90
Name: N, dtype: float64

CodePudding user response：

If you read the CSV file with the built-in csv module, then everything you process is text, if you're using pandas.read_csv, then you'll have a mix of string and floats.

With the csv module:

with open("filename.csv", "r", newline="") as f:
    reader = csv.DictReader(f)
    for row in reader:
        n_raw = row["N"]
        n = n_raw[2:-2] if n_raw.startswith("=") else n_raw
        print(n)

prints

0.15
0.2
0.5
1.2
1.5
2.9

With pandas (turning everything into floats).

import pandas as pd

df = pd.read_csv("filename.csv")
mask = df["N"].str.startswith("=")
df.loc[mask, "N"] = df.loc[mask, "N"].str.slice(start=2, stop=-2)
df["N"] = pd.to_numeric(df["N"])

df contains

CodePudding user response：

Thanks everyone for the quick responses. In the meanwhile, the following also ended up working for me. I ended up converting everything in the column to strings, then dropping the =" " characters, then converting everything to float.

df['N'] = df['N'].astype(str)
df['N'] = df['N'].str.replace('="', '')
df['N'] = df['N'].str.replace('"', '')
df['N'] = df['N'].astype(float)