How to skip a point in a .csv file if it is larger than x?

I have data that has some outliers that need to be ignored, but I am struggling to find out how to do this. I need data that is over the value of 500 to be removed/ignored. Below is my code so far:

import pandas as pd
import matplotlib.pyplot as plt

#convert the files to make sure that only the data needed is selected
INPUT_FILE = 'data.csv'
OUTPUT_FILE = 'machine_data.csv'
PACKET_ID = 'machine'

with open(INPUT_FILE, 'r') as f:
    data = f.readlines()
with open(OUTPUT_FILE, 'w') as f:
    for datum in data:
        if datum.startswith(PACKET_ID):
            f.write(datum)

#read the data file
df = pd.read_csv(OUTPUT_FILE, header=None, usecols=[2,10,11,12,13,14])
#plotting the conc
fig,conc = plt.subplots(1,1)
lns1 = conc.plot(df[2],df[11],color="g", label='Concentration')

As you can see, I have selected certain columns that I need, but within [11] I only need the data that is less than 500.

Would anyone be able to advise? Sorry if this has been posted before.

CodePudding user response：

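To ignore outliers greater than 500 in column df[11], you can mask them with NaN. matplotlib leaves NaN points out of the plotted line:

df[11] = df[11].where(df[11] <= 500)

Chaining .dropna() onto this has no effect, because assigning the result back to df[11] realigns it to the original index and the dropped rows come back as NaN. If the rows should be removed entirely, follow up with df = df.dropna(subset=[11]).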

Source: DataFrame.where()
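
The difference between masking and dropping rows may be easier to see on a tiny example. This is only a sketch: the three-row frame below is made-up data that borrows the column labels 2 and 11 from the question, not the asker's real file.

```python
import pandas as pd

# Hypothetical sample standing in for columns 2 and 11 of the real file.
df = pd.DataFrame({2: [0, 1, 2], 11: [120.0, 900.0, 340.0]})

# Masking keeps every row and replaces out-of-range values with NaN.
masked = df[11].where(df[11] <= 500)

# Boolean indexing drops the out-of-range rows altogether.
filtered = df[df[11] <= 500]

print(masked.isna().sum())  # how many points were masked
print(len(filtered))        # how many rows survive the filter
```

Masking preserves the row count, which can matter if other columns are plotted against the same x values; filtering shrinks the whole frame.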

CodePudding user response：

You just have to filter your DataFrame on that column with a boolean mask, which keeps only the rows where the condition is true:

df = df[(df[11] <= 500)]

Your code will then look like this:

import pandas as pd
import matplotlib.pyplot as plt

#convert the files to make sure that only the data needed is selected
INPUT_FILE = 'data.csv'
OUTPUT_FILE = 'machine_data.csv'
PACKET_ID = 'machine'

with open(INPUT_FILE, 'r') as f:
    data = f.readlines()
with open(OUTPUT_FILE, 'w') as f:
    for datum in data:
        if datum.startswith(PACKET_ID):
            f.write(datum)

#read the data file
df = pd.read_csv(OUTPUT_FILE, header=None, usecols=[2,10,11,12,13,14])

# filter your data HERE:
df = df[(df[11] <= 500)]

#plotting the conc
fig,conc = plt.subplots(1,1)
lns1 = conc.plot(df[2],df[11],color="g", label='Concentration')
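
If the intermediate file is only needed for this plot, the threshold could also be applied while the lines are being copied, so the outliers never reach pandas. A rough sketch under some assumptions: fields are separated by plain commas with no quoted commas, and field index 11 holds a number on every 'machine' line. The names keep_line and filter_file and their parameters are hypothetical, not part of the original script.

```python
def keep_line(line, packet_id='machine', column=11, limit=500.0):
    """Return True for lines of the wanted packet whose value is within the limit."""
    if not line.startswith(packet_id):
        return False
    fields = line.rstrip('\n').split(',')
    try:
        return float(fields[column]) <= limit
    except (IndexError, ValueError):
        # Short or non-numeric lines are skipped (an assumption; adjust as needed).
        return False


def filter_file(input_file, output_file):
    """Copy only the lines accepted by keep_line from input_file to output_file."""
    with open(input_file, 'r') as src, open(output_file, 'w') as dst:
        for line in src:
            if keep_line(line):
                dst.write(line)
```

Calling filter_file('data.csv', 'machine_data.csv') would then replace the two with-blocks, and the later df = df[(df[11] <= 500)] step becomes optional.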