FIlrer csv table to have just 2 columns. Python pandas pd .pd-CodePudding

i got .csv file with lines like this :

result,table,_start,_stop,_time,_value,_field,_measurement,device
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:35Z,44.61,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:40Z,17.33,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:45Z,41.2,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:51Z,33.49,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:56Z,55.68,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:12:57Z,55.68,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:13:02Z,25.92,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0
,0,2022-10-23T08:22:04.124457277Z,2022-11-22T08:22:04.124457277Z,2022-10-24T12:13:08Z,5.71,power,shellies,Shelly_Kitchen-C_CoffeMachine/relay/0

I need to make them look like this:

                   time  value
0  2022-10-24T12:12:35Z  44.61
1  2022-10-24T12:12:40Z  17.33
2  2022-10-24T12:12:45Z  41.20
3  2022-10-24T12:12:51Z  33.49
4  2022-10-24T12:12:56Z  55.68

I will need that for my anomaly detection code so I dont have to manualy delete columns and so on. At least not all of them. I cant do it with the program that works with the mashine that collect wattage info. I tried this but it doeasnt work enough:

df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df['_time'] = pd.to_datetime(df['_time'], format='%Y-%m-%dT%H:%M:%SZ')
df = pd.pivot(df, index = '_time', columns = '_field', values = '_value')
df.interpolate(method='linear') # not neccesary

It gives this output:

            0
9      83.908
10     80.342
11     79.178
12     75.621
13     72.826
...       ...
73522  10.726
73523   5.241

CodePudding user response：

Here is the canonical way to project down to a subset of columns in the pandas ecosystem.

df = df[['_time', '_value']]

CodePudding user response：

Here is an example of how you can do this:

import pandas as pd

# Read the CSV file into a pandas DataFrame
df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')

# Convert the _time column to a datetime type
df['_time'] = pd.to_datetime(df['_time'], format='%Y-%m-%dT%H:%M:%SZ')

# Drop the columns that you don't need
df = df.drop(columns=['result', 'table', '_start', '_stop', '_field', '_measurement', 'device'])

# Set the index of the DataFrame to the _time column
df = df.set_index('_time')

# Print the resulting DataFrame
print(df)

This code reads the CSV file into a pandas DataFrame, converts the _time column to a datetime type, drops the columns that you don't need, sets the index of the DataFrame to the _time column, and then prints the resulting DataFrame.

You can modify this code to suit your specific needs and requirements. For example, you can use the interpolate method to fill in missing values in the _value column, or you can use the resample method to resample the data to a different time interval.

CodePudding user response：

You can simply use the keyword argument usecols of pandas.read_csv :

df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv', usecols=["_time", "_value"])

NB: If you need to read the entire data of your (.csv) and only then select a subset of columns, Pandas core developers suggest you to use pandas.DataFrame.loc. Otherwise, by using df = df[subset_of_cols] synthax, the moment you'll start doing some operations on the (new?) sub-dataframe, you'll get a warning :

SettingWithCopyWarning:
A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

So, in your case you can use :

df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df = df.loc[:, ["_time", "_value"]] #instead of df[["_time", "_value"]]

Another option is pandas.DataFrame.copy,

df = pd.read_csv('coffee_machine_2022-11-22_09_22_influxdb_data.csv')
df = df[["_time", "_value"]].copy()

CodePudding user response：

.read_csv has a usecols parameter to specify which columns you want in the DataFrame.

df = pd.read_csv(f,header=0,usecols=['_time','_value'] )
print(df)

                  _time  _value
0  2022-10-24T12:12:35Z   44.61
1  2022-10-24T12:12:40Z   17.33
2  2022-10-24T12:12:45Z   41.20
3  2022-10-24T12:12:51Z   33.49
4  2022-10-24T12:12:56Z   55.68
5  2022-10-24T12:12:57Z   55.68
6  2022-10-24T12:13:02Z   25.92
7  2022-10-24T12:13:08Z    5.71