I currently have a CSV column that contains three different datasets (P,T,C) looking like: P= 0.456T= 12.659C=39.285
This is just one out of many records but all the data is separated the same with P,T,C. I have posted the dataframe head below:
I want to separate each individual component (P,T,C) and create a time series analysis.
Currently, I am using str.split() command but it does not seem to work.
My code is as follows:
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from pandas import read_csv
df = pd.read_csv('20220519 - test data.csv')
df.columns = ['node_letter','wind_Dir','Wind Speed','Corrected_Dir','GPS_Corr_Speed','Pressure','Relative_Humidity','Temperature','Dewpoint','Prec_Total','Precip_Intensity','Solar_Radiation','GPS_Data','Date/Time','Supply_Voltage','satus','ETX','CTD_Data']
df[['Pressure','Temperature','Conducivity']] = df['CTD_Data'].str.split('=', expand=True)
Dataframe name: df
Column name: df['CTD_Data']
Do I need to create a seperate dataframe for the aforementioned column?
Any help would be greatly appreciated
CodePudding user response:
try this:
s='P= 0.456T= 12.659C=39.285'
p = re.findall('[PTC]=\ ?[0-9.] ',s.replace(' ', ''))
p
>['P= 0.456', 'T= 12.659', 'C=39.285']
[float(r.split('=')[1]) for r in p]
>[0.456, 12.659, 39.285]
CodePudding user response:
A naive, manual and probably not so efficient (for large datasets) way:
df = pd.DataFrame()
df['problematic_column'] = [('P= 0.456T= 12.659C=39.285')]
df['P'] = float(df['problematic_column'].str.split('=').str[1].str.split('T').str[0])
df['T'] = float(df['problematic_column'].str.split('=').str[2].str.split('C').str[0])
df['C'] = float(df['problematic_column'].str.split('=').str[3])
There must be more elegant solutions using regex out there, but that's probably for you to investigate. This will get the job done.