I have gridded emission .txt files which are in multiple folders, each .txt file has more than 1 million rows so excel can't open it and I need to filter data in each .txt file for specific latitude and longitude. Here I want to give a path for the folder so python can read each file from a folder and filter the data for the specific lat long
import pandas as pd
with open(r"C:\Users\VOC_spec_.txt",'r') as f2:
data = f2.read()
df = pd.DataFrame(columns=['lat', 'lon', 'emission']) # create the dataframe
df_sub = subset(df, Lat >= 28 & Lat <= 33 & Long >= 75 & Long <= 97)
this gives an error name 'subset' is not defined
Does anyone know how to write it efficiently and save in .txt format. It will be great help
CodePudding user response:
subset
is a function from R
. Pandas has some options as query
:
df_sub=df.query('lat >= 28 & lat <= 33 & lon >= 75 & lon <= 97')
Or you can also use boolean indexing:
df_sub=df[(df['lat']>=28) & (df['lat']<=33) &(df['lon']>=75) &(df['lon']<=97) ]
CodePudding user response:
You can use the following code to filter a txt file using pandas. Note: you should define the delimiter correctly! it can be a whitespace or a comma or other things.
import pandas as pd
import os
delimiter='\t'
for filename in os.listdir("C:\Users"):
if filename.endswith(".txt"):
file_path = os.path.join(folder_path, filename)
df = pd.read_csv(file_path, delimiter=delimiter)
df_sub = df[(df['Lat'] >= 28) & (df['Lat'] <= 33) & (df['Long'] >= 75) & (df['Long'] <= 97)]
You can then do something with df_sub here, like saving it to a new file. I would suggest write them as a csv file:
new_file_path = os.path.join(folder_path, "filtered_" filename[:-4] ".csv")
df_sub.to_csv(new_file_path, sep=',', index=False)
CodePudding user response:
You can easily do it without Dataframes:
with open(r"C:\Users\VOC_spec_.txt", "r") as file:
data = file.read().strip(" \n").split("\n")
filtered = []
for line in data:
l = line.split(',')
lat, long, em = float(l[0].strip()), float(l[1].strip()), float(l[2].strip())
if lat >= 28 and lat <= 33 and long >= 75 and long <= 97:
filtered.append([lat, long, em])
Now filtered
only contains the matching lines. Note that I split each line with ','
assuming your CSV is comma-delimited. Replace that with the delimiter matching your file.