Tying to "filter" years from a text file in python. [Can't use pandas or numpys]-CodePudding

Tying to "filter" years from a text file. [Can't use pandas or numpys]

I only want to keep years 2015-2020. So if year isn't 2015-2020 remove the record alltogether. Origional data looks like:

series_id year period value footnote_codes

LASST480000000000003 1976 M01 5.8
LASST480000000000003 1976 M02 5.8

Previosly, I split the date into list and then I tried to create an "if". . The code doesn't show any errors but is still not filtering the years. Here are a few codes I have tried:

  unwanted_years={'1976','1977','1978','1979','1980','1981','1982','1983','1984','1985','1986','1987','1988','1989','1990'}

  #Filtering years(First attempt):
    if row[1] in unwanted_years:
        rows.strip()

   #Filtering years(2 attempt)
    if row[1] in unwanted_years:
        row.remove()

   #Filtering years(3 attempt)
    if row[1] in unwanted_years:
        row.remove('/t')

Any help is appreciated! Full code looks like this:

  #Creating string for formated columns
  new_colum = '' 
  unwanted_years={'1976','1977','1978','1979','1980','1981','1982','1983','1984','1985','1986','1987','1988','1989','1990'}
  #Calling file
   output_file = open('Desktop/HW3/HW3_1/'   state_names[counter]    '.txt','w')


  #for loop to manipulate file
  for counter, url in enumerate(urls):
print('url:', url)

html = urllib.request.urlopen(url).read().decode('utf-8')  # opening url

rows = html.splitlines()  # split the data in rows. DON'T NEED `1` because I don't need `\n'


    

#Starting counter for rows
    #First row/header
if counter == 0:
    #Creating title for new column
    new_colum = rows[0]  '\t'  "Month_Year" '\n'  
    #Removing unwanted column from header
    new_colum=new_colum.replace('footnote_codes',' ')
    
    #Next rows:
for row in rows[1:]:  # get directly string instead of index
    
  #Filtering years:
    if row[1] == '1976':
        row.strip()
    
       
    #Converting rows into a list
    row = row.split('\t')
           
    #Creating new value using column 1 and 2 (month_year)
    new_value ='\t' row[2]   '_'   row[1]
            
    
    #Appending to list
    row.append(new_value)
    
    #Removing unwanted column from all the data
    row.remove(row[4])
    
          
    #Converting back to a string
    row = '\t'.join(row)
    
   
    #Adding that mont_year column to all the data                    
    new_colum  = row   '\n'  
   
    

#Saving file with changes

output_file.write(new_colum)  # once finished with the for loops then the it will download and close.
output_file.close()

CodePudding user response：

It's simpler to check if the year is wanted.

wanted_years = ("2015","2016","2017","2018","2019","2020")

# just filter rows by list comprehension
filtered_rows = [r for r in rows if all(x not in r for x in wanted_years)]

CodePudding user response：

I assume you can use the csv module since it comes with Python. If your actual delimiter is tab use delimiter='\t' instead, but tabs don't work well on SO so I'll use comma delimiter as an example:

input.csv

series_id,year,period,value,footnote_codes
LASST480000000000003,1976,M01,5.8
LASST480000000000003,1978,M02,5.8
LASST480000000000003,1980,M01,5.8
LASST480000000000003,1982,M01,5.8
LASST480000000000003,1984,M02,5.8
LASST480000000000003,2014,M01,5.8
LASST480000000000003,2015,M01,5.8
LASST480000000000003,2016,M01,5.8
LASST480000000000003,2017,M01,5.8
LASST480000000000003,2018,M01,5.8
LASST480000000000003,2019,M01,5.8
LASST480000000000003,2020,M01,5.8
LASST480000000000003,2021,M01,5.8
LASST480000000000003,2022,M01,5.8

demo.py

import csv

with (open('input.csv', 'r', newline='') as fin,
      open('output.csv', 'w', newline='') as fout):

    reader = csv.reader(fin, delimiter=',')
    writer = csv.writer(fout, delimiter=',')
    writer.writerow(next(reader)) # move headers to output file

    for row in reader:
        if 2015 <= int(row[1]) <= 2020:
            writer.writerow(row)

output.csv

series_id,year,period,value,footnote_codes
LASST480000000000003,2015,M01,5.8
LASST480000000000003,2016,M01,5.8
LASST480000000000003,2017,M01,5.8
LASST480000000000003,2018,M01,5.8
LASST480000000000003,2019,M01,5.8
LASST480000000000003,2020,M01,5.8

If you want to write different columns you can use DictReader/DictWriter. Below will only write the first two columns to the output:

import csv

with (open('input.csv', 'r', newline='') as fin,
      open('output.csv', 'w', newline='') as fout):

    reader = csv.DictReader(fin, delimiter=',')
    writer = csv.DictWriter(fout, delimiter=',',
                            fieldnames=reader.fieldnames[:2],
                            extrasaction='ignore')
    writer.writeheader()
    
    for row in reader:
        if 2015 <= int(row['year']) <= 2020:
            writer.writerow(row)

output.csv

series_id,year
LASST480000000000003,2015
LASST480000000000003,2016
LASST480000000000003,2017
LASST480000000000003,2018
LASST480000000000003,2019
LASST480000000000003,2020