I have 12 files:
['01_2021.csv', '02_2021.csv', '03_2021.csv', '04_2021.csv', '05_2021.csv', '06_2021.csv', '07_2021.csv', '08_2021.csv', '09_2021.csv', '10_2021.csv', '11_2020.csv', '12_2020.csv']
My CSV file structure (a sample file in mypath):
id  itemName  NonImportantEntries  Entries  SomeOtherEntries
1   item1     27                   111      163
2   item2     16                   22       98
Each file has an "id" column with unique values. I want to scan across all the files to determine in which file name a given id was last seen. Could someone help with this?
My Code so far:
import os
import pandas as pd
#get your working directory and target folder that contains all your files
path = os.path.join(os.getcwd(),'folder')
files = [os.path.join(path,i) for i in os.listdir(path) if os.path.isfile(os.path.join(path,i))]
files.remove(os.path.join(path, '.DS_Store'))  # drop the macOS metadata file
files.sort()
# I'm stuck here: the code below adds the file name as a column, but it just stacks
# the rows from every file into one data frame. How do I reduce that to a single row
# per id, keeping the name of the file (month) in which that id was last seen?
# The files cover months between 2020 and 2021.
df = pd.DataFrame()
for file in files:
    _df = pd.read_csv(file)
    _df['fileName'] = os.path.split(file)[-1]
    df = df.append(_df)
Expected finalFile.csv format:
id lastSeen
1 06_2021
2 12_2020
3 10_2021
...
45000 07_2021
Thanks in advance for any help with this!
CodePudding user response:
Try:
- Read only the necessary ("id") column with pd.read_csv and insert a "lastSeen" column holding the file name (without the ".csv" extension). Concatenate the per-file DataFrames with pd.concat to build a master DataFrame.
- Use pd.to_datetime to convert the file names into dates.
- groupby "id" and keep only the rows where the date column is at its per-id maximum.
path = os.path.join(".", "folder")
files = [f for f in os.listdir(path) if f.endswith(".csv")]

frames = []
for file in files:
    # read only the first column ("id") of each monthly file
    temp = pd.read_csv(os.path.join(path, file), usecols=[0])
    # tag every row with the file it came from, e.g. "01_2021"
    temp["lastSeen"] = file.replace(".csv", "")
    frames.append(temp)
master = pd.concat(frames, ignore_index=True)

# parse the "MM_YYYY" file names into real dates so they compare chronologically
master["date"] = pd.to_datetime(master["lastSeen"], format="%m_%Y")

# keep, for each id, only the row(s) whose date equals that id's maximum date
output = master[master["date"] == master.groupby("id")["date"].transform("max")].drop("date", axis=1)
>>> output
   id lastSeen
0   1  01_2021
2   2  02_2021
3   3  02_2021
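If you then want to produce the finalFile.csv from your expected format, a minimal follow-up (assuming the output DataFrame built above; sorting by id is optional) could be:

# optional: order rows by id, then write the result without the pandas index
output.sort_values("id").to_csv("finalFile.csv", index=False)

An equivalent way to keep only the newest row per id, instead of the transform("max") mask, is master.sort_values("date").drop_duplicates("id", keep="last").drop("date", axis=1); both select the same rows as long as each id appears at most once per monthly file.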