I'm reading many excel files. Excel files have this format:
CODE STATION TMAX_SC
0 000659 ACOBAMBA -999.0
1 000659 ACOBAMBA -999.0
2 000659 ACOBAMBA -999.0
3 000659 ACOBAMBA 15.8
4 000659 ACOBAMBA 18.4
... ... ...
20570 000659 ACOBAMBA 19.2
20571 000659 ACOBAMBA 19.2
20572 000659 ACOBAMBA 18.0
20573 000659 ACOBAMBA 16.9
20574 000659 ACOBAMBA 16.0
Some dfs have one unique value in CODE column and some dfs have 2 values or maybe more.
I want to get a list with the dfs that have 2 or more values in CODE column. So i'm doing this code:
path = "/path/to/file/"
files = os.listdir(path)
files_xls = [os.path.join(path, f) for f in files if f[-3:] == 'xls']
dfs = pd.DataFrame()
stations = pd.DataFrame()
for i in files_xls:
data = pd.read_excel(i, 'Sheet 1',converters={'CODE':str})
#print(i,data['CODE'])
duplicates = {'FILE':[i],
'CODE':[data['CODE'].drop_duplicates()]}
duplicates=pd.DataFrame(duplicates)
stations=stations.append(duplicates)
dfs = dfs.append(data)
But i got this error AttributeError: 'tuple' object has no attribute 'append'
How can i get a list with the dfs that have 2 or more values in CODE column?
Thanks in advance.
CodePudding user response:
You can try something like: df.groupby(['CODE']).size().reset_index(name="Count")
This should give you counts of each code in a new column. Then you can filter count column for >2.
CodePudding user response:
According to the comment you need A list of the file names when there are more than 2 different CODE values, maybe with the number of different values , I believe you can use nunique
and do:
l = [] # for the result
for i in files_xls:
# get directly the number of unique values in CODE
n = (pd.read_excel(i, 'Sheet 1', converters={'CODE':str},
usecols=['CODE'], # just need this columns
squeeze=True) # to create a Series instead of a dataframe
.nunique() # get the number of unique values in the Series CODE column
)
if n>1: #if more than a unique CODE in the file
l.append([i, n]) # append the file name and the number to a list
print(l)
# or print(pd.DataFrame(l, columns=['file_name', 'Nunique_CODE']))