Home > Enterprise >  How to get unique values in df columns?
How to get unique values in df columns?

Time:09-27

I'm reading many excel files. Excel files have this format:

       CODE    STATION    TMAX_SC
0      000659  ACOBAMBA   -999.0
1      000659  ACOBAMBA   -999.0
2      000659  ACOBAMBA   -999.0
3      000659  ACOBAMBA     15.8
4      000659  ACOBAMBA     18.4
      ...       ...      ...
20570  000659  ACOBAMBA     19.2
20571  000659  ACOBAMBA     19.2
20572  000659  ACOBAMBA     18.0
20573  000659  ACOBAMBA     16.9
20574  000659  ACOBAMBA     16.0

Some dfs have one unique value in CODE column and some dfs have 2 values or maybe more.

I want to get a list with the dfs that have 2 or more values in CODE column. So i'm doing this code:

path = "/path/to/file/"
files = os.listdir(path)

files_xls = [os.path.join(path, f) for f in files if f[-3:] == 'xls']

dfs = pd.DataFrame()
stations = pd.DataFrame()

for i in files_xls:
    data = pd.read_excel(i, 'Sheet 1',converters={'CODE':str})
    #print(i,data['CODE'])
    duplicates = {'FILE':[i],
        'CODE':[data['CODE'].drop_duplicates()]}
    duplicates=pd.DataFrame(duplicates)
    stations=stations.append(duplicates)
    dfs = dfs.append(data)

But i got this error AttributeError: 'tuple' object has no attribute 'append'

How can i get a list with the dfs that have 2 or more values in CODE column?

Thanks in advance.

CodePudding user response:

You can try something like: df.groupby(['CODE']).size().reset_index(name="Count")

This should give you counts of each code in a new column. Then you can filter count column for >2.

CodePudding user response:

According to the comment you need A list of the file names when there are more than 2 different CODE values, maybe with the number of different values , I believe you can use nunique and do:

l = [] # for the result
for i in files_xls:
    # get directly the number of unique values in CODE
    n = (pd.read_excel(i, 'Sheet 1', converters={'CODE':str}, 
                       usecols=['CODE'], # just need this columns
                       squeeze=True) # to create a Series instead of a dataframe
         .nunique() # get the number of unique values in the Series CODE column
    )
    if n>1: #if more than a unique CODE in the file
        l.append([i, n]) # append the file name and the number to a list
print(l) 
# or print(pd.DataFrame(l, columns=['file_name', 'Nunique_CODE']))
  • Related