Home > other >  How do I pull the index(es) and column(s) of a specific value from a dataframe?
How do I pull the index(es) and column(s) of a specific value from a dataframe?

Time:10-21

---Hello, everyone! New student of Python's Pandas here.

I have a dataframe I artificially constructed here: https://i.stack.imgur.com/cWgiB.png. Below is a text reconstruction.

df_dict = {
    'header0' : [55,12,13,14,15],
    'header1' : [21,22,23,24,25],
    'header2' : [31,32,55,34,35],
    'header3' : [41,42,43,44,45],
    'header4' : [51,52,53,54,33]
}
index_list = {
    0:'index0',
    1:'index1',
    2:'index2',
    3:'index3',
    4:'index4'
}
df = pd.DataFrame(df_dict).rename(index = index_list)

GOAL:

I want to pull the index row(s) and column header(s) of any ARBITRARY value(s) (int, float, str, etc.). So for eg, if I want the values of 55, this code will return: header0, index0, header2, index2 in some format. They could be list or tuple or print, etc.

CLARIFICATIONS:

  • Imagine the dataframe is of a large enough size that I cannot "just find it manually"
  • I do not know how large this value is in comparison to other values (so a "simple .idxmax()" probably won't cut it)
  • I do not know where this value is column or index wise (so "just .loc,.iloc where the value is" won't help either)
  • I do not know whether this value has duplicates or not, but if it does, return all its column/indexes.

WHAT I'VE TRIED SO FAR:

I've played around with .columns, .index, .loc, but just can't seem to get the answer. The farthest I've gotten is creating a boolean dataframe with df.values == 55 or df == 55, but cannot seem to do anything with it.

Another "farthest" way I've gotten is using df.unstack.idxmax(), which would return a tuple of the column and header, but has 2 major problems:

  1. Only returns the max/min as per the .idxmax(), .idxmin() functions
  2. Only returns the FIRST column/index matching my value, which doesn't help if there are duplicates

I know I could do a for loop to iterate through the entire dataframe, tracking which column and index I am on in temporary variables. Once I hit the value I am looking for, I'll break and return the current column and index. Was just hoping there was a less brute-force-y method out there, since I'd like a "high-speed calculation" method that would work on any dataframe of any size.

Thanks.

EDIT: Added text database, clarified questions.

CodePudding user response:

Use np.where:

r, c = np.where(df == 55)    
list(zip(df.index[r], df.columns[c]))

Output:

[('index0', 'header0'), ('index2', 'header2')]

CodePudding user response:

There is a function in pandas that gives duplicate rows.

duplicate = df[df.duplicated()]
print(duplicate)

CodePudding user response:

Use DataFrame.unstack for Series with MultiIndex and then filter duplicates by Series.duplicated with keep=False:

s = df.unstack()
out = s[s.duplicated(keep=False)].index.tolist()

If need also duplicates with values:

df1 = (s[s.duplicated(keep=False)]
        .sort_values()
        .rename_axis(index='idx', columns='cols')
        .reset_index(name='val'))

If need tet specific value change mask for Series.eq (==):

s = df.unstack()
out = s[s.eq(55)].index.tolist()

CodePudding user response:

So, in the code below, there is an iteration. However, it doesn't iterate over the whole DataFrame, but it just iterates over the columns, and then use .any() to check if there is any of the desierd value. Then using loc feature in the pandas it locates the value, and finally returns the index.

wanted_value = 55

for col in list(df.columns):

    if df[col].eq(wanted_value).any() == True:
        print("row:", *list(df.loc[df[col].eq(wanted_value)].index), ' col', col)
  • Related