Home > database >  How many groups have at least 1 non-zero value?
How many groups have at least 1 non-zero value?

Time:09-01

I want to plot bar graph from the dataframe below.

df2 = pd.DataFrame({'URL': ['A','A','B','B','C','C'],
                    'X': [5,0,7,1,0,0],
                    'Y': [4,0,4,7,9,0],
                    'Z':[11,0,8,4,0,0]})


   URL  X   Y   Z
0   A   5   4   11
1   A   0   0   0
2   B   7   4   8
3   B   1   7   4
4   C   0   9   0
5   C   0   0   0

The URL counts should be on the y-axis, and X, Y, Z categories should be on the x-axis with two bars for each.

I have done the first bar (code at the end): it will show the number of non-zero values in each column.

I need help with the second bar:

it should show how many (duplicate) URLs have at least 1 non-zero corresponding X, Y or Z value.

For example:

  • A comes two times in the URL, and in the X-column we have one non-zero value - so it will count as 1.

  • In the case of B: both values are non-zero, so we will count B as 1 as well.

  • But in the case of C: since both values are 0 in X we will not count it as one.

The same goes for Y and Z.


I have managed to draw a bar graph for the first case, but for the second one I'm unable to:

(df2.melt("URL")
    .groupby("variable")
    .agg(Keywords_count = ("value", lambda x: sum(x != 0)),
         dup = ("URL", "nunique"))
    .plot(kind="bar") )

plt.show()

CodePudding user response:

Use .apply() instead of .agg().

.apply can take the whole dataframe for each group (and thus can combine information from multiple columns), while .agg operates only on one column at a time.

def f(df):
    n_non_zero_vals = (df["value"] != 0).sum()
    # # If only duplicated URLs should be considered:
    # df = df[df['URL'].duplicated(keep=False)]
    # # Alternative:
    # df = df.groupby("URL").filter(lambda x: len(x) > 1)
    n_urls = df.groupby("URL")["value"].agg("any").sum()
    return pd.Series([n_non_zero_vals, n_urls],
                     index=['Keywords_count', 'dup'])

(df2.melt("URL")
    .groupby("variable")
    .apply(f)
    .plot(kind="bar") )

plt.show()
  • Related