Selecting a Minimal Subset of Columns that Maximizes the Combined Number of Row Attributes in Python-CodePudding

I have a Pandas dataframe that is structured like the following:

      Report1 Report2 Report3 Report4 Report5 ...
=================================================
Att1       1       1       1    
Att2       1               1       1
Att3               1       1
Att4               1               1       1
...

I am trying to select which reports are most useful while also covering all the Attributes. I was thinking about using some distance calculation to pick reports with the "most value" (i.e., reports with the most unique attributes/most significant).

For example, in the above results, I would probably want Report2 and Report3 selected since together they contain all the attributes.

Are there any good distance or optimization functions I could call to perform an analysis like this?

My actual dataset has 1500 attributes and I have nearly 60 reports. I want to select a minimal subset of the reports that maximizes the combined number of attributes.

CodePudding user response：

There's ~ 2^60 (nC1 nC2 ... nCn) possible combinations here.

I think a useful thing to do might be to determine what kind of problem this falls into (e.g. P / NP / NP hard). I don't know much about the classification but should be a start if you want to limit your search scope.

The first thing I would look at is the report with the highest number of attributes. Suppose you have 1500 attrs and a report has 90% (i.e. 1350) of all attrs, then the worst case scenario you would only need a combination of 150 reports. This would instantly drop the number of possible solutions to 2^150 (which is still greater than 2^60). So apply this rule again, with say the first 2 reports which covers 99% of the attrs, this would drop the number of possible solutions to 2^15, which is a very manageable number even for python.

reference: https://www.vedantu.com/question-answer/evaluate-the-sum-of-the-series-nc1 nc2 nc3 ncn-class-11-maths-cbse-5f51ce96aae1253931571f0f

https://en.wikipedia.org/wiki/NP-hardness

CodePudding user response：

If I'm understanding this correctly, you want to find the column(s) with the most nonzero elements? I think something like this might work for you depending on the underlying data

df.astype(bool).sum(axis=0).sort_values()

You can get the attributes you want then by looking at the sorted list