I have a Pandas dataframe that is structured like the following:
Report1 Report2 Report3 Report4 Report5 ...
=================================================
Att1 1 1 1
Att2 1 1 1
Att3 1 1
Att4 1 1 1
...
I am trying to select which reports are most useful while also covering all the Attributes. I was thinking about using some distance calculation to pick reports with the "most value" (i.e., reports with the most unique attributes/most significant).
For example, in the above results, I would probably want Report2
and Report3
selected since together they contain all the attributes.
Are there any good distance or optimization functions I could call to perform an analysis like this?
My actual dataset has 1500 attributes and I have nearly 60 reports. I want to select a minimal subset of the reports that maximizes the combined number of attributes.
CodePudding user response:
There's ~ 2^60 (nC1 nC2 ... nCn) possible combinations here.
I think a useful thing to do might be to determine what kind of problem this falls into (e.g. P / NP / NP hard). I don't know much about the classification but should be a start if you want to limit your search scope.
The first thing I would look at is the report with the highest number of attributes. Suppose you have 1500 attrs and a report has 90% (i.e. 1350) of all attrs, then the worst case scenario you would only need a combination of 150 reports. This would instantly drop the number of possible solutions to 2^150 (which is still greater than 2^60). So apply this rule again, with say the first 2 reports which covers 99% of the attrs, this would drop the number of possible solutions to 2^15, which is a very manageable number even for python.
https://en.wikipedia.org/wiki/NP-hardness
CodePudding user response:
If I'm understanding this correctly, you want to find the column(s) with the most nonzero elements? I think something like this might work for you depending on the underlying data
df.astype(bool).sum(axis=0).sort_values()
You can get the attributes you want then by looking at the sorted list