Introduction:
Our task in this assignment was to build a classification model based on Bayes' theorem. Bayes' theorem is nothing more than a way of expressing a probability of interest in terms of others. In this case, we are interested in a conditional probability that is not easy to compute directly: for instance, the probability of an earthquake given x. There haven't been enough earthquakes observed in conjunction with x to produce a confident statistic (and we hope there never are), so we use the isolated probabilities of earthquakes, of x, and of x given earthquakes instead. The catch is that this equality only holds when the things in x affect the probability of an earthquake independently of one another; hence the name "Naive Bayes Classifier," because it is naive to think the attributes of an instance affect its label independently. An example: a boy wants to skip school. Assume the probability of the boy skipping school increases if he tells his mom he is sick (as it should). Assume also that it increases if he tells his mom he would rather go to church. Doing either of these things on its own might get the boy out of school, but he would be naive to think his chances increase further by telling his mom both (that he cannot go to school because he is sick and would rather go to church).
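For reference, the theorem and the naive assumption can be written out as follows (standard textbook notation, added here for clarity rather than taken from the assignment handout):

P(y \mid x_1, \dots, x_n) = \frac{P(y)\, P(x_1, \dots, x_n \mid y)}{P(x_1, \dots, x_n)},
\qquad
P(x_1, \dots, x_n \mid y) \stackrel{\text{naive}}{=} \prod_{i=1}^{n} P(x_i \mid y)

The left equation is exact; the right one is the naive part, replacing the hard-to-estimate joint likelihood with a product of per-attribute likelihoods that can each be estimated from limited data.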
Implementation:
Python 3 was used to implement the classifier. Essentially, everything is encapsulated in the run() method and executed in a procedural manner. No custom classes were used. Instead, dictionaries within dictionaries act as tables (mean, variance, confusion matrix, etc.) throughout the program. Dictionaries were chosen as the data structure of choice due to their simplicity; normally, these tables all have the same rows and columns (attributes x labels). Other than the file names, there are not many adjustable parameters. As stated in the code, from line 88 onwards we assume that the labels are "+1" and "-1" so we can compute the confusion matrices. Instead of doing everything in a single loop, trying to achieve O(N) runtime, and exposing myself to unnecessary bugs, I implemented some redundant (but easier to debug) traversals. First the training dataset is built. Then the means and variances tables are computed (training). Next, predictions are made on the training set. The process is then repeated for the testing set, but without the training step. A condensed sketch of this flow follows.
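The sketch below is a minimal illustration of that flow, not the submitted NaiveBayes.py: it assumes Gaussian (mean/variance) likelihoods per attribute, as implied by the mean and variance tables above, and the "+1"/"-1" labels mentioned above. All helper names (load, train, predict, confusion) and the input file format are made up for this sketch.

import math
import sys
from collections import defaultdict

def load(path):
    # Assumed format: space-separated attribute values, then the label ("+1" or "-1").
    rows = []
    with open(path) as f:
        for line in f:
            parts = line.split()
            if parts:
                rows.append(([float(v) for v in parts[:-1]], parts[-1]))
    return rows

def train(rows):
    # Dictionaries within dictionaries act as the tables: means[label][attr], variances[label][attr].
    sums = defaultdict(lambda: defaultdict(float))
    sqs = defaultdict(lambda: defaultdict(float))
    counts = defaultdict(int)
    for attrs, label in rows:
        counts[label] += 1
        for i, v in enumerate(attrs):
            sums[label][i] += v
            sqs[label][i] += v * v
    means, variances = {}, {}
    for label, n in counts.items():
        means[label] = {i: s / n for i, s in sums[label].items()}
        # Floor the variance to avoid division by zero for constant attributes.
        variances[label] = {i: max(sqs[label][i] / n - means[label][i] ** 2, 1e-9)
                            for i in sums[label]}
    priors = {label: n / len(rows) for label, n in counts.items()}
    return priors, means, variances

def predict(attrs, priors, means, variances):
    # Naive assumption in log space: log P(y) + sum_i log P(x_i | y), Gaussian per attribute.
    best_label, best_score = None, float("-inf")
    for label in priors:
        score = math.log(priors[label])
        for i, v in enumerate(attrs):
            m, var = means[label][i], variances[label][i]
            score += -0.5 * math.log(2 * math.pi * var) - (v - m) ** 2 / (2 * var)
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def confusion(rows, priors, means, variances):
    # Assumes the labels are "+1" and "-1", as stated above.
    table = {"+1": {"+1": 0, "-1": 0}, "-1": {"+1": 0, "-1": 0}}
    for attrs, label in rows:
        table[label][predict(attrs, priors, means, variances)] += 1
    return table

def run():
    train_rows, test_rows = load(sys.argv[1]), load(sys.argv[2])
    priors, means, variances = train(train_rows)
    for rows in (train_rows, test_rows):
        t = confusion(rows, priors, means, variances)
        # Four cells per line; the cell ordering in the real program's output is unknown.
        print(t["+1"]["+1"], t["+1"]["-1"], t["-1"]["+1"], t["-1"]["-1"])

if __name__ == "__main__":
    run()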
Output:
user@host:~/dataMining/assignment3> python3 NaiveBayes.py breast_cancer.train breast_cancer.test
28 28 25 99
16 13 15 62
user@host:~/dataMining/assignment3> python3 NaiveBayes.py led.train led.test
468 170 319 1130
253 98 181 602
Thoughts:
As stated in the introduction, using a naive Bayes classifier comes with a strong assumption: the attributes affect the label independently. On the other hand, we developed a robust classifier, in under 200 lines of Python, without the need for a GPU cluster. That is the trade-off. Personally, this is my favorite kind of assignment: I get to code a lot, learn a lot, add more projects to GitHub, and so on.