-
Notifications
You must be signed in to change notification settings - Fork 4
/
Copy pathdesign_decisions.txt
59 lines (42 loc) · 2.33 KB
/
design_decisions.txt
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
Design decisions for ARKcat:
licensing:
I don't know how to approach this. We know we want it to be as non-restrictive as possible.
apache - the important part is that companies can use it
if we have dependencies in python, something like pip will automatically download them.
check out the licensec for scikit learn and our RNN package
This seems overly complicated, but it looks to me like we can use any of the standard open source licenses (MIT/BSD, Apache2.0, GPL3): http://dodcio.defense.gov/OpenSourceSoftwareFAQ.aspx
datasets:
focus: redistribution issues.
should always use the standard version of the dataset - i.e. the bills dataset from tai. official version is on webpage.
should probably be in json format
we can put the datasets here on cab: /cab1/corpora/bayes_opt (just remember to set the permissions to read / write for everyone with chmod +777)
libraries (this overlaps a bit with language):
theano: deeplearning.net
theano, scikit learn are in python.
google's new deep learning library? http://googleresearch.blogspot.com/2015/11/tensorflow-googles-latest-machine_9.html
what functionality do we want?
Train / dev / test:
We should start by requiring train / dev / test, then implement cross validation.
Featurization:
Should be able to calculate features as necessary at test time
In general, should ignore the existence of test data when deciding upon features
But, should allow users to (optionally) make use of unlabeled data for this (possibly including test data)
Need to think of a good way to speak about features (both in words and in code)
Models our system should train:
logistic regression
RNN
future inclusions:
LSTMs,SVMs
Hyperparams:
will fall out when we choose models
L1, L2, elastic net
number of hidden units in RNNs
ideas:
having a distance metric between datasets would allow us to choose datasets that are close to one another for warm starting.
from the beginning: have multi-label support. each example can have more than one label
keeping a classifier:
when saving to disk, we should keep all relevant bits that would go into training the model:
how many examples it was trained on
how many iterations of bayes opt it went through
what kind of model it is
the feature function that maps from the actual data to the feature vectors