TODO: 3-way parameters and 3-way factored parameters - does anything else change as a result?

TODO: gaussian units (sigma=1)

TODO: truncated exponential units

TODO: some glue code to stack RBMs into DBNs.
- sampling from a DBN
- DBN training (layer by layer)
- DBN inference

TODO: do autoencoders also fit within this framework? Maybe some things would have to be renamed then...

TODO: github?

TODO: optimised 1D convolution

TODO: test 1D convolution

TODO: try rebuilding that FCRBM model for modeling motion style!
Also interesting here: parameter tying! Can this be fitted into the framework?

TODO: gaussian units with learnt variance (separately and jointly)

TODO: symmetric beta units


- idea for later: also add a 'chunking' framework that can convert data into large chunks for efficient training on the GPU? + incorporate snapshotting into it as well? That would make for a fairly complete and clean framework.

* TODO: write some nice latex documentation that explains all the different components and how they fit together, and especially which parts of the training/sampling algorithms they implement.


Give the statscollector a list of Units objects that represent the VISIBLES, i.e. all the units that represent the data and that will therefore initially take on values.

Then determine, based on the Parameters objects, which types of units need to be sampled (which units represent the HIDDENS); see the sketch after this list.

- these need not necessarily be all remaining units, I think.
- first we determine all Parameters objects that affect the visible units, and then we look for all OTHER units that are also affected by these parameters.
- if the HIDDENS contain units that are also among the VISIBLES, something is wrong: that would mean there is a dependency between the visibles. Throw an exception!

- if there are Parameters that connect HIDDENS types to each other, something is also wrong: that would mean there is a dependency between the hiddens. Throw an exception! (e.g. aWb, bVc and cUa is impossible: we can never sample b and c based on a, because they are not independent of each other)

- a case where the hiddens are not 'all remaining units' is when there are 2 types of visibles that should not be sampled simultaneously, but do share the same HIDDENS. This could be a kind of parameter sharing, for example.
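
A rough sketch of this detection logic, assuming each Parameters object exposes a units_list attribute and that Units objects are hashable (both assumptions, nothing here is fixed yet):

def determine_hiddens(visibles, all_parameters):
    # visibles: the Units objects representing the data
    visibles = set(visibles)
    # all Parameters objects that affect at least one visible unit
    relevant = [p for p in all_parameters if visibles & set(p.units_list)]
    hiddens = set()
    for p in relevant:
        others = set(p.units_list) - visibles
        if not others and len(set(p.units_list)) > 1:
            # a Parameters object connecting two visibles: the visibles
            # would not be independent of each other
            raise Exception("dependency between visibles")
        hiddens.update(others)  # all OTHER units affected by p
    for p in relevant:
        if len(hiddens & set(p.units_list)) > 1:
            # a Parameters object connecting two hiddens (e.g. aWb, bVc, cUa)
            raise Exception("dependency between hiddens")
    return hiddens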


TRAINING

split into multiple phases (see the sketch after this list):
- collect statistics, given input data and the model
* for CD-k, this is input visibles, sampled hiddens, and then visibles and hiddens after k steps
* for PCD, the negative term is different
- use the statistics to compute the weight updates
* for all types of CD, this is just getting the update form from the Parameters object and filling in the blanks.
- update the weights according to some other hyperparameterised settings (this is where momentum, weight decay etc. come in)
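
A rough sketch of these three phases as one training step (every name here is provisional, not part of the design yet):

def train_step(rbm, data, updaters, learning_rate=0.001):
    # phase 1: collect statistics, given the input data and the model
    stats = dict((u, u.stats_collector.collect(data)) for u in updaters)
    # phase 2: use the statistics to compute the weight updates
    deltas = dict((u, u.get_update(stats[u])) for u in updaters)
    # phase 3: apply the updates; momentum, weight decay etc. hook in here
    for u in updaters:
        u.parameters.value += learning_rate * deltas[u]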

abstract the training algorithm (see the sketch below):
- gradient descent with stopping criterion
- gradient descent with fixed number of epochs
- gradient descent ad infinitum
- are there others?
- this then simply calls the paramupdaters
what about monitoring costs?
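
A minimal sketch covering all three variants plus monitoring, built on the hypothetical train_step above:

def train(rbm, data, updaters, max_epochs=None, stop=None, monitor=None):
    # max_epochs=None and stop=None together mean 'ad infinitum'
    epoch = 0
    while max_epochs is None or epoch < max_epochs:
        train_step(rbm, data, updaters)
        epoch += 1
        if monitor is not None:
            print(epoch, monitor(rbm))  # monitoring cost, e.g. reconstruction error
        if stop is not None and stop(epoch):
            break  # stopping criterion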


So let's build a hierarchy.

ParamUpdater: composite object (possibly consisting of multiple other ParamUpdaters that are somehow combined; the typical use case is that the updates are summed) that updates parameters given some input data

DecayParamUpdater: provides a decay term

MomentumParamUpdater: encapsulates another ParamUpdater and applies momentum to it

CDParamUpdater: encapsulates CD-k or PCD weight update computation
* takes the input data, computes statistics by calling a StatsCollector on it
* gets the form of the update term from the Parameters object, fills in the blanks with the stats from the StatsCollector
* returns the resulting update

SparsityParamUpdater: encapsulates sparsity target regularisation

SumParamUpdater: sums the updates obtained from its child paramupdaters. It should check that the composing ParamUpdaters are for the same Parameters!

ScaledParamUpdater: takes a ParamUpdater and a scale factor, and scales the updates by this factor. This will be the result of writing '0.005 * ParamUpdater()', for example (see the sketch below).
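
A sketch of the base class and the two combinators, including the + and * overloads proposed in the TO IMPLEMENT list below (all names provisional):

class ParamUpdater(object):
    def __init__(self, parameters):
        self.parameters = parameters
    def get_update(self):
        raise NotImplementedError
    def __add__(self, other):
        return SumParamUpdater([self, other])
    def __mul__(self, scale):
        return ScaledParamUpdater(self, scale)
    __rmul__ = __mul__  # so that '0.005 * ParamUpdater()' works too

class ScaledParamUpdater(ParamUpdater):
    def __init__(self, pu, scale):
        super(ScaledParamUpdater, self).__init__(pu.parameters)
        self.pu, self.scale = pu, scale
    def get_update(self):
        return self.scale * self.pu.get_update()

class SumParamUpdater(ParamUpdater):
    def __init__(self, pus):
        # all children must update the same Parameters object
        assert len(set(pu.parameters for pu in pus)) == 1
        super(SumParamUpdater, self).__init__(pus[0].parameters)
        self.pus = pus
    def get_update(self):
        return sum(pu.get_update() for pu in self.pus)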


StatsCollector: object that collects the statistics for CD, given input

CDStatsCollector(k) < StatsCollector: collects statistics for CD-k

PCDStatsCollector < StatsCollector: collects statistics for PCD

!! since only the negative term differs between CDStatsCollector and PCDStatsCollector, maybe some overlapping code can be factored out here.
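
One way that factoring could look, assuming the RBM provides sample_hiddens/sample_visibles helpers (an assumption; only the negative term differs between the subclasses):

class StatsCollector(object):
    def __init__(self, rbm):
        self.rbm = rbm
    def collect(self, v0):
        # the positive term is identical for CD-k and PCD
        h0 = self.rbm.sample_hiddens(v0)
        vk, hk = self.negative_term(v0, h0)
        self.stats = {'v0': v0, 'h0': h0, 'vk': vk, 'hk': hk}
        return self.stats
    def negative_term(self, v0, h0):
        raise NotImplementedError

class CDkStatsCollector(StatsCollector):
    def __init__(self, rbm, k):
        super(CDkStatsCollector, self).__init__(rbm)
        self.k = k
    def negative_term(self, v0, h0):
        v, h = v0, h0
        for _ in range(self.k):  # k Gibbs steps, starting from the data
            v = self.rbm.sample_visibles(h)
            h = self.rbm.sample_hiddens(v)
        return v, h

class PCDStatsCollector(StatsCollector):
    chain_v = None
    def negative_term(self, v0, h0):
        if self.chain_v is None:
            self.chain_v = v0  # initialise the persistent chain once
        # continue the persistent chain instead of restarting from the data
        h = self.rbm.sample_hiddens(self.chain_v)
        self.chain_v = self.rbm.sample_visibles(h)
        return self.chain_v, self.rbm.sample_hiddens(self.chain_v)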

- learning rate: should this be a ParamUpdater, or should it be kept outside? it's nice to be able to encapsulate this...

TO IMPLEMENT:
X base class ParamUpdater(Parameters p, [StatsCollector s])
- subclasses:
X DecayParamUpdater(Parameters p)
* MomentumParamUpdater(ParamUpdater pu)
X CDParamUpdater(Parameters p, StatsCollector s)
* SparsityTargetParamUpdater(Parameters p, StatsCollector s, target) # this also needs stats!
* VarianceTargetParamUpdater(Parameters p, target) # maybe, as an exercise, since it doesn't really work anyway
X SumParamUpdater([ParamUpdater p])
X Maybe overload + on ParamUpdater
X also maybe overload some_float * ParamUpdater, with __rmul__ (also implement __mul__). That's a nice way to do regularisation parameters and learning rates (then they don't have to be handled inside the ParamUpdater object itself)

- base class StatsCollector
- subclasses:
X CDkStatsCollector(k)
* PCDStatsCollector


PROBLEM: multiple Parameters should be updated using the same collected stats, so the statscollector should only be run once per group of parameters. How can this be guaranteed if each ParamUpdater updates only a single Parameters object?
Does it make sense instead to let a ParamUpdater update a group of Parameters objects?

SOLUTION: maybe split the update process into a number of phases (see the sketch after this list):

- first, all ParamUpdaters (which each update ONLY ONE Parameters object) are inspected, and the StatsCollector objects they use are extracted: ParamUpdater.get_collectors() or something like that
- then iterate:
* collect stats: run the extracted StatsCollectors (StatsCollector.collect() or something)
* run ParamUpdaters
- each ParamUpdater calls StatsCollector.get_stats() or get_current_stats() or something to that effect.
* That way, each type of stats collection happens only once and all params are updated based on the same stats. We just need to make sure that they each hold a reference to the same StatsCollector object.
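
A sketch of that phased update loop (get_collectors, get_stats and apply_update are the provisional names from above):

def update_all(param_updaters, data):
    # phase 1: inspect the updaters and extract their StatsCollectors;
    # a set deduplicates collectors shared between updaters
    collectors = set()
    for pu in param_updaters:
        collectors.update(pu.get_collectors())
    # phase 2a: run each distinct collector exactly once
    for c in collectors:
        c.collect(data)
    # phase 2b: every updater reads the same cached stats via get_stats()
    for pu in param_updaters:
        pu.apply_update()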


MODULAR RBM

v = a(f(W, h) + g(vbias))
h = a'(f'(W, v) + g'(hbias))

ActivationFunction
Sampler
Units

(unit data u + activation functions a(x) + samplers) = Units

Parameters(list<Units> units)
has
- list<Units>: list of units that are related by the parameters
- a set of actual parameters (weights)
provides
- contribution to the activation of each of the Units
- contribution to the energy function
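
As a sketch, such a Parameters interface could look as follows (method names are assumptions):

class Parameters(object):
    def __init__(self, units_list, W):
        self.units_list = units_list  # the Units related by these parameters
        self.W = W                    # the actual parameter values (weights)
    def activation_term_for(self, units, vmap):
        # contribution to the linear activation of the given Units,
        # e.g. dot(h, W.T) for the visibles of a plain RBM
        raise NotImplementedError
    def energy_term(self, vmap):
        # contribution to the energy function, e.g. -v.W.h for plain weights
        raise NotImplementedError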

RBM
has
- list<Units>: a set of different types of units
- list<Parameters>: a set of different types of parameters that relate the Units
provides
- sampling a type of units (computing the nonlinear activation and sampling)
- computing the energy function (summing the contributions of each of the Parameters)
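
The RBM itself is then little more than a container that delegates to its Units and Parameters (a sketch, same provisional names):

class RBM(object):
    def __init__(self, units_list, params_list):
        self.units_list = units_list    # the different types of units
        self.params_list = params_list  # the Parameters relating the Units
    def params_affecting(self, units):
        return [p for p in self.params_list if units in p.units_list]
    def energy(self, vmap):
        # sum the energy contributions of each of the Parameters
        return sum(p.energy_term(vmap) for p in self.params_list)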

get unit values (this goes for all unit types); see the sketch below:
- compute the linear activation
x = sum_i(f(W_i, u_i))
- apply the activation function
a(x)
- sample
u ~ a(x)
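
These three steps as a sketch of a Units class (the RBM-awareness mentioned further down is assumed here):

class Units(object):
    def __init__(self, rbm, activation_function, sampler):
        self.rbm = rbm  # Units are aware of the RBM they are part of
        self.activation_function = activation_function
        self.sampler = sampler
    def sample(self, vmap):
        # 1. linear activation: sum the contributions of all Parameters
        x = sum(p.activation_term_for(self, vmap)
                for p in self.rbm.params_affecting(self))
        # 2. apply the activation function
        a = self.activation_function(x)
        # 3. draw a sample (bernoulli, softmax, identity for mean field, ...)
        return self.sampler(a)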

Computing a linear activation consists of summing all the contributions of the different Parameters.

A Sampler can just be the identity function (mean field, e.g. to get truncated exponential units), a bernoulli sampler (most common), a softmax sampler, a max-pooling sampler, ...

Base classes: ActivationFunction, Sampler, Units, Parameters

Ideally, there would be no clear distinction between visibles and hiddens in the lowest layers of abstraction - allowing more advanced models like the 3-way factored RBM to be implemented. The visibles-hiddens distinction is a useful abstraction for many types of RBMs however, so it should be implemented on top (maybe have an RBM class and a vhRBM subclass).

Units, ActivationFunction and Sampler are AWARE of the RBM they are part of - most of the logic is concentrated in the subclasses. While this does bring in some dependencies that could technically be avoided, it should lead to a cleaner architecture. It's also nicer to be able to do rbm.units['v'].sample(h_data), for example.


Implement a system where different types of units, weights and sampling can be combined easily. If it's performant that's a plus, but it doesn't have to be (it's mainly for experimentation). This will make it easier to construct heterogeneous models like the TransRBM or (different types of) the ChromaRBM.
- determine the elementary operations that can be performed on an RBM, with their inputs and outputs:
* sample hiddens
* sample visibles
* compute activation

- determine the axes of variation, i.e. what can be done differently to create a new type of model:
* the way the activations are computed
* the way samples are drawn
* the types of parameters (weights, biases, convolutional, ...)
* the learning algorithm itself (CD, persistent CD, ...)

- it would be good if certain hyperparameters (like the learning rate for different sets of trainable parameters) could be set independently.

elementary RBM operations:
- train using CD-n, given unlabeled visibles data
- sample visibles from hiddens
- sample hiddens from visibles

-------------

W + 0.001 * (conv(v, sigmoid(conv(v, W) + bh)) - conv(sigmoid(conv + bv), sigmoid(conv + bh)))


# conv input 1 (inputs) is (mb_size, input_maps, input height [numrows], input width [numcolumns])
# conv input 2 (filters) is (output_maps, input_maps, filter height [numrows], filter width [numcolumns])
# conv output is (mb_size, output_maps, output height [numrows], output width [numcolumns])

self.W = W # (hidden_maps, visible_maps, filter_height, filter_width)
self.vu = units_list[0] # (mb_size, visible_maps, visible_height, visible_width)
self.hu = units_list[1] # (mb_size, hidden_maps, hidden_height, hidden_width)

hd = h0.dimshuffle(1, 0, 2, 3)
conv(v0, hd)

v0: (mb_size, visible_maps, visible_height, visible_width)
h0: (mb_size, hidden_maps, visible_height - filter_height + 1, visible_width - filter_width + 1)

hd: (hidden_maps, mb_size, visible_height - filter_height + 1, visible_width - filter_width + 1)

conv INPUT 1 (inputs):
mb_size = mb_size
input_maps = visible_maps
input_height = visible_height
input_width = visible_width

conv INPUT 2 (filters):
output_maps = hidden_maps
input_maps = mb_size
filter_height = visible_height - filter_height + 1
filter_width = visible_width - filter_width + 1

conv OUTPUT (outputs):
mb_size = mb_size
output_maps = hidden_maps
output_height = filter_height
output_width = filter_width

DESIRED OUTPUT: (hidden_maps, visible_maps, filter_height, filter_width)

-------------

this doesn't work: the input_maps of the two conv inputs don't match (visible_maps vs. mb_size), so dimshuffle v0 as well:
v0: (mb_size, visible_maps, visible_height, visible_width)
h0: (mb_size, hidden_maps, visible_height - filter_height + 1, visible_width - filter_width + 1)

vd: (visible_maps, mb_size, visible_height, visible_width)
hd: (hidden_maps, mb_size, visible_height - filter_height + 1, visible_width - filter_width + 1)

conv INPUT 1 (inputs): dimshuffle(1, 0, 2, 3)
mb_size = visible_maps
input_maps = mb_size
input_height = visible_height
input_width = visible_width

conv INPUT 2 (filters): dimshuffle(1, 0, 2, 3)
output_maps = hidden_maps
input_maps = mb_size
filter_height = visible_height - filter_height + 1
filter_width = visible_width - filter_width + 1

conv OUTPUT (outputs):
mb_size = visible_maps
output_maps = hidden_maps
output_height = filter_height
output_width = filter_width

and then dimshuffle the output (1, 0, 2, 3) as well, and we're there!

DESIRED OUTPUT: (hidden_maps, visible_maps, filter_height, filter_width)
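
Putting the derivation together as a Theano sketch (assuming T.nnet.conv2d in 'valid' mode; note that Theano's conv2d flips its filters, which these notes gloss over):

import theano.tensor as T

def conv_weight_stats(v0, h0):
    # v0: (mb_size, visible_maps, visible_height, visible_width)
    # h0: (mb_size, hidden_maps, vh - fh + 1, vw - fw + 1)
    vd = v0.dimshuffle(1, 0, 2, 3)  # (visible_maps, mb_size, vh, vw)
    hd = h0.dimshuffle(1, 0, 2, 3)  # (hidden_maps, mb_size, hh, hw)
    out = T.nnet.conv2d(vd, hd, border_mode='valid')
    # out: (visible_maps, hidden_maps, filter_height, filter_width)
    return out.dimshuffle(1, 0, 2, 3)  # (hidden_maps, visible_maps, fh, fw)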

-------------
from morb.base import activation_function

import theano.tensor as T


@activation_function
def sigmoid(x):
    return T.nnet.sigmoid(x)

@activation_function
def identity(x):
    return x

@activation_function
def softmax(x):
    # expected input dimensions:
    # 0 = minibatches
    # 1 = units
    # 2 = states

    r = x.reshape((x.shape[0]*x.shape[1], x.shape[2]))
    # r 0 = minibatches * units
    # r 1 = states

    # this is the expected input for T.nnet.softmax
    s = T.nnet.softmax(r)

    # reshape back to original shape
    return s.reshape(x.shape)
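
To sanity-check the reshape trick (independent of the activation_function decorator, whose exact behaviour isn't shown here):

import numpy as np
import theano
import theano.tensor as T

x = T.tensor3('x')  # (minibatches, units, states)
r = x.reshape((x.shape[0] * x.shape[1], x.shape[2]))
s = T.nnet.softmax(r).reshape(x.shape)
f = theano.function([x], s)

out = f(np.random.randn(2, 3, 4).astype(theano.config.floatX))
assert np.allclose(out.sum(axis=2), 1.0)  # each (minibatch, unit) sums to 1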