-
Notifications
You must be signed in to change notification settings - Fork 2
thwessling/ML-Hadoop-Planet
Folders and files
Name | Name | Last commit message | Last commit date | |
---|---|---|---|---|
Repository files navigation
Massively parallelized growing of random decision trees for machine learning using Hadoop. ************************* Building ************************* Build the code with the following command: > ant jar This should compile all classes and create the jar file planet.jar. ************************* Running training ************************* 1) Run the init.sh script to copy instances to the appropriate hdfs location 2) Remove any _local_ output directories (planet_train_out.*) 3) Start training with > hadoop jar planet.jar 4) The resulting model will be written to tree_model_local.ser ************************* Testing the model ************************* 1) Start testing the produced model with the command > ant ClassifierTest For testing, data/test.txt is used ************************************ Data format ************************************ 1) The description of instance files is expected at data/features.txt. This file contains one line for each feature, specifying the name and the feature range. Name and value range are separated by a colon. For ordered attributes, the feature range can be input as MIN_VALUE-MAX-VALUE. For unordered attributes, each possible value needs to be specified and values are separated by a semicolon. Note that only numeric values are supported and that the feature range for ordered attributes a step size of 1.0 for values (i.e. only integer values are supported). Any input data set needs to be preprocessed and normalized to follow this notation. 2) The instance file itself consists of one instance per line, while features are separated by a comma. The last line is used as the class label.