VM hadoop setup instructions

Instructions to set up a CERN VM with Hadoop and WMArchive

Request a VM from openstack.cern.ch

Install the admin, frontend, das and mongodb packages in the usual way we install software on a cmsweb VM [1]

Create the /data/wma area and copy over the contents of lxplus.cern.ch:~valya/workspace/wma/

Create install area

  mkdir -p /data/wma/usr/lib/python2.7/site-packages

Create a setup.sh file with the following content:

#!/bin/bash
# initialize the DAS deployment environment (MongoDB, Python 2.7, pymongo)
source /data/srv/current/apps/das/etc/profile.d/init.sh
export JAVA_HOME=/usr/lib/jvm/java
#export PATH=$PATH:$PWD/mongodb/bin
# pick up locally installed python packages (pydoop, avro, ...)
export PYTHONPATH=$PYTHONPATH:/data/wma/usr/lib/python2.7/site-packages

Set up your environment with source setup.sh; this will give you MongoDB, Python 2.7 and pymongo.
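
A quick sanity check of the environment; the exact versions depend on the DAS deployment, so treat this as a sketch:

  source setup.sh
  python --version
  python -c "import pymongo; print pymongo.version"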

Install pip [2] (optional)

  curl https://bootstrap.pypa.io/get-pip.py > get-pip.py
  python get-pip.py

Install Java on the VM

  sudo yum install java-1.8.0-openjdk-devel.x86_64
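
A quick check that Java is in place and that the JAVA_HOME path used in setup.sh resolves (on these VMs /usr/lib/jvm/java is normally a symlink to the default JDK):

  java -version
  ls -l /usr/lib/jvm/java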

Create /etc/yum.repos.d/cloudera.repo with the following content:

[cloudera]
gpgcheck=0
name=Cloudera
enabled=1
priority=15
baseurl=https://cern.ch/it-service-hadoop/yum/cloudera-cdh542
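
After creating the repo file it is worth refreshing the yum metadata and checking that the repository is visible; this is only a sanity check:

  sudo yum clean all
  yum repolist | grep -i cloudera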

Install hadoop

  sudo yum install hadoop-hdfs.x86_64 hadoop.x86_64 hive.noarch hadoop-libhdfs.x86_64
  sudo yum install hadoop-hdfs-namenode.x86_64 hadoop-hdfs-datanode.x86_64
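
Once the packages are installed, a quick check that the Hadoop binaries are on the PATH:

  hadoop version
  hdfs version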

Configure hadoop [3, 4]

 sudo cp -r /etc/hadoop/conf.empty /etc/hadoop/conf.my_cluster
 sudo alternatives --install /etc/hadoop/conf hadoop-conf /etc/hadoop/conf.my_cluster 50
 sudo alternatives --set hadoop-conf /etc/hadoop/conf.my_cluster
 ls /etc/hadoop/conf.my_cluster/
 sudo vim /etc/hadoop/conf.my_cluster/core-site.xml

Here is the relevant part you should have in core-site.xml:

    <configuration>
        <property>
            <name>fs.defaultFS</name>
            <value>hdfs://localhost:9000</value>
        </property>
    </configuration>
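
Before moving on, you can confirm that the alternatives switch took effect and that the fs.defaultFS setting is in place:

  alternatives --display hadoop-conf
  grep -A1 defaultFS /etc/hadoop/conf/core-site.xml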

Adjust hdfs-site.xml

 sudo vim /etc/hadoop/conf.my_cluster/hdfs-site.xml

Add the relevant part to hdfs-site.xml:

    <property>
       <name>dfs.namenode.name.dir</name>
       <value>file:///var/lib/hadoop-hdfs/cache/hdfs/dfs/name</value>
    </property>
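
The dfs.namenode.name.dir location must exist and be owned by the hdfs user before formatting; if the packages did not create it for you, something along these lines should do (paths as configured above):

  sudo mkdir -p /var/lib/hadoop-hdfs/cache/hdfs/dfs/name
  sudo chown -R hdfs:hdfs /var/lib/hadoop-hdfs/cache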

Format local HDFS

  sudo -u hdfs hdfs namenode -format

Start HDFS

  cd /etc/init.d/
  sudo service hadoop-hdfs-datanode start
  sudo service hadoop-hdfs-namenode start
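
To verify that both daemons came up, check their status and ask the namenode for a report; it should list one live datanode:

  sudo service hadoop-hdfs-namenode status
  sudo service hadoop-hdfs-datanode status
  sudo -u hdfs hdfs dfsadmin -report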

Create some areas on HDFS

  sudo -u hdfs hadoop fs -mkdir /tmp
  sudo -u hdfs hadoop fs -chmod -R 1777 /tmp
  sudo -u hdfs hadoop fs -mkdir /test
  sudo -u hdfs hadoop fs -chmod -R 1777 /test
  hadoop fs -ls /tmp
  # now we are ready to put anything into HDFS, e.g.
  hadoop fs -put local_file /tmp
  hadoop fs -ls /tmp
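
And to read the data back from HDFS (local_file being the example file uploaded above):

  hadoop fs -cat /tmp/local_file
  hadoop fs -get /tmp/local_file /tmp/local_file.copy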

Install pydoop

  cd /data/wma/soft/pydoop
  python setup.py install --prefix=/data/wma/usr
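
A minimal check that pydoop was installed and can talk to the local HDFS; this assumes the HDFS daemons from the previous steps are running and that pydoop picks up your Hadoop configuration:

  python -c "import pydoop.hdfs as hdfs; print hdfs.ls('/tmp')"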

Install avro

  cd /data/wma/soft/avro-1.7.7
  python setup.py install --prefix=/data/wma/usr
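
A short round-trip sketch to verify the avro installation; the schema and file name here are arbitrary:

python - <<'EOF'
import avro.schema
from avro.datafile import DataFileReader, DataFileWriter
from avro.io import DatumReader, DatumWriter

# toy schema with a single integer field
schema = avro.schema.parse('{"type":"record","name":"Test","fields":[{"name":"name","type":"int"}]}')

# write one record to a local avro file
writer = DataFileWriter(open('/tmp/test.avro', 'wb'), DatumWriter(), schema)
writer.append({'name': 1})
writer.close()

# read it back
reader = DataFileReader(open('/tmp/test.avro', 'rb'), DatumReader())
for rec in reader:
    print rec
reader.close()
EOF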

Fetch WMCore framework

  git clone [email protected]:dmwm/WMCore.git

Get WMArchive framework

  git clone [email protected]:dmwm/WMArchive.git

Remove DAS from the deploy area, otherwise it will be started as well

  rm /data/srv/enabled/das

Adjust the wmarch_config.py file to use your favorite storage, e.g.

  mkdir /data/wma/storage

and set ROOTDIR to /data/wma in wmarch_config.py
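
The relevant line in wmarch_config.py might then look like this (check the file itself for the exact variable name; this is an assumption based on the step above):

  ROOTDIR = '/data/wma'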

Start WMArchive service

  cd /data/wma
  ./run_wma.sh

Run a simple test with the fileio storage: post some data and retrieve it back

  curl -D /dev/stdout -X POST -H "Content-type: application/json" -d "{\"data\":{\"name\":1}}" http://localhost:8246/wmarchive/data/
  curl -D /dev/stdout -H "Content-type: application/json" http://localhost:8246/wmarchive/data/eed35faf3b73d58157aa53d097899e8d

Here are some commands to use

  # single document injection
  curl -D /dev/stdout -X POST -H "Content-type: application/json" -d "{\"data\":{\"name\":1}}" http://localhost:8246/wmarchive/data/

  # single document retrieval
  curl -D /dev/stdout -H "Content-type: application/json" http://localhost:8246/wmarchive/data/eed35faf3b73d58157aa53d097899e8d

  # multiple documents injection
  curl -D /dev/stdout -X POST -H "Content-type: application/json" -d "{\"data\":[{\"name\":1}, {\"name\":2}]}" http://localhost:8246/wmarchive/data/

  # multiple documents retrieval
  curl -D /dev/stdout -X POST -H "Content-type: application/json" -d "{\"query\":[\"eed35faf3b73d58157aa53d097899e8d\", \"bcee13403f554bc14f644ffdeaa93372\"]}" http://localhost:8246/wmarchive/data/

References

  1. https://cms-http-group.web.cern.ch/cms-http-group/tutorials/environ/vm-setup.html
  2. https://pip.pypa.io/en/stable/installing/
  3. https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-common/SingleCluster.html#Standalone_Operation
  4. http://www.cloudera.com/content/www/en-us/documentation/cdh/5-0-x/CDH5-Installation-Guide/cdh5ig_hdfs_cluster_deploy.html?scroll=topic_11_2