# Haderp

## WordCount

#### Source: hadoop-mapreduce-examples-2.5.0-cdh5.3.0-sources

#### Deps:

2002_sou.txt

#### Run:

Put the file from Deps into an hdfs folder called wordcount_in

hdfs dfs -fs "hdfs://namenode" -mkdir wordcount_in
hdfs dfs -fs "hdfs://namenode" -put 2002_sou.txt wordcount_in/2002_sou.txt

Run WordCount

hadoop jar experimental-jobs-1.0.jar job.WordCount -fs "hdfs://namenode" -D hadoop.job.ugi=peteyoung wordcount_in wordcount_out
# or
HADOOP_CLASSPATH=~/src/haderp/build/classes/main hadoop job.WordCount -fs "hdfs://namenode" -D hadoop.job.ugi=peteyoung wordcount_in wordcount_out
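
To spot-check a run, list the output directory and peek at the first reducer file (part-r-00000 is the default reducer output file name):

hdfs dfs -fs "hdfs://namenode" -ls wordcount_out
hdfs dfs -fs "hdfs://namenode" -cat wordcount_out/part-r-00000 | head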

Run WordCount2

hadoop jar experimental-jobs-1.0.jar job.WordCount2 -fs "hdfs://namenode" -D hadoop.job.ugi=peteyoung wordcount_in wordcount2_out
# or
HADOOP_CLASSPATH=~/src/haderp/build/classes/main hadoop job.WordCount2 -fs "hdfs://namenode" -D hadoop.job.ugi=peteyoung wordcount_in wordcount2_out

## Patent Jobs

#### Source: Hadoop in Action 2nd Ed., Listing 4.1 and Section 4.3

#### Deps:

curl -O http://www.nber.org/patents/acite75_99.zip
curl -O http://www.nber.org/patents/apat63_99.zip

#### Setup:

Put unzipped files from Deps into hdfs folders

hdfs dfs -fs "hdfs://namenode" -mkdir patent_in
stdbuf -i0 -o0 -e0 unzip -p apat63_99.zip | hdfs dfs -fs "hdfs://namenode" -put - patent_in/apat63_99.txt

hdfs dfs -fs "hdfs://namenode" -mkdir patent_cite_in
stdbuf -i0 -o0 -e0 unzip -p acite75_99.zip | hdfs dfs -fs "hdfs://namenode" -put - patent_cite_in/acite75_99.txt
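
Before running the jobs, it's worth confirming that both input directories are populated:

hdfs dfs -fs "hdfs://namenode" -ls patent_in patent_cite_in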

#### Run:

PatentCiters

hadoop jar experimental-jobs-1.0.jar job.PatentCiters -fs "hdfs://namenode" -D hadoop.job.ugi=peteyoung patent_cite_in patent_cite_out
# or
HADOOP_CLASSPATH=~/src/haderp/build/classes/main hadoop job.PatentCiters -fs "hdfs://namenode" -D hadoop.job.ugi=peteyoung patent_cite_in patent_cite_out

PatentCitationCount

hadoop jar experimental-jobs-1.0.jar job.PatentCitationCount -fs "hdfs://namenode" -D hadoop.job.ugi=peteyoung patent_cite_in patent_count_out
# or
HADOOP_CLASSPATH=~/src/haderp/build/classes/main hadoop job.PatentCitationCount -fs "hdfs://namenode" -D hadoop.job.ugi=peteyoung patent_cite_in patent_count_out

## Set up the dev environment

Since we're building against hadoop version 2.5.0, pull down the supported version of Java from Oracle. If you don't have an account at Oracle, you'll need to create one first.

  1. Log into Oracle's site and download a JDK 7 installer
  2. Install the Java JDK from the downloaded package file
  3. Add the following to your .bashrc
export JAVA_HOME=$(/usr/libexec/java_home -v 1.7)
  4. sudo mkdir /opt/gradle
  5. cd /opt/gradle
  6. Download gradle-2.2.1-all.zip from gradle.org to /opt/gradle
  7. sudo unzip gradle-2.2.1-all.zip
  8. ln -s gradle-2.2.1/ current
  9. Add the following to your .bashrc
export GRADLE_HOME=/opt/gradle/current
PATH=$PATH:$GRADLE_HOME/bin
export PATH
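
As a quick sanity check that both tools are wired up (in a new shell, or after sourcing .bashrc):

java -version    # should report a 1.7.x JDK
gradle -v        # should report Gradle 2.2.1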

## Set up a local hadoop instance

You'll need this to be able to run jobs against the cluster. It will give you the hadoop and hdfs commands at the command line.

  1. Download the following tar files from cloudera
curl -O http://archive.cloudera.com/cdh5/cdh/5/avro-1.7.6-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/crunch-0.11.0-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/datafu-1.1.0-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/flume-ng-1.5.0-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/hadoop-2.5.0-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/hbase-0.98.6-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/hbase-solr-1.5-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/hive-0.13.1-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/hue-3.7.0-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/kite-0.15.0-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/llama-1.0.0-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/mahout-0.9-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/oozie-4.0.0-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/parquet-1.5.0-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/parquet-format-2.1.0-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/pig-0.12.0-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/search-1.0.0-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/sentry-1.4.0-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/solr-4.4.0-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/spark-1.2.0-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/sqoop-1.4.5-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/sqoop2-1.99.4-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/whirr-0.9.0-cdh5.3.1.tar
curl -O http://archive.cloudera.com/cdh5/cdh/5/zookeeper-3.4.5-cdh5.3.1.tar
  2. Extract the cloudera tars into /opt/
sudo mkdir -p /opt/cloudera/cdh5.3.1
sudo sh -c "ls *.tar | xargs -I{} tar xvf {} -C /opt/cloudera/cdh5.3.1"
  3. Set up links
sudo -i
mkdir -p /opt/cloudera/current
ln -s /opt/cloudera/cdh5.3.1/hadoop-2.5.0-cdh5.3.1 /opt/cloudera/current/hadoop
ln -s /opt/cloudera/cdh5.3.1/hbase-0.98.6-cdh5.3.1 /opt/cloudera/current/hbase
ln -s /opt/cloudera/cdh5.3.1/hive-0.13.1-cdh5.3.1 /opt/cloudera/current/hive
ln -s /opt/cloudera/cdh5.3.1/zookeeper-3.4.5-cdh5.3.1 /opt/cloudera/current/zookeeper
ls -l /opt/cloudera/current/
  4. Add the following to your .bashrc
export HADOOP_HOME="/opt/cloudera/current/hadoop"
export HBASE_HOME="/opt/cloudera/current/hbase"
export HIVE_HOME="/opt/cloudera/current/hive"
export HCAT_HOME="/opt/cloudera/current/hive/hcatalog"
export PATH=${PATH}:${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${HBASE_HOME}/bin:${HIVE_HOME}/bin:${HCAT_HOME}/bin
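
With those on the PATH, confirm the client works end to end:

hadoop version                          # should report 2.5.0-cdh5.3.1
hdfs dfs -fs "hdfs://namenode" -ls /    # lists the cluster's root directory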

## Building the jobs jar

There is a gradle build file that will pull down the hadoop dependencies and build the job classes and the jar file. Just cd into the local copy of this repo and run `gradle build`. You can clean the project with `gradle clean`.
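
Assuming the default gradle java-plugin layout (consistent with the build/classes/main paths used elsewhere in this README), the assembled jar lands under build/libs:

cd ~/src/haderp
gradle build
ls build/libs/experimental-jobs-1.0.jar    # default java-plugin jar output location
gradle clean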

## Running jobs from the jar

The jobs in the jar file inherit from Configured and Tool, which lets you specify the NameNode with the command-line switch -fs. You can also specify the user with a -D property, e.g. -D hadoop.job.ugi=<user>.

hadoop jar experimental-jobs-1.0.jar job.PatentCiters -fs "hdfs://namenode" -D hadoop.job.ugi=peteyoung inputDir outputDir
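
These options are handled by Hadoop's GenericOptionsParser (which ToolRunner applies before the job sees its arguments), so any standard configuration property can be passed the same way. For example, to also override the reducer count (mapreduce.job.reduces is a stock Hadoop 2.x property, shown purely as an illustration):

hadoop jar experimental-jobs-1.0.jar job.PatentCiters \
  -fs "hdfs://namenode" \
  -D hadoop.job.ugi=peteyoung \
  -D mapreduce.job.reduces=4 \
  inputDir outputDir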

## Running jobs from a class file

The class files will be located in build/classes/main under the project directory. The job classes live in the job package, so point HADOOP_CLASSPATH at the package root (build/classes/main), not at the job directory itself.

  1. cd into the build/classes/main directory
  2. Run the job like so:
HADOOP_CLASSPATH=. hadoop job.PatentCiters -fs "hdfs://namenode:8020" -D hadoop.job.ugi=peteyoung inputDir outputDir

## GHCN Weather Data
