The Apache Rya rya.mapreduce project contains a set of classes facilitating the use of
Accumulo-backed Rya as the input source or output destination of Hadoop
MapReduce jobs.
RyaStatementWritable wraps a statement in a WritableComparable object, so triples can be used as keys or values in MapReduce tasks. Statements are considered equal if they contain equivalent triples and equivalent Accumulo metadata (visibility, timestamp, etc.).
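The equality rule above can be illustrated with a simplified stand-in. The sketch below does not use the Rya or Hadoop classes: `StatementKey` and its fields are hypothetical names chosen for illustration, and only visibility and timestamp are modeled out of the Accumulo metadata.

```java
import java.util.Objects;

/**
 * Hypothetical stand-in for illustration only; this is NOT RyaStatementWritable.
 * It models the rule that equality covers the triple AND the Accumulo metadata.
 */
final class StatementKey {
    final String subject;
    final String predicate;
    final String object;
    final String visibility; // assumed metadata field
    final long timestamp;    // assumed metadata field

    StatementKey(String subject, String predicate, String object,
                 String visibility, long timestamp) {
        this.subject = subject;
        this.predicate = predicate;
        this.object = object;
        this.visibility = visibility;
        this.timestamp = timestamp;
    }

    @Override
    public boolean equals(Object other) {
        if (this == other) {
            return true;
        }
        if (!(other instanceof StatementKey)) {
            return false;
        }
        StatementKey that = (StatementKey) other;
        // Both the triple and the metadata must match.
        return subject.equals(that.subject)
                && predicate.equals(that.predicate)
                && object.equals(that.object)
                && Objects.equals(visibility, that.visibility)
                && timestamp == that.timestamp;
    }

    @Override
    public int hashCode() {
        return Objects.hash(subject, predicate, object, visibility, timestamp);
    }

    public static void main(String[] args) {
        StatementKey a = new StatementKey("urn:a", "urn:knows", "urn:b", "PUBLIC", 1L);
        StatementKey b = new StatementKey("urn:a", "urn:knows", "urn:b", "PUBLIC", 1L);
        StatementKey c = new StatementKey("urn:a", "urn:knows", "urn:b", "SECRET", 1L);
        System.out.println(a.equals(b)); // prints true: same triple, same metadata
        System.out.println(a.equals(c)); // prints false: same triple, different visibility
    }
}
```

Under this rule, two statements that share a triple but differ in visibility or timestamp are not interchangeable.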
Input formats are provided for reading triple data from Rya or from RDF files:
- RdfFileInputFormat will read and parse RDF files of any format; the format must be explicitly specified. Reading and parsing are done asynchronously, so large input files can be handled, subject to how much information the RDF4J parser itself needs to hold in memory in order to parse the file. (For example, large N-Triples files can be handled easily, but large XML files might require you to allocate more memory for the Map task.) Handles multiple files if given a directory as input, as long as all files are in the specified format. Files will only be split if the format is set to N-Triples or N-Quads; otherwise, the number of input files will be the number of splits. Output pairs are <LongWritable, RyaStatementWritable>, where the former is the number of the statement within the input split and the latter is the statement itself.
- RyaInputFormat will read statements directly from a Rya table in Accumulo. It extends Accumulo's AbstractInputFormat and uses that class's configuration methods to configure the connection to Accumulo. The table scanned should be one of the Rya core tables (spo, po, or osp), and whichever is used should be specified explicitly, so the input format can deserialize the statements correctly. The choice of table may influence parallelization if the tables are split differently in Accumulo. (The number of splits in Accumulo will be the number of input splits in Hadoop and therefore the number of Mappers.) Output pairs are <Text, RyaStatementWritable>, where the former is the Accumulo row ID and the latter is the statement itself.
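The splitting rule described for RdfFileInputFormat can be sketched as a small model. This is not the input format's actual code: `SplitRule`, `isSplittable`, `countSplits`, the format label strings, and the ceiling-division arithmetic are all assumptions made for illustration. Real Hadoop split sizing has more detail than this.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical model of the splitting rule; not part of Rya or Hadoop. */
final class SplitRule {
    // The two formats the text names as splittable. The strings are illustrative labels.
    private static final Set<String> SPLITTABLE =
            new HashSet<>(Arrays.asList("N-Triples", "N-Quads"));

    static boolean isSplittable(String formatName) {
        return SPLITTABLE.contains(formatName);
    }

    /**
     * Estimates the number of input splits for a set of files.
     * Splittable formats: one split per splitSize bytes (rounded up), at least one per file.
     * Other formats: exactly one split per file.
     */
    static int countSplits(String formatName, long[] fileSizes, long splitSize) {
        int splits = 0;
        for (long size : fileSizes) {
            if (isSplittable(formatName)) {
                // Ceiling division; an empty file is assumed to count as one split.
                splits += (int) Math.max(1L, (size + splitSize - 1) / splitSize);
            } else {
                splits += 1;
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        long[] sizes = {300L, 100L}; // arbitrary sample sizes
        System.out.println(countSplits("N-Triples", sizes, 128L)); // prints 4
        System.out.println(countSplits("RDF/XML", sizes, 128L));   // prints 2
    }
}
```

In this model, a job reading a few very large files in a non-splittable format gets only as many splits as there are files, which is worth keeping in mind when choosing an input format for bulk loads.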
An output format is provided for writing triple data to Rya:
- RyaOutputFormat will insert statements into the Rya core tables and/or any
configured secondary indexers. Configuration options include:
  - Table prefix: identifies the Rya instance
  - Default visibility: any statement without a visibility set will be written with this visibility
  - Default context: any statement without a context (named graph) set will be written with this context
  - Enable freetext index, geo index, temporal index, entity index, and core
    tables: separate options for configuring exactly which indexers to use.
    If using secondary indexers, consider providing configuration variables
    "sc.freetext.predicates", "sc.geo.predicates", and "sc.temporal.predicates"
    as appropriate; otherwise each indexer will attempt to index every
    statement.
  Expects input pairs <Writable, RyaStatementWritable>. Keys are ignored and values are written to Rya.
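The default visibility and default context options can be pictured with a small model. The sketch below is an illustration under stated assumptions, not RyaOutputFormat's implementation: `OutputDefaults`, `SimpleStatement`, `applyDefaults`, and the sample values are hypothetical, and "not set" is assumed to mean `null`.

```java
/** Hypothetical model of default visibility/context handling; not part of Rya. */
final class OutputDefaults {

    /** Minimal statement: a triple plus optional visibility and context. */
    static final class SimpleStatement {
        final String triple;
        final String visibility; // null is assumed to mean "not set"
        final String context;    // null is assumed to mean "not set"

        SimpleStatement(String triple, String visibility, String context) {
            this.triple = triple;
            this.visibility = visibility;
            this.context = context;
        }
    }

    /** Fills in the defaults only where the statement has no value of its own. */
    static SimpleStatement applyDefaults(SimpleStatement s,
                                         String defaultVisibility,
                                         String defaultContext) {
        String visibility = (s.visibility == null) ? defaultVisibility : s.visibility;
        String context = (s.context == null) ? defaultContext : s.context;
        return new SimpleStatement(s.triple, visibility, context);
    }

    public static void main(String[] args) {
        SimpleStatement bare = new SimpleStatement("<urn:a> <urn:p> <urn:b>", null, null);
        SimpleStatement labeled = new SimpleStatement("<urn:c> <urn:p> <urn:d>", "SECRET", null);

        SimpleStatement out1 = applyDefaults(bare, "PUBLIC", "urn:default-graph");
        SimpleStatement out2 = applyDefaults(labeled, "PUBLIC", "urn:default-graph");

        System.out.println(out1.visibility + " " + out1.context); // prints PUBLIC urn:default-graph
        System.out.println(out2.visibility + " " + out2.context); // prints SECRET urn:default-graph
    }
}
```

The point of the model is that defaults never override a value already present on a statement; they only fill gaps.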
MRUtils defines constant configuration parameter names used for passing Accumulo connection information, Rya prefix and table layout, RDF format, etc., as well as some convenience methods for getting and setting these values with respect to a given Configuration.
AbstractAccumuloMRTool can be used as a base class for Rya MapReduce Tools
using the ToolRunner API. It extracts necessary parameters from the
configuration and provides methods for setting input and/or output formats and
configuring them accordingly. To use, extend this class and implement run.
In the run method, call init to extract and validate configuration values from
the Hadoop Configuration. Then use the setup*(Input/Output) methods as needed to
configure input and output for MapReduce jobs using the stored parameters.
(Input and output formats can then be configured directly, if necessary.)
Expects parameters to be specified in the configuration using the names defined
in MRUtils or, for secondary indexers, the indexer-specific configuration names.
See the examples subpackage for examples of how to use the interface; a separate
subpackage contains some individual MapReduce applications.