Loading data
Ondřej Moravčík edited this page Jun 12, 2015 · 7 revisions
Every RDD is serialized by the same serializer, which is defined by the spark.ruby.serializer*
options. If you want a custom serializer for a particular RDD, you can build one.
- All serializers can be found at RubyDoc.info
# First way
marshal1 = Spark::Serializer::Marshal.new
compressed1 = Spark::Serializer::Compressed.new(marshal1)
serializer = Spark::Serializer::AutoBatched.new(compressed1)
# Second way
serializer = Spark::Serializer.build { auto_batched(compressed(marshal)) }
# Third way
serializer = Spark::Serializer.build("auto_batched(compressed(marshal))")
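To illustrate what nesting serializers such as auto_batched(compressed(marshal)) amounts to, here is a minimal sketch built only on Ruby's standard library. The classes `MarshalLayer`, `CompressedLayer`, and `BatchedLayer` are hypothetical stand-ins and not ruby-spark's implementation; in particular, the fixed batch size is an assumption made for the example, whereas an auto-batched serializer would presumably pick the size itself.

```ruby
require 'zlib'

# Hypothetical stand-ins for illustration only; these are NOT ruby-spark classes.

# Innermost layer: turns a Ruby object into bytes with Marshal.
class MarshalLayer
  def dump(obj)
    Marshal.dump(obj)
  end

  def load(bytes)
    Marshal.load(bytes)
  end
end

# Middle layer: compresses whatever the wrapped layer produces.
class CompressedLayer
  def initialize(inner)
    @inner = inner
  end

  def dump(obj)
    Zlib::Deflate.deflate(@inner.dump(obj))
  end

  def load(bytes)
    @inner.load(Zlib::Inflate.inflate(bytes))
  end
end

# Outer layer: groups items into batches and serializes each batch as one payload.
# The fixed batch size is an assumption of this sketch.
class BatchedLayer
  def initialize(inner, batch_size = 2)
    @inner = inner
    @batch_size = batch_size
  end

  def dump_all(items)
    items.each_slice(@batch_size).map { |batch| @inner.dump(batch) }
  end

  def load_all(payloads)
    payloads.flat_map { |bytes| @inner.load(bytes) }
  end
end

# Compose the layers from the inside out, mirroring the nesting order.
layered = BatchedLayer.new(CompressedLayer.new(MarshalLayer.new), 2)
payloads = layered.dump_all(1..5)
restored = layered.load_all(payloads)
```

Each layer only knows about the one it wraps, which is why the same composition can be written as nested constructor calls or as a nested expression.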
Data can be loaded from a single file.
rdd = sc.text_file(FILE, workers_num, serializer=nil)
From all files in a directory.
rdd = sc.whole_text_files(DIRECTORY, workers_num, serializer=nil)
Directly. The data must be iterable, and the chosen serializer must be able to serialize it.
rdd = sc.parallelize(data, workers_num, serializer=nil)
rdd = sc.parallelize([1,2,3,4,5], workers_num, serializer=nil)
rdd = sc.parallelize(1..5, workers_num, serializer=nil)
- workers_num: Minimum number of workers computing this task. (This value can be overwritten by Spark.)
- serializer: Custom serializer.
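The requirement that directly loaded data be iterable and serializable can be checked before handing it to Spark. The sketch below assumes a Marshal-based serializer; `loadable?` is a hypothetical helper written for this example and is not part of ruby-spark.

```ruby
# Hypothetical helper, not part of ruby-spark.
# Returns true when data can be iterated and every item can be dumped by Marshal.
def loadable?(data)
  return false unless data.respond_to?(:each)

  data.each { |item| Marshal.dump(item) }
  true
rescue TypeError
  # Marshal raises TypeError for objects it cannot dump, such as Procs.
  false
end

array_ok  = loadable?([1, 2, 3, 4, 5])
range_ok  = loadable?(1..5)
lambda_ok = loadable?([-> { 1 }])  # a Proc cannot be marshalled
scalar_ok = loadable?(42)          # an Integer is not iterable
```

A different serializer may accept or reject different objects, so a check like this is only meaningful for the serializer it mirrors.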