Hadoop connector

Parent document: Connectors

Main function

The Hadoop connector reads HDFS files in batch scenarios. Its main features are:

  • Reading files from multiple HDFS directories at the same time
  • Reading HDFS files in multiple formats (JSON and CSV)

Maven dependency

<dependency>
   <groupId>com.bytedance.bitsail</groupId>
   <artifactId>bitsail-connector-hadoop</artifactId>
   <version>${revision}</version>
</dependency>

Supported data types

  • Basic data types supported by Hadoop connectors:
    • Integer type:
      • short
      • int
      • long
      • biginteger
    • Float type:
      • float
      • double
      • bigdecimal
    • Time type:
      • timestamp
      • date
      • time
    • String type:
      • string
    • Bool type:
      • boolean
    • Binary type:
      • binary
  • Composite data types supported by Hadoop connectors:
    • map
    • list

Parameters

Add the following parameters to the job.reader block of the job configuration, for example:

{
  "job": {
    "reader": {
      "path_list": "hdfs://test_path/test.csv"
    }
  }
}

Necessary parameters

| Param name | Required | Optional value | Description |
|---|---|---|---|
| class | Yes | | Class name of the Hadoop connector: com.bytedance.bitsail.connector.hadoop.source.HadoopInputFormat |
| path_list | Yes | | Path of the file to read. Multiple paths can be specified, separated by ',' |
| content_type | Yes | JSON, CSV | Format of the file to read. For details, refer to Supported format |
| columns | Yes | | Names and types of the fields |
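
As an illustration, the four required parameters can be combined into one reader block. This is a minimal sketch, not a tested configuration: the HDFS paths, the column names, and the column types are hypothetical, and writing columns as a list of name/type objects is an assumption based on the description above. The path_list value shows two paths separated by ','.

```json
{
  "job": {
    "reader": {
      "class": "com.bytedance.bitsail.connector.hadoop.source.HadoopInputFormat",
      "path_list": "hdfs://test_path/dir_a/test.json,hdfs://test_path/dir_b/test.json",
      "content_type": "JSON",
      "columns": [
        { "name": "id", "type": "long" },
        { "name": "name", "type": "string" }
      ]
    }
  }
}
```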

Optional parameters

| Param name | Required | Optional value | Description |
|---|---|---|---|
| hadoop_conf | No | | Hadoop read configuration, given as a standard JSON format string |
| reader_parallelism_num | No | | Reader parallelism |
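
The optional parameters go into the same job.reader block. The sketch below assumes that "standard json format string" means a JSON object serialized into a string value, which is why the inner quotes are escaped. The Hadoop property dfs.client.use.datanode.hostname and the parallelism value 2 are only illustrative choices, not recommended settings.

```json
{
  "job": {
    "reader": {
      "hadoop_conf": "{\"dfs.client.use.datanode.hostname\":\"true\"}",
      "reader_parallelism_num": 2
    }
  }
}
```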

Supported format

The following formats are supported:

JSON

Text files in JSON format are supported. Each line must be a standard JSON string.
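
For example, an input file with one JSON object per line could look like the following. The field names id and name are hypothetical and would need to match the entries declared in columns.

```json
{"id": 1, "name": "alice"}
{"id": 2, "name": "bob"}
```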

The following parameters are supported to adjust the JSON parsing style:

| Parameter name | Default value | Description |
|---|---|---|
| job.common.case_insensitive | true | Whether to be insensitive to the case of the keys in the JSON fields |
| job.common.json_serializer_features | | Mode used when 'FastJsonUtil' parses the data. The format is a ',' separated string, for example "QuoteFieldNames,UseSingleQuotes" |
| job.common.convert_error_column_as_null | false | Whether to set a field that fails to parse to null |
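
A possible way to set these options is sketched below. It assumes that the dotted parameter names map to a nested job.common block, in the same way the reader parameters sit under job.reader. The values are illustrative: convert_error_column_as_null is switched from its default (false) to true, and the serializer features are the ones quoted in the table.

```json
{
  "job": {
    "common": {
      "case_insensitive": true,
      "json_serializer_features": "QuoteFieldNames,UseSingleQuotes",
      "convert_error_column_as_null": true
    }
  }
}
```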

CSV

Text files in CSV format are supported. Each line must be a standard CSV string.

The following parameters are supported to adjust the csv parsing style:

| Parameter name | Default value | Description |
|---|---|---|
| job.common.csv_delimiter | ',' | CSV delimiter |
| job.common.csv_escape | | Escape character |
| job.common.csv_quote | | Quote character |
| job.common.csv_with_null_string | | Conversion value for null fields. Not converted by default |
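
The CSV options could be set in the same way. This sketch again assumes a nested job.common block. The chosen characters are illustrative only: a semicolon as delimiter, a double quote as quote character, and a backslash as escape character (both escaped as JSON requires). csv_with_null_string is left out, so the default of no conversion applies.

```json
{
  "job": {
    "common": {
      "csv_delimiter": ";",
      "csv_quote": "\"",
      "csv_escape": "\\"
    }
  }
}
```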

Related documents

Configuration examples: Hadoop connector example