Snowplow Collector #16
base: main
@@ -0,0 +1,19 @@
apiVersion: v1
name: snowplow-stream-collector
description: A Helm Chart to deploy the Snowplow Stream Collector project
version: 0.1.0
appVersion: "2.5.0"
icon: https://raw.githubusercontent.com/snowplow-devops/helm-charts/master/docs/logo/snowplow.png
home: https://github.com/snowplow-devops/helm-charts
sources:
  - https://github.com/snowplow-devops/helm-charts
  - https://github.com/snowplow/stream-collector
maintainers:
  - name: jbeemster
    url: https://github.com/jbeemster
    email: [email protected]
keywords:
  - snowplow
  - pipeline
  - collector
  - schemas
@@ -0,0 +1,246 @@
# snowplow-stream-collector

A Helm chart for [Snowplow Stream Collector](https://github.com/snowplow/stream-collector).

## Installing the Chart

Add the repository to Helm:

```bash
helm repo add snowplow-devops https://snowplow-devops.github.io/helm-charts
```

Install or upgrade the chart with the default configuration:

```bash
helm upgrade --install snowplow-stream-collector snowplow-devops/snowplow-stream-collector
```
## Uninstalling the Chart

To uninstall/delete the `snowplow-stream-collector` release:

```bash
helm delete snowplow-stream-collector
```
## Deployment options

The Collector is designed to run in the public cloud but can also be run in local distributions, and it supports a wide array of backends. This chart supports all of the available options.

First determine the target you want to send data to, then build a valid config for the Collector - you can [view examples here](https://github.com/snowplow/stream-collector/tree/master/examples). The default installation writes everything to stdout.
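Once you have built a config, one way to pass it to the chart is base64-encoded via `--set` (a sketch - the `my-collector.hocon` file name and the `kinesis` target are illustrative):

```bash
# Encode the config and deploy; tr strips the line wraps base64 adds on Linux
helm upgrade --install snowplow-stream-collector snowplow-devops/snowplow-stream-collector \
  --set service.config.hoconBase64=$(cat my-collector.hocon | base64 | tr -d '\n') \
  --set service.image.target=kinesis
```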
---
*WARNING*: It is recommended to use `port = ${COLLECTOR_PORT}` in your config, as the chart can then ensure the correct port is set in your configuration file. See the example configurations for how this looks.
---
#### Configure end-to-end TLS

Due to a known issue in Akka HTTP when handling TLS termination, we now embed an NGINX proxy as an _optional_ side-car to the Collector - this replaces terminating TLS directly in the Collector itself and is the safer alternative for the moment.

This also allows us to obfuscate the server being used for the Collector application itself.

To enable TLS you will need to (see the sketch after this list):

1. Set `service.nginx.deploy: true`
2. Set `service.ssl.enable: true`
3. Generate and pass both the certificate and the private key, base64-encoded, in the appropriate fields

The service will then be bound on the defined SSL port and the NGINX side-car will forward connections to the Collector container directly.
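A minimal sketch of the full TLS setup, assuming PEM files on disk (`cert.pem` and `key.pem` are illustrative names; the self-signed pair is for testing only):

```bash
# Generate a throwaway self-signed certificate (testing only)
openssl req -x509 -newkey rsa:2048 -nodes -days 365 \
  -keyout key.pem -out cert.pem -subj "/CN=collector.example.com"

# Deploy with the NGINX side-car terminating TLS
helm upgrade --install snowplow-stream-collector snowplow-devops/snowplow-stream-collector \
  --set service.nginx.deploy=true \
  --set service.ssl.enable=true \
  --set service.ssl.certificateBase64=$(cat cert.pem | base64 | tr -d '\n') \
  --set service.ssl.certificatePrivateKeyBase64=$(cat key.pem | base64 | tr -d '\n')
```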
### On-premise deployment

For fast testing, and for implementations where you do not need to integrate with public-cloud systems.

#### Target: Stdout

The simplest option is `stdout`, which sends all received events directly to the logging output:

```bash
helm upgrade --install snowplow-stream-collector .
```
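Events land directly in the pod logs in this mode, so you can verify the Collector is receiving data with something like the following (assuming the deployment is named after the release, as above):

```bash
kubectl logs deploy/snowplow-stream-collector -f
```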
#### Target: Kafka

To test out Kafka support you can spin up a local cluster and then pipe data into it. We are using the `bitnami` chart to simplify the deployment:

```bash
# Deploy a default Kafka cluster
helm upgrade --install kafka bitnami/kafka

# Deploy the Collector sending data to Kafka
helm upgrade --install snowplow-stream-collector \
  --set service.config.hoconBase64=$(cat examples/kafka.hocon | base64) \
  --set service.image.target=kafka .
```
You can then set up your own Kafka consumer to pull down the data from the created topics (`good` & `bad`).
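For a quick look at the data, one option is the console consumer bundled in the bitnami Kafka image (a sketch - the `kafka-0` pod name follows the bitnami chart's default naming):

```bash
kubectl exec -it kafka-0 -- \
  kafka-console-consumer.sh \
    --bootstrap-server localhost:9092 \
    --topic good \
    --from-beginning
```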
### GCP (GKE) settings

#### Network Endpoint Group binding

To manage the load balancer externally to the Kubernetes cluster you can bind the deployment to a dynamically assigned Network Endpoint Group (NEG).

1. Set the NEG name: `service.gcp.networkEndpointGroupName: <valid_value>`
2. This will create zonal NEGs in your account automatically (do not proceed until the NEGs appear - check your deployment events if this doesn't happen!)
3. Create a Load Balancer as usual and map the created NEGs into your backend service (follow the `Create Load Balancer` flow in the GCP Console)

_Note_: The HealthCheck you create should map to the same port you used for the service deployment.
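To confirm the zonal NEGs exist before creating the load balancer, you can list them with the gcloud CLI (illustrative - substitute the NEG name you configured):

```bash
gcloud compute network-endpoint-groups list --filter="name=<valid_value>"
```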
#### Target: PubSub

To send data into PubSub you will need to bind a valid GCP service account to the service deployment. In Terraform this looks something like the following:

```hcl
resource "google_service_account" "snowplow_stream_collector" {
  account_id   = "snowplow-stream-collector"
  display_name = "Snowplow Stream Collector service account"
}

resource "google_service_account_iam_binding" "snowplow_stream_collector_sa_wiu_binding" {
  role = "roles/iam.workloadIdentityUser"
  members = [
    "serviceAccount:<project>.svc.id.goog[default/snowplow-stream-collector]"
  ]
  service_account_id = google_service_account.snowplow_stream_collector.id
}

resource "google_project_iam_member" "snowplow_stream_collector_sa_pubsub_viewer" {
  role   = "roles/pubsub.viewer"
  member = "serviceAccount:${google_service_account.snowplow_stream_collector.email}"
}

resource "google_project_iam_member" "snowplow_stream_collector_sa_pubsub_publisher" {
  role   = "roles/pubsub.publisher"
  member = "serviceAccount:${google_service_account.snowplow_stream_collector.email}"
}

output "snowplow_stream_collector_sa_account_email" {
  value = google_service_account.snowplow_stream_collector.email
}
```
You can then use the resulting value as an input to `serviceAccount.gcp.serviceAccount`, which will allow the deployment to access PubSub.

You will need to fill these targeted fields (see the install sketch after this list):

- `cloud: "gcp"`
- `serviceAccount.deploy: true`
- `serviceAccount.gcp.serviceAccount: <output_value_from_above>`
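Putting it together, a PubSub deployment might look like this (a sketch - the `pubsub.hocon` file name is illustrative and should be built from the upstream examples):

```bash
helm upgrade --install snowplow-stream-collector snowplow-devops/snowplow-stream-collector \
  --set cloud=gcp \
  --set service.image.target=pubsub \
  --set service.config.hoconBase64=$(cat pubsub.hocon | base64 | tr -d '\n') \
  --set serviceAccount.deploy=true \
  --set serviceAccount.gcp.serviceAccount=<output_value_from_above>
```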
### AWS (EKS) settings

#### TargetGroup binding

To manage the load balancer externally to the Kubernetes cluster you can bind the deployment to an existing TargetGroup ARN. It's important that the TargetGroup exists ahead of time and that you use the same port as you have used in your `values.yaml`.

_Note_: Before this will work you will need to install the `aws-load-balancer-controller-crds` and `aws-load-balancer-controller` charts into your EKS cluster.
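One plausible way to install those prerequisites, assuming both charts are published in the same snowplow-devops repository added earlier (verify the chart names and required values before relying on this):

```bash
helm upgrade --install aws-load-balancer-controller-crds snowplow-devops/aws-load-balancer-controller-crds
helm upgrade --install aws-load-balancer-controller snowplow-devops/aws-load-balancer-controller \
  --set clusterName=<your_eks_cluster_name>
```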
You will need to fill these targeted fields (see the install sketch after this list):

- `cloud: "aws"`
- `service.aws.targetGroupARN: "<target_group_arn>"`
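For example (a sketch - the ARN placeholder stands in for your pre-created TargetGroup):

```bash
helm upgrade --install snowplow-stream-collector snowplow-devops/snowplow-stream-collector \
  --set cloud=aws \
  --set service.aws.targetGroupARN="<target_group_arn>"
```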
#### Target: Kinesis and/or SQS

To send data into Kinesis and/or SQS without hardcoded credentials you will need to bind a valid AWS IAM role ARN to the service deployment. In Terraform this looks something like the following:

```hcl
resource "aws_iam_policy" "snowplow_stream_collector" {
  name = "snowplow-stream-collector"

  policy = <<EOF
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "kinesis:DescribeStream",
        "kinesis:DescribeStreamSummary",
        "kinesis:List*",
        "kinesis:Put*",
        "sqs:GetQueueUrl",
        "sqs:SendMessage",
        "sqs:ListQueues"
      ],
      "Resource": "*",
      "Effect": "Allow"
    }
  ]
}
EOF
}

module "snowplow_stream_collector" {
  source  = "terraform-aws-modules/iam/aws//modules/iam-assumable-role-with-oidc"
  version = "~> 4.7.0"

  create_role = true

  role_name = "snowplow-stream-collector"

  provider_url = var.eks_cluster_url

  role_policy_arns = [
    aws_iam_policy.snowplow_stream_collector.arn
  ]
  number_of_role_policy_arns = 1

  oidc_fully_qualified_subjects = [
    "system:serviceaccount:default:snowplow-stream-collector"
  ]
}

output "snowplow_stream_collector_role_arn" {
  value = module.snowplow_stream_collector.iam_role_arn
}
```

> _Review comment_ (on `"Resource": "*"`): I would suggest mentioning that best security policy is to restrict the resource down to just what you need. For example, if your kinesis streams were called […]
>
> _Reply_: Yeah, that's a fair point - I have done this in the OS Terraform Modules - will replicate it here.

You can then use the resulting value as an input to `serviceAccount.aws.roleARN`, which will allow the deployment to access Kinesis and SQS.

You will need to fill these targeted fields (see the install sketch after this list):

- `cloud: "aws"`
- `serviceAccount.deploy: true`
- `serviceAccount.aws.roleARN: <output_value_from_above>`
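For example, targeting Kinesis (a sketch - the `kinesis.hocon` file name is illustrative and should be built from the upstream examples):

```bash
helm upgrade --install snowplow-stream-collector snowplow-devops/snowplow-stream-collector \
  --set cloud=aws \
  --set service.image.target=kinesis \
  --set service.config.hoconBase64=$(cat kinesis.hocon | base64 | tr -d '\n') \
  --set serviceAccount.deploy=true \
  --set serviceAccount.aws.roleARN=<output_value_from_above>
```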
## Configuration

| Key | Type | Default | Description |
|-----|------|---------|-------------|
| cloud | string | `""` | Cloud specific bindings (options: aws, gcp) |
| secrets.docker.email | string | `""` | Email address for user of the private repository |
| secrets.docker.name | string | `"dockerhub"` | Name of the secret to use for the private repository |
| secrets.docker.password | string | `""` | Password for the private repository |
| secrets.docker.server | string | `"https://index.docker.io/v1/"` | Repository server URL |
| secrets.docker.username | string | `""` | Username for the private repository |
| service.aws.targetGroupARN | string | `""` | EC2 TargetGroup ARN to bind the service onto |
| service.config.hoconBase64 | string | `""` | Base64-encoded HOCON config for the Collector (defaults to the bundled stdout example) |
| service.config.javaOpts | string | `""` | Extra JVM options for the Collector process |
| service.gcp.networkEndpointGroupName | string | `""` | Name of the Network Endpoint Group to bind onto (default: .Release.Name) |
| service.image.isRepositoryPublic | bool | `true` | Whether the image repository is public (if false, the Docker registry secret is used) |
| service.image.repository | string | `"snowplow/scala-stream-collector"` | Image repository to pull from |
| service.image.tag | string | `"2.5.0"` | Image tag to pull |
| service.image.target | string | `"stdout"` | Which image should be pulled (options: stdout, nsq, kinesis, sqs, kafka or pubsub) |
| service.maxReplicas | int | `4` | Maximum number of replicas to scale up to |
| service.minReplicas | int | `1` | Minimum number of replicas to keep available |
| service.nginx.deploy | bool | `false` | Whether to serve requests with an NGINX proxy side-car instead of the Collector directly |
| service.nginx.image.isRepositoryPublic | bool | `true` | Whether the NGINX image repository is public |
| service.nginx.image.repository | string | `"nginx"` | NGINX image repository to pull from |
| service.nginx.image.tag | string | `"stable-alpine"` | NGINX image tag to pull |
| service.port | int | `8080` | HTTP port to bind and expose the service on |
| service.readinessProbe.failureThreshold | int | `3` | Failures before the pod is marked not-ready |
| service.readinessProbe.initialDelaySeconds | int | `5` | Delay before the first readiness check |
| service.readinessProbe.periodSeconds | int | `5` | Interval between readiness checks |
| service.readinessProbe.successThreshold | int | `2` | Successes before the pod is marked ready |
| service.readinessProbe.timeoutSeconds | int | `5` | Timeout for each readiness check |
| service.ssl.certificateBase64 | string | `""` | Certificate in PEM form (base64-encoded) |
| service.ssl.certificatePrivateKeyBase64 | string | `""` | Certificate private key in PEM form (base64-encoded) |
| service.ssl.enable | bool | `false` | Whether to enable the TLS port (requires service.nginx.deploy to be true) |
| service.ssl.port | int | `8443` | HTTPS port to bind and expose the service on |
| service.targetCPUUtilizationPercentage | int | `75` | Target CPU utilization percentage for the autoscaler |
| service.terminationGracePeriodSeconds | int | `630` | Grace period for pod shutdown |
| serviceAccount.aws.roleARN | string | `""` | IAM Role ARN to bind to the k8s service account |
| serviceAccount.deploy | bool | `false` | Whether to create a service-account |
| serviceAccount.gcp.serviceAccount | string | `""` | Service Account email to bind to the k8s service account |
@@ -0,0 +1,37 @@
upstream ssc {
  server localhost:{{ include "collector.port" . }};
}

server {
  server_tokens off;
  listen {{ .Values.service.port }};
  access_log /dev/null;
  error_log /dev/null;

  location / {
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $http_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $http_x_forwarded_proto;
    proxy_pass http://ssc;
  }
}

{{- if .Values.service.ssl.enable }}
server {
  server_tokens off;
  listen {{ .Values.service.ssl.port }} ssl;
  access_log /dev/null;
  error_log /dev/null;

  ssl_certificate /etc/nginx/ssl/collector_cert.pem;
  ssl_certificate_key /etc/nginx/ssl/collector_key.pem;

  location / {
    proxy_set_header Host $host;
    proxy_set_header X-Forwarded-For $http_x_forwarded_for;
    proxy_set_header X-Forwarded-Proto $http_x_forwarded_proto;
    proxy_pass http://ssc;
  }
}
{{- end }}
@@ -0,0 +1,10 @@
{
  "auths":{
    "{{ .Values.secrets.docker.server }}":{
      "username":"{{ .Values.secrets.docker.username }}",
      "password":"{{ .Values.secrets.docker.password }}",
      "email":"{{ .Values.secrets.docker.email }}",
      "auth":"{{ printf "%s:%s" .Values.secrets.docker.username .Values.secrets.docker.password | b64enc }}"
    }
  }
}
@@ -0,0 +1,15 @@
collector {
  interface = "0.0.0.0"
  port = ${COLLECTOR_PORT}

  streams {
    good = "good"
    bad = "bad"

    # Assumes you have a locally deployed Kafka cluster
    # e.g. helm upgrade --install kafka bitnami/kafka
    sink {
      brokers = "kafka-0.kafka-headless.default.svc.cluster.local:9092"
    }
  }
}
@@ -0,0 +1,9 @@
collector {
  interface = "0.0.0.0"
  port = ${COLLECTOR_PORT}

  streams {
    good = "good"
    bad = "bad"
  }
}
@@ -0,0 +1,24 @@
{{- if eq .Values.service.image.target "stdout" }}
------------------------------------------------------------------------------------------------------------------------
WARNING: Your Collector is running in stdout mode which means all collected events are written directly to the pod logs.
         In production you will need to target an external stream to persist events safely and allow for consumption by
         downstream services like Snowplow Enrich.
------------------------------------------------------------------------------------------------------------------------
{{- end }}

The Collector can be accessed via port {{ .Values.service.port }} on the following DNS name from within your cluster:

  {{ .Release.Name }}.{{ .Release.Namespace }}.svc.cluster.local

To connect to your server from outside the cluster execute the following command:

  kubectl port-forward --namespace {{ .Release.Namespace }} svc/{{ .Release.Name }} {{ .Values.service.port }}:{{ .Values.service.port }}

You can check if the service is healthy by hitting the following endpoint:

  http://localhost:{{ .Values.service.port }}/health

You can send a test event through (note: in stdout mode your events will land in the pod logs directly):

  http://localhost:{{ .Values.service.port }}/i?e=pv
  http://localhost:{{ .Values.service.port }}/i
@@ -0,0 +1,32 @@
{{/* vim: set filetype=mustache: */}}
{{/*
Define default values for required values.
*/}}

{{- define "service.targetCPUUtilizationPercentage" -}}
{{- mul .Values.service.targetCPUUtilizationPercentage .Values.service.minReplicas }}
{{- end -}}

{{- define "service.gcp.networkEndpointGroupName" -}}
{{- default .Release.Name .Values.service.gcp.networkEndpointGroupName -}}
{{- end -}}

{{/* Fall back to the bundled stdout example config when no HOCON is supplied */}}
{{- define "service.config.hoconBase64" -}}
{{- if eq .Values.service.config.hoconBase64 "" }}
{{- tpl (.Files.Get "examples/stdout.hocon") . | b64enc -}}
{{- else -}}
{{- .Values.service.config.hoconBase64 -}}
{{- end -}}
{{- end -}}

{{- define "service.nginx.confBase64" -}}
{{- tpl (.Files.Get "configs/collector.conf") . | b64enc -}}
{{- end -}}

{{/* When the NGINX side-car is deployed it owns service.port, so the Collector binds to port + 1 */}}
{{- define "collector.port" -}}
{{- if .Values.service.nginx.deploy }}
{{- add .Values.service.port 1 -}}
{{- else -}}
{{- .Values.service.port -}}
{{- end -}}
{{- end -}}
> _Review comment_: I like the restricted permissions here.