Merge pull request #72 from netsage-project/feature/docs_updates
Pinning the documentation to version 1.2.7
Showing 19 changed files with 1,412 additions and 0 deletions.

website/versioned_docs/version-1.2.7/components/docker_env.md (42 additions, 0 deletions)

Please copy `env.example` to `.env`:

```sh
cp env.example .env
```

Then edit the `.env` file to set the sensor names:

```sh
sflowSensorName="my sflow sensor name"
netflowSensorName="my netflow sensor name"
```

Simply change the names to unique identifiers and you're good to go. (Use quotes if the names have spaces.)

:::note
These names uniquely identify the source of the data. In elasticsearch, they are saved in the `meta.sensor_id` field and can be used in visualizations. Choose names that are meaningful and unique.
For example, your sensor names might be "RNDNet Sflow" and "RNDNet Netflow", or "rtr.one.rndnet.edu" and "rtr.two.rndnet.edu". Whatever makes sense in your situation.
:::

- If you don't set a sensor name, the default docker hostname, which changes each time you run the pipeline, will be used.
- If you have only one collector, comment out the line for the one you are not using.
- If you have more than one of the same type of collector, see the "Docker Advanced" documentation.
- If you're not using a netflow or an sflow collector (you are getting only tstat data), then simply disregard the env settings and don't start up either collector.

Other settings of note in this file include the following. You will not necessarily need to change these, but be aware of them.

**rabbit_output_host**: This defines where the final data will land after going through the pipeline. Use the default `rabbit` for the local rabbitMQ server running in its docker container, or enter a hostname to send to a remote rabbitMQ server (along with the correct username, password, and queue key/name).

The Logstash Aggregation Filter settings are exposed in case you wish to use different values.
(See comments in the \*-aggregation.conf file.) This config stitches together long-lasting flows that are seen in multiple nfcapd files, matching by the 5-tuple (source and destination IPs, ports, and protocol) plus sensor name.

**Aggregation_maps_path**: The name of the file to which logstash will write in-progress aggregation data when logstash shuts down. When logstash starts up again, it will read this file in and resume aggregating. The filename is configurable for complex situations, but /data/ is required.

**Inactivity_timeout**: If more than inactivity_timeout seconds have passed between the 'start' of a flow and the 'start' of the LAST matching flow, OR if no matching flow has come in for inactivity_timeout seconds on the clock, assume the flow has ended.

:::note
Nfcapd files are typically written every 5 minutes. Netsage uses an inactivity_timeout of 630 sec (10.5 min) for 5-min files and 960 sec (16 min) for 15-min files. (For 5-min files, this allows one 5-min gap, or one period during which the number of bits transferred doesn't meet the cutoff.)
:::

**max_flow_timeout**: If a long-lasting flow is still aggregating when this timeout is reached, arbitrarily cut it off and start a new flow. The default is 24 hours.
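
As a rough sketch, these settings might look something like the following in a `.env` file. The variable names and values below are illustrative assumptions; check your own `env.example` for the exact keys and defaults.

```sh
# Hypothetical .env excerpt; consult env.example for the real variable names and defaults
rabbit_output_host=rabbit                               # local rabbitMQ container; use a remote hostname to ship elsewhere
aggregation_maps_path=/data/logstash-aggregation-maps   # must be under /data/
inactivity_timeout=630                                  # seconds; 630 for 5-min nfcapd files, 960 for 15-min files
max_flow_timeout=86400                                  # seconds; 24 hours, the default
```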

website/versioned_docs/version-1.2.7/components/docker_pipeline.md (17 additions, 0 deletions)

Start up the pipeline (importers and logstash) using:

```sh
docker-compose up -d
```
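
If you want to confirm that the containers came up, listing them is a quick optional check (the service names depend on your docker-compose.yml):

```sh
docker-compose ps
```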

You can check the logs for each of the containers by running:

```sh
docker-compose logs
```
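
To follow the logs of a single service in real time, you can also pass a service name. The name used here is an assumption based on a typical docker-compose.yml; adjust it to match yours.

```sh
# Follow only the logstash container's logs (replace "logstash" with any service defined in docker-compose.yml)
docker-compose logs -f logstash
```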

Shut down the pipeline using:

```sh
docker-compose down
```

website/versioned_docs/version-1.2.7/components/docker_upgrade.md (30 additions, 0 deletions)

### Update Source Code

To do a Pipeline upgrade, just reset and pull changes, including the new release, from github. Your non-example env and override files will not be overwritten, but check the new example files to see if there are any updates to copy in.

```sh
git reset --hard
git pull origin master
```

### Docker and Collectors

Since the collectors live outside of version control, please check the docker-compose.override_example.yml to see if nfdump needs to be updated (eg, `image: netsage/nfdump-collector:1.6.18`). Also check the docker version (eg, `version: "3.7"`) to see if you'll need to upgrade docker.
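
One quick way to spot such changes is to diff the example against your own override file. The second filename below is an assumption; use whatever your override file is actually named.

```sh
diff docker-compose.override_example.yml docker-compose.override.yml
```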

### Select Release Version

Run these two commands to select the new release you want to run. In the first, replace "tag_value" with the version to run (eg, v1.2.8). When asked by the second, select the same version as the tag you checked out.

```sh
git checkout "tag_value"
./scripts/docker_select_version.sh
```
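
If you are not sure which release tags exist, standard git commands will list them (shown as a convenience; not part of the documented procedure):

```sh
git fetch --tags
git tag --list
```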

### Update Docker Containers

This applies to both development and release versions:

```sh
docker-compose pull
```

website/versioned_docs/version-1.2.7/deploy/bare_metal_install.md (286 additions, 0 deletions)

---
id: bare_metal_install
title: NetSage Flow Processing Pipeline Installation Guide
sidebar_label: Server Installation Guide
---

This document covers installing the NetSage Flow Processing Pipeline on a new machine. The steps below should be followed in order unless you know for sure what you are doing. This document assumes a RedHat Linux environment or one of its derivatives.

## Data sources

The Processing pipeline needs data to ingest in order to do anything. There are two types of data that can be consumed:

1. sflow or netflow
2. tstat

At least one of these must be set up on a sensor to provide the incoming flow data.

Sflow and netflow data should be sent to ports on the pipeline host where nfcapd and/or sfcapd are ready to receive it.
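
As a rough sketch, the collectors might be started along these lines; the ports and directories are placeholders, and the exact flags vary by nfdump version, so consult the nfcapd/sfcapd man pages.

```sh
# Hypothetical collector invocations; adjust ports, paths, and flags for your environment
nfcapd -D -l /data/input_data/netflow -p 9999   # netflow collector writing nfcapd files
sfcapd -D -l /data/input_data/sflow -p 9998     # sflow collector writing sfcapd files
```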

Tstat data should be sent directly to the logstash input RabbitMQ queue (the same one that the Importer writes to, if it is used). From there, the data will be processed the same as sflow/netflow data.

## Installing the Prerequisites

### Installing nfdump

Sflow and netflow data collection, as well as the NetFlow Importer, use the nfdump tools. If you are only collecting tstat data, you do not need nfdump.

nfdump is _not_ listed as a dependency of the Pipeline RPM package, as in many cases people are running special builds of nfdump -- but make sure you install it before you try running the Netflow Importer. If in doubt, `yum install nfdump` should work. Flow data exported by some routers requires a newer version of nfdump than the one in the CentOS repos; in these cases, it may be necessary to manually compile and install the latest nfdump.
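
If you do need to build nfdump from source, the usual autotools flow from a release tarball is roughly the following; the configure options (for example `--enable-sflow` for the sfcapd collector) vary by version, so check the nfdump README.

```sh
# Rough sketch of a from-source build; not required if the packaged nfdump is new enough
./configure --enable-sflow
make
make install
```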

### Installing RabbitMQ

The pipeline requires a RabbitMQ server. Typically, this runs on the same server as the pipeline itself, but if need be, you can separate them (for this reason, the Rabbit server is not automatically installed with the pipeline package).

```sh
[root@host ~]# yum install rabbitmq-server
```

Typically, the default configuration will work. Perform any desired Rabbit configuration, then start RabbitMQ:

```sh
[root@host ~]# /sbin/service rabbitmq-server start
or # systemctl start rabbitmq-server.service
```
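
To confirm the broker is up before moving on, an optional status check is usually enough:

```sh
[root@host ~]# rabbitmqctl status
```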

### Installing the EPEL repo

Some of our dependencies come from the EPEL repo. To install this:

```
[root@host ~]# yum install epel-release
```

### Installing the GlobalNOC Open Source repo

The Pipeline package (and its dependencies that are not in EPEL) is available in the GlobalNOC Open Source Repo.

For Red Hat/CentOS 6, create `/etc/yum.repos.d/grnoc6.repo` with the following content.

```
[grnoc6]
name=GlobalNOC Public el6 Packages - $basearch
baseurl=https://repo-public.grnoc.iu.edu/repo/6/$basearch
enabled=1
gpgcheck=1
gpgkey=https://repo-public.grnoc.iu.edu/repo/RPM-GPG-KEY-GRNOC6
```

For Red Hat/CentOS 7, create `/etc/yum.repos.d/grnoc7.repo` with the following content.

```
[grnoc7]
name=GlobalNOC Public el7 Packages - $basearch
baseurl=https://repo-public.grnoc.iu.edu/repo/7/$basearch
enabled=1
gpgcheck=1
gpgkey=https://repo-public.grnoc.iu.edu/repo/RPM-GPG-KEY-GRNOC7
```

The first time you install packages from the repo, you will have to accept the GlobalNOC repo key.

## Installing the Pipeline (Importer and Logstash)

Install it like this:

```
[root@host ~]# yum install grnoc-netsage-deidentifier
```

Pipeline components:

1. Flow Filter - GlobalNOC uses this for Cenic data to filter out some flows. Not needed otherwise.
2. Netsage Netflow Importer - required to read the nfcapd files written by the sflow and netflow collectors. (If using tstat flow sensors only, this is not needed.)
3. Logstash - be sure the number of logstash pipeline workers is set to 1 (unless you have removed the aggregation logstash conf).
4. Logstash configs - these are executed in alphabetical order. See the Logstash doc.

Nothing will automatically start after installation, as we need to move on to configuration.

## Importer Configuration

Configuration files of interest are:

- /etc/grnoc/netsage/deidentifier/netsage_shared.xml - shared config file allowing configuration of collections and Rabbit connection information
- /etc/grnoc/netsage/deidentifier/netsage_netflow_importer.xml - other settings
- /etc/grnoc/netsage/deidentifier/logging.conf - logging config
- /etc/grnoc/netsage/deidentifier/logging-debug.conf - logging config with debug enabled

### Setting up the shared config file

`/etc/grnoc/netsage/deidentifier/netsage_shared.xml`

There used to be many perl-based pipeline components and daemons. At this point, only the Importer is left, the rest having been replaced by logstash. The shared config file, which was formerly used by all the perl components, is read before the individual Importer config file.

The most important part of the shared configuration file is the definition of collections. Each sflow or netflow sensor will have its own collection stanza. Here is one such stanza, a netflow example. Instance and router-address can be left commented out.

```xml
<collection>
    <!-- Top level directory of the nfcapd files for this sensor (within this dir are normally year directories, etc.) -->
    <flow-path>/path/to/netflow-files/</flow-path>
    <!-- Sensor name - can be the hostname or any string you like -->
    <sensor>Netflow Sensor 1d</sensor>
    <!-- Flow type - sflow or netflow (defaults to netflow) -->
    <flow-type>netflow</flow-type>
    <!-- "instance" goes along with sensor. This is to identify various instances if a sensor has -->
    <!-- more than one "stream" / data collection. Defaults to 0. -->
    <!-- <instance>1</instance> -->
    <!-- Used in Flow-Filter. Defaults to sensor, but you can set it to something else here -->
    <!-- <router-address></router-address> -->
</collection>
```

Having multiple collections in one importer can sometimes cause issues for aggregation, as looping through the collections one at a time adds to the time between the flows, affecting timeouts. You can also set up multiple Importers with differently named shared and importer config files and separate init.d files.

There is also RabbitMQ connection information in the shared config, though queue names are set in the Importer config. (The Importer does not read from a rabbit queue, but other old components did, so both input and output are set.)

Ideally, flows should be deidentified before they leave the host on which the data is stored. If flows that have not been deidentified need to be pushed to another node for some reason, the Rabbit connection must be encrypted with SSL.

If you're running a default RabbitMQ config, which is open only to 'localhost' as guest/guest, you won't need to change anything here.

```xml
<!-- rabbitmq connection info -->
<rabbit_input>
    <host>127.0.0.1</host>
    <port>5672</port>
    <username>guest</username>
    <password>guest</password>
    <ssl>0</ssl>
    <batch_size>100</batch_size>
    <vhost>/</vhost>
    <durable>1</durable> <!-- Whether the rabbit queue is 'durable' (don't change this unless you have a reason) -->
</rabbit_input>

<rabbit_output>
    <host>127.0.0.1</host>
    <port>5672</port>
    <username>guest</username>
    <password>guest</password>
    <ssl>0</ssl>
    <batch_size>100</batch_size>
    <vhost>/</vhost>
    <durable>1</durable> <!-- Whether the rabbit queue is 'durable' (don't change this unless you have a reason) -->
</rabbit_output>
```

### Setting up the Importer config file

`/etc/grnoc/netsage/deidentifier/netsage_netflow_importer.xml`

This file has a few more settings specific to the Importer component which you may want to adjust.

- Rabbit_output has the name of the output queue. This should be the same as that of the logstash input queue.
- (The Importer does not actually use an input rabbit queue, so we add a "fake" one here.)
- Min-bytes is a threshold applied to flows aggregated within one nfcapd file. Flows smaller than this will be discarded.
- Min-file-age is used to be sure files are complete before being read.
- Cull-enable and cull-ttl can be used to have nfcapd files older than some number of days automatically deleted.
- Pid-file is where the pid file should be written. Be sure this matches what is used in the init.d file.
- Keep num-processes set to 1.
```xml
<config>
    <!-- NOTE: Values here override those in the shared config -->

    <!-- rabbitmq queues -->
    <rabbit_input>
        <queue>netsage_deidentifier_netflow_fake</queue>
        <channel>2</channel>
    </rabbit_input>

    <rabbit_output>
        <channel>3</channel>
        <queue>netsage_deidentifier_raw</queue>
    </rabbit_output>

    <worker>
        <!-- How many flows to process at once -->
        <flow-batch-size>100</flow-batch-size>

        <!-- How many concurrent workers should perform the necessary operations -->
        <num-processes>1</num-processes>

        <!-- path to nfdump executable (defaults to /usr/bin/nfdump) -->
        <!-- <nfdump-path>/path/to/nfdump</nfdump-path> -->

        <!-- Where to store the cache, where it tracks what files it has/hasn't read -->
        <cache-file>/var/cache/netsage/netflow_importer.cache</cache-file>

        <!-- The minimum flow size threshold - will not import any flows smaller than this -->
        <!-- Defaults to 500M -->
        <min-bytes>100000000</min-bytes>

        <!-- Do not import nfcapd files younger than min-file-age
             The value must match /^(\d+)([DWMYhms])$/ where D, W, M, Y, h, m and s are
             "day(s)", "week(s)", "month(s)", "year(s)", "hour(s)", "minute(s)" and "second(s)", respectively.
             See http://search.cpan.org/~pfig/File-Find-Rule-Age-0.2/lib/File/Find/Rule/Age.pm
             Default: 0 (no minimum age)
        -->
        <min-file-age>10m</min-file-age>

        <!-- cull-enable: whether to cull processed flow data files -->
        <!-- default: no culling; set to 1 to turn culling on -->
        <!-- <cull-enable>1</cull-enable> -->

        <!-- cull-ttl: cull time to live, in days -->
        <!-- number of days to retain imported data files before deleting them; default: 3 -->
        <!-- <cull-ttl>5</cull-ttl> -->
    </worker>

    <master>
        <!-- where should we write the daemon pid file to -->
        <pid-file>/var/run/netsage-netflow-importer-daemon.pid</pid-file>
    </master>

</config>
```

## Logstash Setup Notes

Standard logstash filter config files are provided with this package. Most should be used as-is, but the input and output configs may be modified for your use.

The aggregation filter also has settings that may be changed - check the two timeouts and the aggregation maps path.

When upgrading, these logstash configs will not be overwritten. Be sure any changes get copied into the production configs.

FOR FLOW STITCHING/AGGREGATION - IMPORTANT!
Flow stitching (ie, aggregation) will NOT work properly with more than ONE logstash pipeline worker!
Be sure to set "pipeline.workers: 1" in /etc/logstash/logstash.yml and/or /etc/logstash/pipelines.yml. When running logstash on the command line, use "-w 1".
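
A quick optional way to double-check the worker setting before starting logstash (the paths are the defaults named above):

```sh
grep -n "pipeline.workers" /etc/logstash/logstash.yml /etc/logstash/pipelines.yml
```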

## Start Logstash

```sh
[root@host ~]# /sbin/service logstash start
or # systemctl start logstash.service
```

It will take a couple of minutes to start. Log files are normally /var/log/messages and /var/log/logstash/logstash-plain.log.

When logstash is stopped, any flows currently "in the aggregator" will be written out to /tmp/logstash-aggregation-maps (or the path/file set in 40-aggregation.conf). These will be read in and deleted when logstash is started again.
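
To watch startup progress, tailing the logstash log works well:

```sh
[root@host ~]# tail -f /var/log/logstash/logstash-plain.log
```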

## Start the Importer

Typically, the daemons are started and stopped via init script (CentOS 6) or systemd (CentOS 7). They can also be run manually. The daemons all support these flags:

- `--config [file]` - specify which config file to read
- `--sharedconfig [file]` - specify which shared config file to read
- `--logging [file]` - the logging config
- `--nofork` - run in foreground (do not daemonize)
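
For example, a manual foreground run might look roughly like this. The executable name is a guess based on the pid-file name above; check what the package actually installs (e.g., in /usr/bin).

```sh
# Hypothetical manual invocation using the config paths documented above
netsage-netflow-importer-daemon \
    --sharedconfig /etc/grnoc/netsage/deidentifier/netsage_shared.xml \
    --config /etc/grnoc/netsage/deidentifier/netsage_netflow_importer.xml \
    --logging /etc/grnoc/netsage/deidentifier/logging-debug.conf \
    --nofork
```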

```sh
[root@host ~]# /sbin/service netsage-netflow-importer start
or # systemctl start netsage-netflow-importer.service
```

The Importer will create a daemon process and a worker process. When stopping the service, the worker process might take a few minutes to quit. If it does not quit, kill it by hand.

## Cron jobs

Sample cron files are provided. Please review and uncomment their contents. These periodically download MaxMind, CAIDA, and Science Registry files, and also restart logstash daily. Logstash needs to be restarted in order for any updated files to be read in.