Commit

Merge branch 'release/0.3.13'
tomcis committed Feb 4, 2021
2 parents 89add7f + dfb1a51 commit 830b436
Showing 15 changed files with 54 additions and 29 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -15,7 +15,7 @@ repos:
- id: flake8
args: [ "--select=E9,F63,F7,F82"]
- repo: https://github.com/asottile/pyupgrade
rev: v2.7.4
rev: v2.9.0
hooks:
- id: pyupgrade
args: ['--py36-plus','--exit-zero-even-if-changed']
6 changes: 6 additions & 0 deletions CHANGES.rst
@@ -2,6 +2,12 @@
Release notes
=============

Version 0.3.13, Feb 2021
------------------------
* ``Spark 3.0`` support (``histogrammar`` update) (#87)
* Improved documentation
* A few minor package improvements

Version 0.3.12, Jan 2021
------------------------
* Add proper check of matrix invertibility of covariance matrix in stats/numpy.py
3 changes: 3 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,3 @@
include requirements.txt
include LICENSE
include NOTICE
17 changes: 17 additions & 0 deletions README.rst
@@ -19,11 +19,28 @@ using monitoring business rules.

|example|

Announcements
=============

Spark 3.0
---------

Spark 3.0 is based on Scala 2.12, so make sure to pick up the matching `histogrammar` jar file:

.. code-block:: python

    spark = SparkSession.builder.config("spark.jars.packages", "io.github.histogrammar:histogrammar-sparksql_2.12:1.0.11").getOrCreate()

For Spark 2.x, which is compiled against Scala 2.11, simply replace ``2.12`` with ``2.11`` in the package name above.
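The Scala-version choice can be sketched as a small helper. This is a hypothetical illustration: ``histogrammar_package`` is our name, and it only encodes the rule stated above (Scala 2.12 for Spark 3.x, 2.11 for Spark 2.x).

```python
def histogrammar_package(spark_version: str) -> str:
    """Return the histogrammar Maven coordinate matching a Spark version.

    Spark 3.x is built against Scala 2.12, Spark 2.x against Scala 2.11.
    """
    major = int(spark_version.split(".")[0])
    scala = "2.12" if major >= 3 else "2.11"
    return f"io.github.histogrammar:histogrammar-sparksql_{scala}:1.0.11"

print(histogrammar_package("3.0.1"))
# io.github.histogrammar:histogrammar-sparksql_2.12:1.0.11
```

The resulting string is what you would pass to ``spark.jars.packages`` as shown above.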

`January 29, 2021`

Documentation
=============

The entire `popmon` documentation including tutorials can be found at `read-the-docs <https://popmon.readthedocs.io>`_.


Examples
========

35 changes: 20 additions & 15 deletions docs/source/configuration.rst
@@ -12,34 +12,38 @@ Reference types
When generating a report from a DataFrame, the reference type can be set with the option ``reference_type``,
in four different ways:

1. Using the DataFrame on which the stability report is built as a self-reference. (This is the default setting.)
1. Using the DataFrame on which the stability report is built as a self-reference. This reference method is static: each time slot is compared to all the slots in the DataFrame (all included in one distribution). This is the default reference setting.

.. code-block:: python

    # generate stability report using a self-reference
    report = df.pm_stability_report(reference_type="self")

2. Using an external reference DataFrame or set of histograms:
2. Using an external reference DataFrame or set of histograms. This is also a static method: each time slot is compared to all the time slots in the reference data.

.. code-block:: python

    # generate stability report using an external reference
    report = df.pm_stability_report(reference_type="external", reference=reference)

3. Using a rolling window as reference, by default the 10 preceding time slots:
3. Using a rolling window within the same DataFrame as reference. This method is dynamic: we can set the size of the window and the shift from the current time slot. By default the 10 preceding time slots are used as reference (shift=1, window_size=10).

.. code-block:: python

    # generate stability report using a rolling window reference
    report = df.pm_stability_report(reference_type="rolling", window=10, shift=1)

4. Using an expanding window of all preceding time slots:
4. Using an expanding window on all preceding time slots within the same DataFrame. This is also a dynamic method, with variable window size. All the available previous time slots are used. For example, if we have 2 time slots available and shift=1, window size will be 1 (so the previous slot is the reference), while if we have 10 time slots and shift=1, window size will be 9 (and all previous time slots are reference).

.. code-block:: python

    # generate stability report using an expanding reference
    report = df.pm_stability_report(reference_type="expanding", shift=1)

Note that, by default, popmon also performs a rolling comparison of the histograms in each time period with those in the
previous time period. The results of these comparisons contain the term "prev1", and are found in the comparisons section
of a report.
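The dynamic reference windows described in points 3 and 4 can be illustrated with plain Python. This is a sketch of the window arithmetic only, not popmon's internal code; the function names are ours.

```python
def rolling_reference(slots, i, window=10, shift=1):
    """The `window` slots ending `shift` slots before slot i."""
    start = max(0, i - shift - window + 1)
    return slots[start : i - shift + 1]

def expanding_reference(slots, i, shift=1):
    """All available slots up to `shift` slots before slot i (variable window size)."""
    return slots[: max(0, i - shift + 1)]

slots = list(range(10))
print(rolling_reference(slots, 9, window=3))  # [6, 7, 8]
print(len(expanding_reference(slots, 9)))     # 9 -> all preceding slots are the reference
```

With only 2 slots available and ``shift=1``, the expanding window for the last slot has size 1, matching the example in point 4.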


Binning specifications
----------------------
@@ -53,6 +57,7 @@ To specify the time-axis binning alone, do:
    report = df.pm_stability_report(time_axis='date', time_width='1w', time_offset='2020-1-6')

The default time width is 30 days ('30d'), with time offset 2010-1-4 (a Monday).
All other features (except for 'date') are auto-binned in this example.

To specify your own binning specifications for individual features or combinations of features, do:
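A hypothetical sketch of such a binning specification follows. The key names (``bin_width``, ``bin_offset``, ``num``, ``low``, ``high``) are our assumption about the schema; consult the popmon configuration documentation for the exact options.

```python
# Hypothetical per-feature binning specifications (key names assumed).
bin_specs = {
    "x": {"bin_width": 1, "bin_offset": 0},     # fixed-width bins
    "y": {"num": 10, "low": 0.0, "high": 2.0},  # fixed range split into 10 bins
}
# The dict would then be passed to the report builder, e.g.:
# report = df.pm_stability_report(time_axis="date", bin_specs=bin_specs)
```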
@@ -195,16 +200,16 @@ Spark usage
.. code-block:: python

    import popmon
    from pyspark.sql import SparkSession

    # downloads histogrammar jar files if not already installed, used for histogramming of spark dataframe
    # (previously: 'org.diana-hep:histogrammar-sparksql_2.11:1.0.4')
    spark = SparkSession.builder.config('spark.jars.packages', 'io.github.histogrammar:histogrammar-sparksql_2.12:1.0.11').getOrCreate()

    # load a dataframe
    spark_df = spark.read.format('csv').options(header='true').load('file.csv')

    # generate the report
    report = spark_df.pm_stability_report(time_axis='timestamp')

Spark example on Google Colab
@@ -216,8 +221,8 @@ This snippet contains the instructions for setting up a minimal environment for
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz
!wget -P /content/spark-2.4.7-bin-hadoop2.7/jars/ -q https://repo1.maven.org/maven2/org/diana-hep/histogrammar-sparksql_2.11/1.0.4/histogrammar-sparksql_2.11-1.0.4.jar
!wget -P /content/spark-2.4.7-bin-hadoop2.7/jars/ -q https://repo1.maven.org/maven2/org/diana-hep/histogrammar_2.11/1.0.4/histogrammar_2.11-1.0.4.jar
!wget -P /content/spark-2.4.7-bin-hadoop2.7/jars/ -q https://repo1.maven.org/maven2/io/github/histogrammar/histogrammar-sparksql_2.12/1.0.11/histogrammar-sparksql_2.12-1.0.11.jar
!wget -P /content/spark-2.4.7-bin-hadoop2.7/jars/ -q https://repo1.maven.org/maven2/io/github/histogrammar/histogrammar_2.12/1.0.11/histogrammar_2.12-1.0.11.jar
!pip install -q findspark popmon
Now that spark is installed, restart the runtime.
@@ -234,7 +239,7 @@ Now that spark is installed, restart the runtime.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]") \
.config("spark.jars", "/content/jars/histogrammar_2.11-1.0.4.jar,/content/jars/histogrammar-sparksql_2.11-1.0.4.jar") \
.config("spark.jars", "/content/jars/histogrammar_2.12-1.0.11.jar,/content/jars/histogrammar-sparksql_2.12-1.0.11.jar") \
.config("spark.sql.execution.arrow.enabled", "false") \
.config("spark.sql.session.timeZone", "GMT") \
.getOrCreate()
2 changes: 1 addition & 1 deletion docs/source/introduction.rst
@@ -29,7 +29,7 @@ We define the normalized residual of a value of interest with respect to the sel
This quantity is known as the "pull" of the value. The pull is calculated for every profile
of every feature. The size of the pull is used in `popmon`
to flag any significant differences over time with respect to the reference.
to flag any significant differences over time with respect to the reference. Note that you need to have at least two time slots in the reference in order to calculate the pull.
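The pull described above can be computed numerically as follows. This is a plain illustration of the formula (normalized residual with respect to the reference), not popmon's internal code.

```python
import statistics

def pull(value, reference):
    """Normalized residual of `value` w.r.t. the reference time slots.

    Requires at least two reference time slots, since the standard
    deviation is otherwise undefined.
    """
    mean = statistics.mean(reference)
    std = statistics.stdev(reference)  # raises for fewer than 2 slots
    return (value - mean) / std

print(round(pull(12.0, [10.0, 10.5, 9.5, 10.0]), 2))  # 4.9
```

A pull this large would be flagged as a significant deviation from the reference.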

We use traffic lights to indicate where large deviations from the reference occur.
To see how these work, consider the following example.
6 changes: 0 additions & 6 deletions popmon/hist/filling/spark_histogrammar.py
@@ -7,7 +7,6 @@
"""

import histogrammar as hg
import histogrammar.sparksql
import numpy as np
from tqdm import tqdm

@@ -189,8 +188,6 @@ def process_features(self, df, cols_by_type):
to_ns = sparkcol(col).cast("timestamp").cast("float") * 1e9
idf = idf.withColumn(col, to_ns)

hg.sparksql.addMethods(idf)

return idf

def construct_empty_hist(self, df, features):
@@ -218,9 +215,6 @@ def construct_empty_hist(self, df, features):

hist = self.get_hist_bin(hist, features, quant, col, dt)

# set data types in histogram
dta = [self.var_dtype[col] for col in features]
hist.datatype = dta[0] if len(features) == 1 else dta
return hist

def fill_histograms(self, idf):
2 changes: 1 addition & 1 deletion popmon/notebooks/popmon_tutorial_advanced.ipynb
@@ -162,7 +162,7 @@
"source": [
"if pyspark_installed:\n",
" spark = SparkSession.builder.config(\n",
" \"spark.jars.packages\", \"org.diana-hep:histogrammar-sparksql_2.11:1.0.4\"\n",
" \"spark.jars.packages\", \"io.github.histogrammar:histogrammar-sparksql_2.12:1.0.11\"\n",
" ).getOrCreate()\n",
"\n",
" sdf = spark.createDataFrame(df)\n",
4 changes: 2 additions & 2 deletions popmon/version.py
@@ -1,6 +1,6 @@
"""THIS FILE IS AUTO-GENERATED BY SETUP.PY."""

name = "popmon"
version = "0.3.12"
full_version = "0.3.12"
version = "0.3.13"
full_version = "0.3.13"
release = True
2 changes: 1 addition & 1 deletion setup.py
@@ -4,7 +4,7 @@

MAJOR = 0
REVISION = 3
PATCH = 12
PATCH = 13
DEV = False
# NOTE: also update version at: README.rst

Binary file not shown.
Binary file not shown.
4 changes: 2 additions & 2 deletions tests/popmon/hist/test_spark_histogrammar.py
@@ -21,8 +21,8 @@ def get_spark():

current_path = dirname(abspath(__file__))

hist_spark_jar = join(current_path, "jars/histogrammar-sparksql_2.11-1.0.4.jar")
hist_jar = join(current_path, "jars/histogrammar_2.11-1.0.4.jar")
hist_spark_jar = join(current_path, "jars/histogrammar-sparksql_2.11-1.0.11.jar")
hist_jar = join(current_path, "jars/histogrammar_2.11-1.0.11.jar")

spark = (
SparkSession.builder.master("local")
