Commit

Merge branch 'release/0.3.13'
tomcis committed Feb 4, 2021
2 parents 89add7f + dfb1a51 commit 830b436
Showing 15 changed files with 54 additions and 29 deletions.
2 changes: 1 addition & 1 deletion .pre-commit-config.yaml
@@ -15,7 +15,7 @@ repos:
- id: flake8
args: [ "--select=E9,F63,F7,F82"]
- repo: https://github.com/asottile/pyupgrade
rev: v2.7.4
rev: v2.9.0
hooks:
- id: pyupgrade
args: ['--py36-plus','--exit-zero-even-if-changed']
6 changes: 6 additions & 0 deletions CHANGES.rst
@@ -2,6 +2,12 @@
Release notes
=============

Version 0.3.13, Feb 2021
------------------------
* ``Spark 3.0`` support (``histogrammar`` update) (#87)
* Improved documentation
* A few minor package improvements

Version 0.3.12, Jan 2021
------------------------
* Add proper check of matrix invertibility of covariance matrix in stats/numpy.py
3 changes: 3 additions & 0 deletions MANIFEST.in
@@ -0,0 +1,3 @@
include requirements.txt
include LICENSE
include NOTICE
17 changes: 17 additions & 0 deletions README.rst
@@ -19,11 +19,28 @@ using monitoring business rules.

|example|

Announcements
=============

Spark 3.0
---------

Spark 3.0 is based on Scala 2.12, so make sure to pick up the matching `histogrammar` jar file:

.. code-block:: python

    spark = SparkSession.builder.config("spark.jars.packages", "io.github.histogrammar:histogrammar-sparksql_2.12:1.0.11").getOrCreate()

For Spark 2.x, which is compiled against Scala 2.11, simply replace ``2.12`` with ``2.11`` in the package name above.
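The Scala-version choice can be sketched as a small helper. This is a hypothetical illustration: ``histogrammar_package`` is our name, and it only encodes the rule stated above (Scala 2.12 for Spark 3.x, 2.11 for Spark 2.x).

```python
def histogrammar_package(spark_version: str) -> str:
    """Return the histogrammar Maven coordinate matching a Spark version.

    Spark 3.x is built against Scala 2.12, Spark 2.x against Scala 2.11.
    """
    major = int(spark_version.split(".")[0])
    scala = "2.12" if major >= 3 else "2.11"
    return f"io.github.histogrammar:histogrammar-sparksql_{scala}:1.0.11"

print(histogrammar_package("3.0.1"))
# io.github.histogrammar:histogrammar-sparksql_2.12:1.0.11
```

The resulting string is what you would pass to ``spark.jars.packages`` as shown above.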

`January 29, 2021`

Documentation
=============

The entire `popmon` documentation including tutorials can be found at `read-the-docs <https://popmon.readthedocs.io>`_.


Examples
========

35 changes: 20 additions & 15 deletions docs/source/configuration.rst
@@ -12,34 +12,38 @@ Reference types
When generating a report from a DataFrame, the reference type can be set with the option ``reference_type``,
in four different ways:

1. Using the DataFrame on which the stability report is built as a self-reference. (This is the default setting.)
1. Using the DataFrame on which the stability report is built as a self-reference. This reference method is static: each time slot is compared to all the slots in the DataFrame (all included in one distribution). This is the default reference setting.

.. code-block:: python

    # generate stability report using a self-reference
    report = df.pm_stability_report(reference_type="self")

2. Using an external reference DataFrame or set of histograms:
2. Using an external reference DataFrame or set of histograms. This is also a static method: each time slot is compared to all the time slots in the reference data.

.. code-block:: python

    # generate stability report using an external reference
    report = df.pm_stability_report(reference_type="external", reference=reference)

3. Using a rolling window as reference, by default the 10 preceding time slots:
3. Using a rolling window within the same DataFrame as reference. This method is dynamic: we can set the size of the window and the shift from the current time slot. By default the 10 preceding time slots are used as reference (shift=1, window_size=10).

.. code-block:: python

    # generate stability report using a rolling window reference
    report = df.pm_stability_report(reference_type="rolling", window=10, shift=1)

4. Using an expanding window of all preceding time slots:
4. Using an expanding window on all preceding time slots within the same DataFrame. This is also a dynamic method, with variable window size. All the available previous time slots are used. For example, if we have 2 time slots available and shift=1, window size will be 1 (so the previous slot is the reference), while if we have 10 time slots and shift=1, window size will be 9 (and all previous time slots are reference).

.. code-block:: python

    # generate stability report using an expanding reference
    report = df.pm_stability_report(reference_type="expanding", shift=1)

Note that, by default, popmon also performs a rolling comparison of the histograms in each time period with those in the
previous time period. The results of these comparisons contain the term "prev1", and are found in the comparisons section
of a report.
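The dynamic reference windows described in points 3 and 4 can be illustrated with plain Python. This is a sketch of the window arithmetic only, not popmon's internal code; the function names are ours.

```python
def rolling_reference(slots, i, window=10, shift=1):
    """The `window` slots ending `shift` slots before slot i."""
    start = max(0, i - shift - window + 1)
    return slots[start : i - shift + 1]

def expanding_reference(slots, i, shift=1):
    """All available slots up to `shift` slots before slot i (variable window size)."""
    return slots[: max(0, i - shift + 1)]

slots = list(range(10))
print(rolling_reference(slots, 9, window=3))  # [6, 7, 8]
print(len(expanding_reference(slots, 9)))     # 9 -> all preceding slots are the reference
```

With only 2 slots available and ``shift=1``, the expanding window for the last slot has size 1, matching the example in point 4.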


Binning specifications
----------------------
@@ -53,6 +57,7 @@ To specify the time-axis binning alone, do:
    report = df.pm_stability_report(time_axis='date', time_width='1w', time_offset='2020-1-6')

The default time width is 30 days ('30d'), with time offset 2010-1-4 (a Monday).
All other features (except for 'date') are auto-binned in this example.

To specify your own binning specifications for individual features or combinations of features, do:
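A hypothetical sketch of such a binning specification follows. The key names (``bin_width``, ``bin_offset``, ``num``, ``low``, ``high``) are our assumption about the schema; consult the popmon configuration documentation for the exact options.

```python
# Hypothetical per-feature binning specifications (key names assumed).
bin_specs = {
    "x": {"bin_width": 1, "bin_offset": 0},     # fixed-width bins
    "y": {"num": 10, "low": 0.0, "high": 2.0},  # fixed range split into 10 bins
}
# The dict would then be passed to the report builder, e.g.:
# report = df.pm_stability_report(time_axis="date", bin_specs=bin_specs)
```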
@@ -195,16 +200,16 @@ Spark usage
.. code-block:: python

    import popmon
    from pyspark.sql import SparkSession

    # downloads histogrammar jar files if not already installed, used for histogramming of spark dataframe
    # (previously: 'org.diana-hep:histogrammar-sparksql_2.11:1.0.4')
    spark = SparkSession.builder.config('spark.jars.packages', 'io.github.histogrammar:histogrammar-sparksql_2.12:1.0.11').getOrCreate()

    # load a dataframe
    spark_df = spark.read.format('csv').options(header='true').load('file.csv')

    # generate the report
    report = spark_df.pm_stability_report(time_axis='timestamp')

Spark example on Google Colab
@@ -216,8 +221,8 @@ This snippet contains the instructions for setting up a minimal environment for
!apt-get install openjdk-8-jdk-headless -qq > /dev/null
!wget -q https://www-us.apache.org/dist/spark/spark-2.4.7/spark-2.4.7-bin-hadoop2.7.tgz
!tar xf spark-2.4.7-bin-hadoop2.7.tgz
!wget -P /content/spark-2.4.7-bin-hadoop2.7/jars/ -q https://repo1.maven.org/maven2/org/diana-hep/histogrammar-sparksql_2.11/1.0.4/histogrammar-sparksql_2.11-1.0.4.jar
!wget -P /content/spark-2.4.7-bin-hadoop2.7/jars/ -q https://repo1.maven.org/maven2/org/diana-hep/histogrammar_2.11/1.0.4/histogrammar_2.11-1.0.4.jar
!wget -P /content/spark-2.4.7-bin-hadoop2.7/jars/ -q https://repo1.maven.org/maven2/io/github/histogrammar/histogrammar-sparksql_2.12/1.0.11/histogrammar-sparksql_2.12-1.0.11.jar
!wget -P /content/spark-2.4.7-bin-hadoop2.7/jars/ -q https://repo1.maven.org/maven2/io/github/histogrammar/histogrammar_2.12/1.0.11/histogrammar_2.12-1.0.11.jar
!pip install -q findspark popmon
Now that spark is installed, restart the runtime.
@@ -234,7 +239,7 @@ Now that spark is installed, restart the runtime.
from pyspark.sql import SparkSession
spark = SparkSession.builder.master("local[*]") \
.config("spark.jars", "/content/jars/histogrammar_2.11-1.0.4.jar,/content/jars/histogrammar-sparksql_2.11-1.0.4.jar") \
.config("spark.jars", "/content/jars/histogrammar_2.12-1.0.11.jar,/content/jars/histogrammar-sparksql_2.12-1.0.11.jar") \
.config("spark.sql.execution.arrow.enabled", "false") \
.config("spark.sql.session.timeZone", "GMT") \
.getOrCreate()
2 changes: 1 addition & 1 deletion docs/source/introduction.rst
@@ -29,7 +29,7 @@ We define the normalized residual of a value of interest with respect to the sel
This quantity is known as the "pull" of the value. The pull is calculated for every profile
of every feature. The size of the pull is used in `popmon`
to flag any significant differences over time with respect to the reference.
to flag any significant differences over time with respect to the reference. Note that you need to have at least two time slots in the reference in order to calculate the pull.
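The pull described above can be computed numerically as follows. This is a plain illustration of the formula (normalized residual with respect to the reference), not popmon's internal code.

```python
import statistics

def pull(value, reference):
    """Normalized residual of `value` w.r.t. the reference time slots.

    Requires at least two reference time slots, since the standard
    deviation is otherwise undefined.
    """
    mean = statistics.mean(reference)
    std = statistics.stdev(reference)  # raises for fewer than 2 slots
    return (value - mean) / std

print(round(pull(12.0, [10.0, 10.5, 9.5, 10.0]), 2))  # 4.9
```

A pull this large would be flagged as a significant deviation from the reference.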

We use traffic lights to indicate where large deviations from the reference occur.
To see how these work, consider the following example.
6 changes: 0 additions & 6 deletions popmon/hist/filling/spark_histogrammar.py
@@ -7,7 +7,6 @@
"""

import histogrammar as hg
import histogrammar.sparksql
import numpy as np
from tqdm import tqdm

@@ -189,8 +188,6 @@ def process_features(self, df, cols_by_type):
to_ns = sparkcol(col).cast("timestamp").cast("float") * 1e9
idf = idf.withColumn(col, to_ns)

hg.sparksql.addMethods(idf)

return idf

def construct_empty_hist(self, df, features):
@@ -218,9 +215,6 @@ def construct_empty_hist(self, df, features):

hist = self.get_hist_bin(hist, features, quant, col, dt)

# set data types in histogram
dta = [self.var_dtype[col] for col in features]
hist.datatype = dta[0] if len(features) == 1 else dta
return hist

def fill_histograms(self, idf):
2 changes: 1 addition & 1 deletion popmon/notebooks/popmon_tutorial_advanced.ipynb
@@ -162,7 +162,7 @@
"source": [
"if pyspark_installed:\n",
" spark = SparkSession.builder.config(\n",
" \"spark.jars.packages\", \"org.diana-hep:histogrammar-sparksql_2.11:1.0.4\"\n",
" \"spark.jars.packages\", \"io.github.histogrammar:histogrammar-sparksql_2.12:1.0.11\"\n",
" ).getOrCreate()\n",
"\n",
" sdf = spark.createDataFrame(df)\n",
4 changes: 2 additions & 2 deletions popmon/version.py
@@ -1,6 +1,6 @@
"""THIS FILE IS AUTO-GENERATED BY SETUP.PY."""

name = "popmon"
version = "0.3.12"
full_version = "0.3.12"
version = "0.3.13"
full_version = "0.3.13"
release = True
2 changes: 1 addition & 1 deletion setup.py
@@ -4,7 +4,7 @@

MAJOR = 0
REVISION = 3
PATCH = 12
PATCH = 13
DEV = False
# NOTE: also update version at: README.rst

Binary file not shown.
Binary file not shown.
4 changes: 2 additions & 2 deletions tests/popmon/hist/test_spark_histogrammar.py
@@ -21,8 +21,8 @@ def get_spark():

current_path = dirname(abspath(__file__))

hist_spark_jar = join(current_path, "jars/histogrammar-sparksql_2.11-1.0.4.jar")
hist_jar = join(current_path, "jars/histogrammar_2.11-1.0.4.jar")
hist_spark_jar = join(current_path, "jars/histogrammar-sparksql_2.11-1.0.11.jar")
hist_jar = join(current_path, "jars/histogrammar_2.11-1.0.11.jar")

spark = (
SparkSession.builder.master("local")
