Update all Python package versions by unpinning them.
Also:
* added package `plotnine` per DataBiosphere#126
* replaced package `tensorflow` with `tensorflow_cpu` to get rid of the warnings about GPUs being unavailable for Terra Cloud Runtimes
* added package `google-resumable-media` as an explicit dependency to ensure a more recent version of it is used; pandas-gbq depends on it for table uploads
* the `--use_rest_api` flag is now needed for the `%%bigquery` magic
  * As of release [google-cloud-bigquery 1.26.0 (2020-07-20)](https://github.com/googleapis/python-bigquery/blob/master/CHANGELOG.md#1260-2020-07-20) the BigQuery Python client uses the BigQuery Storage client by default.
  * This currently causes an error on Terra Cloud Runtimes `the user does not have 'bigquery.readsessions.create' permission for '<Terra billing project id>'`.
  * To work around this, we uninstall the dependency `google-cloud-bigquery-storage` so that the `--use_rest_api` flag can be used with `%%bigquery`, which falls back to the older, slower mechanism for data transfer.
* added `nbstripout` to terra-jupyter-aou and enabled it globally
* improved test coverage by enabling tests that were intentionally commented out for the prior image
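
The BigQuery workaround above depends on `google-cloud-bigquery-storage` staying uninstalled in the image. A minimal sketch of a check that could guard against it being pulled back in by another package is shown here. It is not part of this commit: the helper names `is_installed` and `check_bigquery_storage_removed` are hypothetical, and the sketch assumes Python 3.8+ for `importlib.metadata` (an image still on Python 3.7 would need the `importlib_metadata` backport instead).

```python
from importlib import metadata


def is_installed(distribution_name: str) -> bool:
    """Return True if package metadata for the named distribution is present."""
    try:
        metadata.version(distribution_name)
    except metadata.PackageNotFoundError:
        return False
    return True


def check_bigquery_storage_removed() -> None:
    """Hypothetical smoke check: fail if the BigQuery Storage client is installed."""
    if is_installed("google-cloud-bigquery-storage"):
        raise AssertionError(
            "google-cloud-bigquery-storage is installed; "
            "`%%bigquery --use_rest_api` relies on it being absent."
        )
```

Calling `check_bigquery_storage_removed()` from a smoke test would turn a silent reinstall of the dependency into an explicit test failure rather than a permissions error at query time.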
deflaux committed Mar 5, 2021
1 parent 17b8600 commit 8999f74
Showing 4 changed files with 91 additions and 79 deletions.
12 changes: 5 additions & 7 deletions terra-jupyter-aou/Dockerfile
@@ -1,4 +1,4 @@
FROM us.gcr.io/broad-dsp-gcr-public/terra-jupyter-python:0.0.23 AS python
FROM us.gcr.io/broad-dsp-gcr-public/terra-jupyter-python:0.0.24 AS python

FROM us.gcr.io/broad-dsp-gcr-public/terra-jupyter-r:1.0.13

@@ -85,9 +85,7 @@ ENV USER jupyter-user
USER $USER

RUN pip3 install --upgrade \
pandas-profiling==2.10.1 \
plotnine==0.7.1 \
# Parent image pins tensorflow to an old alpha version. Override here for now.
tensorflow==2.3.0 \
numpy==1.18.5 \
"git+git://github.com/all-of-us/workbench-snippets.git#egg=terra_widgets&subdirectory=py"
nbstripout \
"git+git://github.com/all-of-us/workbench-snippets.git#egg=terra_widgets&subdirectory=py" \
&& mkdir -p /home/$USER/.config/git \
&& nbstripout --install --global
129 changes: 71 additions & 58 deletions terra-jupyter-python/Dockerfile
@@ -1,6 +1,6 @@
FROM us.gcr.io/broad-dsp-gcr-public/terra-jupyter-base:0.0.19
USER root
#this makes it so pip runs as root, not the user
# This makes it so pip runs as root, not the user.
ENV PIP_USER=false

RUN apt-get update && apt-get install -yq --no-install-recommends \
@@ -20,69 +20,82 @@ RUN apt-get update && apt-get install -yq --no-install-recommends \

ENV HTSLIB_CONFIGURE_OPTIONS="--enable-gcs"

# Dev note: in general, do not pin Python packages to any particular version.
# Depend on the smoke tests to help us identify any package incompatibilities.
#
# If we find that we do need to pin a package version, be sure to:
# 1) Add a comment saying what needs to be true for us to remove the pin.
# (e.g. link to an issue and put the details there)
# 2) If the smoke tests did not show the problem, add a new test case to improve
# test coverage for the identified problem.
RUN pip3 -V \
&& pip3 install --upgrade pip \
&& pip3 install numpy==1.15.2 \
&& pip3 install py4j==0.10.7 \
&& python3 -mpip install matplotlib==3.0.0 \
&& pip3 install pandas==0.25.3 \
&& pip3 install pandas-gbq==0.12.0 \
&& pip3 install pandas-profiling==2.4.0 \
&& pip3 install seaborn==0.9.0 \
&& pip3 install python-lzo==1.12 \
&& pip3 install google-cloud-bigquery==1.23.1 \
&& pip3 install google-api-core==1.6.0 \
&& pip3 install google-cloud-bigquery-datatransfer==0.4.1 \
&& pip3 install google-cloud-datastore==1.10.0 \
&& pip3 install google-cloud-resource-manager==0.30.0 \
&& pip3 install google-cloud-storage==1.23.0 \
&& pip3 install scikit-learn==0.20.0 \
&& pip3 install statsmodels==0.9.0 \
&& pip3 install ggplot==0.11.5 \
&& sed -i 's/pandas.lib/pandas/g' /usr/local/lib/python3.7/dist-packages/ggplot/stats/smoothers.py \
# the next few `sed` lines are workaround for a ggplot bug. See https://github.com/yhat/ggpy/issues/662
&& sed -i 's/pandas.tslib.Timestamp/pandas.Timestamp/g' /usr/local/lib/python3.7/dist-packages/ggplot/stats/smoothers.py \
&& sed -i 's/pd.tslib.Timestamp/pd.Timestamp/g' /usr/local/lib/python3.7/dist-packages/ggplot/stats/smoothers.py \
&& sed -i 's/pd.tslib.Timestamp/pd.Timestamp/g' /usr/local/lib/python3.7/dist-packages/ggplot/utils.py \
&& pip3 install bokeh==1.0.0 \
&& pip3 install pyfasta==0.5.2 \
&& pip3 install markdown==2.4.1 \
&& pip3 install pdoc3==0.7.2 \
&& pip3 install biopython==1.72 \
&& pip3 install bx-python==0.8.2 \
&& pip3 install fastinterval==0.1.1 \
&& pip3 install matplotlib-venn==0.11.5 \
&& pip3 install bleach==1.5.0 \
&& pip3 install cycler==0.10.0 \
&& pip3 install h5py==2.7.1 \
&& pip3 install html5lib==0.9999999 \
&& pip3 install joblib==0.11 \
&& pip3 install keras==2.1.6 \
&& pip3 install patsy==0.4.1 \
&& pip3 install protobuf==3.7.1 \
&& pip3 install pymc3==3.10.0 \
&& pip3 install pyparsing==2.2.0 \
&& pip3 install numpy \
&& pip3 install py4j \
&& python3 -mpip install matplotlib \
&& pip3 install pandas \
&& pip3 install pandas-gbq \
&& pip3 install pandas-profiling \
&& pip3 install seaborn \
&& pip3 install python-lzo \
&& pip3 install google-cloud-bigquery \
&& pip3 install google-api-core \
&& pip3 install google-cloud-bigquery-datatransfer \
&& pip3 install google-cloud-datastore \
&& pip3 install google-cloud-resource-manager \
&& pip3 install google-cloud-storage \
&& pip3 install scikit-learn \
&& pip3 install statsmodels \
&& pip3 install ggplot \
&& pip3 install bokeh \
&& pip3 install pyfasta \
&& pip3 install markdown \
&& pip3 install pdoc3 \
&& pip3 install biopython \
&& pip3 install bx-python \
&& pip3 install fastinterval \
&& pip3 install matplotlib-venn \
&& pip3 install bleach \
&& pip3 install cycler \
&& pip3 install h5py \
&& pip3 install html5lib \
&& pip3 install joblib \
&& pip3 install keras \
&& pip3 install patsy \
&& pip3 install protobuf \
&& pip3 install pymc3 \
&& pip3 install pyparsing \
&& pip3 install Cython \
&& pip3 install pysam==0.15.4 --no-binary pysam \
&& pip3 install python-dateutil==2.6.1 \
&& pip3 install pytz==2017.3 \
&& pip3 install pyvcf==0.6.8 \
&& pip3 install pyyaml==5.3.1 \
&& pip3 install scipy==1.2 \
&& pip3 install tensorflow==2.0.0a0 \
&& pip3 install theano==0.9.0 \
&& pip3 install tqdm==4.19.4 \
&& pip3 install werkzeug==0.12.2 \
&& pip3 install certifi==2017.4.17 \
&& pip3 install intel-openmp==2018.0.0 \
&& pip3 install mkl==2018.0.3 \
&& pip3 install readline==6.2 \
&& pip3 install setuptools==42.0.2 \
&& pip3 install wheel
&& pip3 install pysam --no-binary pysam \
&& pip3 install python-dateutil \
&& pip3 install pytz \
&& pip3 install pyvcf \
&& pip3 install pyyaml \
&& pip3 install scipy \
# Use the cpu version of Tensorflow to eliminate the warnings about absent gpus on the Cloud Runtime.
&& pip3 install tensorflow_cpu \
&& pip3 install theano \
&& pip3 install tqdm \
&& pip3 install werkzeug \
&& pip3 install certifi \
&& pip3 install intel-openmp \
&& pip3 install mkl \
&& pip3 install readline \
&& pip3 install setuptools \
&& pip3 install wheel \
&& pip3 install plotnine \
&& pip3 install google-resumable-media \
# Remove this after https://broadworkbench.atlassian.net/browse/CA-1179
# As of release [google-cloud-bigquery 1.26.0 (2020-07-20)](https://github.com/googleapis/python-bigquery/blob/master/CHANGELOG.md#1260-2020-07-20)
# the BigQuery Python client uses the BigQuery Storage client by default.
# This currently causes an error on Terra Cloud Runtimes `the user does not have 'bigquery.readsessions.create' permission
# for '<Terra billing project id>'`. To work-around this uninstall the dependency so that flag `--use_rest_api` can be used
# with `%%bigquery` to use the older, slower mechanism for data transfer.
&& pip3 uninstall -y google-cloud-bigquery-storage

ENV USER jupyter-user
USER $USER
#we want pip to install into the user's dir when the notebook is running
# We want pip to install into the user's dir when the notebook is running.
ENV PIP_USER=true

# Note: this entrypoint is provided for running Jupyter independently of Leonardo.
24 changes: 13 additions & 11 deletions terra-jupyter-python/tests/smoke_test.ipynb
@@ -60,9 +60,11 @@
"source": [
"## Test BigQuery magic\n",
"\n",
"TODO(deflaux) after we update the BigQuery Python client package, be sure to explicitly use flag `--use_rest_api` with `%%bigquery`\n",
"* As of release [google-cloud-bigquery 1.26.0 (2020-07-20)](https://github.com/googleapis/python-bigquery/blob/master/CHANGELOG.md#1260-2020-07-20) the BigQuery Python client uses the BigQuery Storage client by default.\n",
"* This currently causes an error on Terra Cloud Runtimes `the user does not have 'bigquery.readsessions.create' permission for '<Terra billing project id>'`."
"* This currently causes an error on Terra Cloud Runtimes `the user does not have 'bigquery.readsessions.create' permission for '<Terra billing project id>'`.\n",
"* To work around this, we do two things:\n",
" 1. remove the dependency `google-cloud-bigquery-storage` from the `terra-jupyter-python` image\n",
" 1. use flag `--use_rest_api` with `%%bigquery`"
]
},
{
@@ -80,7 +82,7 @@
"metadata": {},
"outputs": [],
"source": [
"%%bigquery\n",
"%%bigquery --use_rest_api\n",
"\n",
"SELECT country_name, alpha_2_code\n",
"FROM `bigquery-public-data.utility_us.country_code_iso`\n",
@@ -92,14 +94,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test pandas profiling\n",
"\n",
"TODO(deflaux) its a known issue that pandas-profiler is broken in the current image. Enable this test after we update the package version."
"## Test pandas profiling"
]
},
{
"cell_type": "raw",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"import numpy as np\n",
"import pandas as pd\n",
@@ -162,14 +164,14 @@
"cell_type": "markdown",
"metadata": {},
"source": [
"## Test plotnine\n",
"\n",
"TODO(deflaux) enable this as part of https://github.com/DataBiosphere/terra-docker/issues/126"
"## Test plotnine"
]
},
{
"cell_type": "raw",
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": [
"from plotnine import ggplot, geom_point, aes, stat_smooth, facet_wrap\n",
"from plotnine.data import mtcars\n",
5 changes: 2 additions & 3 deletions terra-jupyter-python/tests/smoke_test.py
@@ -16,13 +16,12 @@ def test_pandas():

pd.DataFrame(
{
# TODO(deflaux) uncomment "A" and "F" after the pandas version upgrade.
# "A": 1.0,
"A": 1.0,
"B": pd.Timestamp("20130102"),
"C": pd.Series(1, index=list(range(4)), dtype="float32"),
"D": np.array([3] * 4, dtype="int32"),
"E": pd.Categorical(["test", "train", "test", "train"]),
# "F": "foo",
"F": "foo",
}
)
