Commit
Merge pull request #487 from aistabci/devel-202408
Devel 202408
ttakayuki authored Aug 30, 2024
2 parents df047b1 + b8eb21d commit 41292fa
Showing 26 changed files with 355 additions and 8 deletions.
5 changes: 5 additions & 0 deletions en/docs/gpu.md
Original file line number Diff line number Diff line change
@@ -31,6 +31,7 @@ The following is a list of CUDA Toolkit, cuDNN, and NCCL that can be used with t
| cuda/12.4 | 12.4.0 | Yes | Yes | Yes |
| cuda/12.4 | 12.4.1 | Yes | Yes | Yes |
| cuda/12.5 | 12.5.0 | Yes | Yes | Yes |
| cuda/12.5 | 12.5.1 | Yes | Yes | Yes |

[^1]: Provided only for experimental use. Rocky Linux 8.6 is supported with CUDA 11.7.1 or later.

@@ -49,6 +50,7 @@ Compute Node (V):
| 8.9.7 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| 9.0.0[^2] | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| 9.1.1 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| 9.2.1 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |

Compute Node (A):

@@ -63,6 +65,7 @@ Compute Node (A):
| 8.9.7 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| 9.0.0[^2] | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| 9.1.1 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| 9.2.1 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |

[^2]: We have confirmed that when cuDNN 9.0.0 is used with CUDA 11.0 to CUDA 11.3, an error occurs when calling the `cudnnRNNBackwardWeights_v8` function.

@@ -84,6 +87,7 @@ Compute Node (V):
| 2.19.3-1 | - | - | - | - | - | Yes | Yes | - | - |
| 2.20.5-1 | - | - | - | - | - | Yes | - | Yes | - |
| 2.21.5-1 | - | - | - | - | - | Yes | - | Yes | Yes |
| 2.22.3-1 | - | - | - | - | - | Yes | - | Yes | Yes |

Compute Node (A):

@@ -101,6 +105,7 @@ Compute Node (A):
| 2.19.3-1 | - | - | - | - | - | Yes | Yes | - | - |
| 2.20.5-1 | - | - | - | - | - | Yes | - | Yes | - |
| 2.21.5-1 | - | - | - | - | - | Yes | - | Yes | Yes |
| 2.22.3-1 | - | - | - | - | - | Yes | - | Yes | Yes |

## GDRCopy

58 changes: 58 additions & 0 deletions en/docs/open-ondemand/aihub.md
@@ -0,0 +1,58 @@
# AI Hub

## Overview

AI Hub is a collection of tools and services for reusing large-scale pre-trained models on ABCI. From Open OnDemand, you can use the `App for MLflow Server`.

The `App for MLflow Server` deploys the MLflow Tracking Server, an experiment management tool, so that it can be used per ABCI group and managed from a web UI.

Teams can use the deployed MLflow Tracking Server to record and share training histories and trained models during model development, from the ABCI compute nodes or from Jupyter Lab in Open OnDemand.

!!! caution
    The `App for MLflow Server` is released as an experimental feature.
    The service may change without notice, and responses to inquiries may take some time.

## Prerequisites

* An ABCI Cloud Storage bucket and an access key (required when creating an MLflow Tracking Server)
* See [How to Use ABCI Cloud Storage](../abci-cloudstorage/usage.md) for how to create them.

## Using AI Hub

To start the `App for MLflow Server`, click `AI Hub` and then `MLflow Server` from the menu.

When you start the `App for MLflow Server`, the following screen will be displayed.

![Screenshot of App for MLflow Server](img/app_for_mlflow_server.png){width=640}

#### Creating MLflow Tracking Server

* Following the on-screen instructions, enter the following items and click the `Create Service` button.

| Item | Description |
| -- | -- |
| `group_name` | ABCI Group |
| `env_name` | Environment Name |
| `cloud_storage_bucket_name` | Bucket Name |
| `cloud_storage_accesskey_ID` | Access Key ID |
| `cloud_storage_secret_accesskey` | Secret Access Key |

* Upon successful creation of the service, the "Operational status for requests" section will display "Service created".

#### Using MLflow Tracking Server

* Click the `Service List Update` button to display the available services under "Service List".
* You can start, stop, or delete services by operating the buttons under "Control Service".
* The status of the operation will be displayed in the "Operational status for requests" section.
* Please stop or delete services when they are no longer needed to conserve resources.
* If you need to configure Basic Authentication for the "MLflow Tracking Server", click the `Update Auth Info` button for the service.
* You must place a YAML file in the specified location beforehand, in the following format.

`{'user_name':'<username for Basic Authentication>', 'pass':'<password for Basic Authentication>'}`

* To access the MLflow UI, click on the URL under `URL for access from outside ABCI`.
* Enter your Basic Authentication username and password to log in.
* Use the MLflow Tracking Server while it is running.
* It can be accessed from the HPC Cluster's job services or Jupyter Lab in Open OnDemand.
* By specifying `URL for access from inside ABCI` as the MLflow API tracking URI, you can record AI model training histories and register models in the model registry.
* For specific usage of MLflow Tracking Server, please refer to the [MLflow documentation](https://mlflow.org/docs/latest/index.html).
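
As a rough illustration of the last few points, the sketch below shows one way a job script or a Jupyter Lab notebook might point the MLflow client at the server. It assumes the `mlflow` package is installed in your environment. The tracking URI, experiment name, and logged values are placeholders, and `mlflow_env` and `log_demo_run` are hypothetical helper names rather than part of AI Hub. `MLFLOW_TRACKING_URI`, `MLFLOW_TRACKING_USERNAME`, and `MLFLOW_TRACKING_PASSWORD` are environment variables read by the MLflow client; the credentials are only relevant if Basic Authentication applies to the URL you use.

```python
import os


def mlflow_env(tracking_uri, username=None, password=None):
    """Build the environment variables read by the MLflow client.

    tracking_uri is a placeholder for the value shown under
    `URL for access from inside ABCI`.
    """
    env = {"MLFLOW_TRACKING_URI": tracking_uri}
    # Credentials are added only when both are given (Basic Authentication).
    if username is not None and password is not None:
        env["MLFLOW_TRACKING_USERNAME"] = username
        env["MLFLOW_TRACKING_PASSWORD"] = password
    return env


def log_demo_run(tracking_uri, username=None, password=None):
    """Record one parameter and one metric on the tracking server."""
    import mlflow  # assumption: installed separately, e.g. `pip install mlflow`

    os.environ.update(mlflow_env(tracking_uri, username, password))
    mlflow.set_tracking_uri(tracking_uri)
    mlflow.set_experiment("abci-demo")  # hypothetical experiment name
    with mlflow.start_run():
        mlflow.log_param("learning_rate", 0.01)  # placeholder value
        mlflow.log_metric("loss", 0.42, step=1)  # placeholder value
```

Calling `log_demo_run("<URL for access from inside ABCI>")` while the service is running should make the run appear in the MLflow UI. Keeping the environment setup in a separate function makes it easy to reuse the same settings for the `mlflow` command-line tools.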
Binary file added en/docs/open-ondemand/images.pptx
Binary file not shown.
Binary file added en/docs/open-ondemand/img/email-otp.png
Binary file added en/docs/open-ondemand/img/login.png
Binary file added en/docs/open-ondemand/img/ondemand-top-page.png
52 changes: 52 additions & 0 deletions en/docs/open-ondemand/index.md
@@ -0,0 +1,52 @@
# Open OnDemand

## Overview

[Open OnDemand (OOD)](https://openondemand.org/) is a portal site for using ABCI through a web browser.

The following features are available in the web browser, making ABCI easier to use than before:

* Console operations on interactive nodes
* File operations in the home areas and the group areas
* Use of web applications such as Jupyter Lab

!!! caution
    Open OnDemand is released as an experimental feature.
    The service may change without notice, and responses to inquiries may take some time.

## Login

To log in to Open OnDemand, first open the URL [https://ood-portal.abci.ai/](https://ood-portal.abci.ai/).
After accessing `ood-portal.abci.ai`, you will be prompted to enter your username and password.
Please enter the username and password you set up on [the ABCI User Portal](https://portal.abci.ai/).

[![Input your username and password](img/login.png){width=640}](img/login.png)

After authenticating with your username and password, you will be asked to enter an access code.
The access code will be sent to your registered email address, so please enter the access code into the input form after receiving it.

[![Input the access code](img/email-otp.png){width=640}](img/email-otp.png)

After authenticating with the access code, you will be logged in to Open OnDemand.

[![Open OnDemand top page](img/ondemand-top-page.png){width=640}](img/ondemand-top-page.png)

!!! warning
    If an error occurs during login, please [contact](../contact.md) the administrator.


## Applications

You can access the features provided by Open OnDemand from the menu at the top of the screen.

[![Open OnDemand Application Menu](ood-menu.png)](ood-menu.png)

1. **Files**: Perform file operations in the browser.

2. **Jobs**: Edit and manage jobs in the browser.

3. **Clusters**: Open the console for the interactive nodes.

4. **Interactive Apps**: Launch web applications on the compute nodes and display their screens in the web browser.<br>For details, please refer to [Interactive Apps](interactive-apps.md).

5. **AI Hub**: AI Hub is a collection of tools and services for reusing large-scale pre-trained models on ABCI. It provides an application to manage the deployment of the MLflow Tracking Server, one of the features that constitute AI Hub.<br>For details, please refer to [AI Hub](aihub.md).
30 changes: 30 additions & 0 deletions en/docs/open-ondemand/interactive-apps.md
@@ -0,0 +1,30 @@
# Interactive Apps

Interactive apps are applications that run on the ABCI compute nodes and can be interactively operated in the web browser.

When launching an interactive app, you specify an ABCI group and a type of ABCI resources.
The interactive app is launched as a batch job that consumes ABCI points from the specified group and uses computational resources of the specified resource type.

Open OnDemand for ABCI provides the following interactive apps:

## Jupyter Lab

Open OnDemand for ABCI provides [Jupyter Lab](https://jupyter.org/), an interactive development environment.
Jupyter Lab is launched on the compute nodes, allowing you to operate it from the browser of your local workstation.

!!! caution
    Each time Jupyter Lab is launched, a Python virtual environment for Jupyter Lab will be created in the following path under your home directory. Please delete it periodically.

    ```
    ~/ondemand/data/sys/dashboard/batch_connect_sys/jupyter/output/
    ```
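
The periodic cleanup mentioned above could be scripted along the following lines. This is a minimal sketch, not an official tool: it assumes that each launch leaves its own subdirectory under the output path and that the subdirectory's modification time is a reasonable proxy for when it was last used. The names `stale_entries` and `clean` and the 30-day threshold are illustrative choices. Make sure no Jupyter Lab session is running before deleting anything.

```python
import shutil
import time
from pathlib import Path

# Path quoted in the caution above.
OUTPUT_DIR = Path.home() / "ondemand/data/sys/dashboard/batch_connect_sys/jupyter/output"


def stale_entries(root, max_age_days):
    """Return subdirectories of root not modified within max_age_days."""
    root = Path(root)
    if not root.is_dir():
        return []
    cutoff = time.time() - max_age_days * 86400
    return sorted(
        p for p in root.iterdir()
        # Skip symlinks so that only real directories are ever removed.
        if p.is_dir() and not p.is_symlink() and p.stat().st_mtime < cutoff
    )


def clean(root=OUTPUT_DIR, max_age_days=30, dry_run=True):
    """Delete stale subdirectories; with dry_run=True only report them."""
    targets = stale_entries(root, max_age_days)
    for path in targets:
        if dry_run:
            print(f"would remove {path}")
        else:
            shutil.rmtree(path)
    return targets
```

Running `clean()` first with the default `dry_run=True` lists what would be removed; `clean(dry_run=False)` then performs the deletion.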

## Qni

Open OnDemand for ABCI provides [Qni](https://qniapp.net/), an interactive quantum circuit designer and simulator that operates in the web browser.
Qni on ABCI offers simulations using the GPUs of ABCI compute nodes.

!!! caution
    Qni operates on resource types equipped with GPUs.

    Qni uses only one GPU. If you specify a resource type with multiple GPUs, the remaining GPUs will not be used.
Binary file added en/docs/open-ondemand/ood-menu.png
8 changes: 4 additions & 4 deletions en/docs/system-overview.md
@@ -159,7 +159,7 @@ The software available on the ABCI system is shown below.
| OS | Rocky Linux | 8.6 | - |
| OS | Red Hat Enterprise Linux | - | 8.2 |
| Job Scheduler | Altair Grid Engine | 8.6.19_C121_1 | 8.6.19_C121_1 |
| Development Environment | [CUDA Toolkit](gpu.md#cuda-toolkit) | 11.2.2<br>11.6.2<br>11.7.1<br>11.8.0<br>12.1.1<br>12.2.0<br>12.3.2<br>12.4.0<br>12.4.1<br>12.5.0 | 11.2.2<br>11.6.2<br>11.7.1<br>11.8.0<br>12.1.1<br>12.2.0<br>12.3.2<br>12.4.0<br>12.4.1<br>12.5.0 |
| Development Environment | [CUDA Toolkit](gpu.md#cuda-toolkit) | 11.2.2<br>11.6.2<br>11.7.1<br>11.8.0<br>12.1.1<br>12.2.0<br>12.3.2<br>12.4.0<br>12.4.1<br>12.5.0<br>12.5.1 | 11.2.2<br>11.6.2<br>11.7.1<br>11.8.0<br>12.1.1<br>12.2.0<br>12.3.2<br>12.4.0<br>12.4.1<br>12.5.0<br>12.5.1 |
| | Intel oneAPI<br>(compilers and libraries) | 2024.0.2 | 2024.0.2 |
| | Intel VTune | 2024.0.0 | 2024.0.0 |
| | Intel Trace Analyzer and Collector | 2022.0 | 2022.0 |
@@ -170,7 +170,7 @@ The software available on the ABCI system is shown below.
| | [Python](python.md) | 3.10.14<br>3.11.9<br>3.12.2 | 3.10.14<br>3.11.9<br>3.12.2 |
| | Ruby | 2.5.9-229 | 2.5.5-157 |
| | R | 4.3.3 | 4.3.3 |
| | Java | 1.8.0.402<br>11.0.22.0.7<br>17.0.10.0.7 | 1.8.0.402<br>11.0.22.0.7<br>17.0.10.0.7 |
| | Java | 1.8.0.422<br>11.0.24.0.8<br>17.0.12.0.7 | 1.8.0.422<br>11.0.24.0.8<br>17.0.12.0.7 |
| | Scala | 2.10.6 | 2.10.6 |
| | Perl | 5.26.3 | 5.26.3 |
| | Go | 1.22.2 | 1.22.2 |
@@ -181,8 +181,8 @@ The software available on the ABCI system is shown below.
| Container | [SingularityPRO](containers.md#singularity) | 4.1.2-2 | 4.1.2-2 |
| | Singularity Endpoint | 2.3.0 | 2.3.0 |
| MPI | [Intel MPI](mpi.md#intel-mpi) | 2021.11 | 2021.11 |
| Library | [cuDNN](gpu.md#cudnn) | 8.1.1<br>8.3.3<br>8.4.1<br>8.6.0<br>8.7.0<br>8.8.1<br>8.9.7<br>9.0.0<br>9.1.1 | 8.1.1<br>8.3.3<br>8.4.1<br>8.6.0<br>8.7.0<br>8.8.1<br>8.9.7<br>9.0.0<br>9.1.1 |
| | [NCCL](gpu.md#nccl) | 2.8.4-1<br>2.11.4-1<br>2.12.12-1<br>2.13.4-1<br>2.14.3-1<br>2.15.5-1<br>2.16.2-1<br>2.17.1-1<br>2.18.5-1<br>2.19.3-1<br>2.20.5-1<br>2.21.5-1 | 2.8.4-1<br>2.11.4-1<br>2.12.12-1<br>2.13.4-1<br>2.14.3-1<br>2.15.5-1<br>2.16.2-1<br>2.17.1-1<br>2.18.5-1<br>2.19.3-1<br>2.20.5-1<br>2.21.5-1 |
| Library | [cuDNN](gpu.md#cudnn) | 8.1.1<br>8.3.3<br>8.4.1<br>8.6.0<br>8.7.0<br>8.8.1<br>8.9.7<br>9.0.0<br>9.1.1<br>9.2.1 | 8.1.1<br>8.3.3<br>8.4.1<br>8.6.0<br>8.7.0<br>8.8.1<br>8.9.7<br>9.0.0<br>9.1.1<br>9.2.1 |
| | [NCCL](gpu.md#nccl) | 2.8.4-1<br>2.11.4-1<br>2.12.12-1<br>2.13.4-1<br>2.14.3-1<br>2.15.5-1<br>2.16.2-1<br>2.17.1-1<br>2.18.5-1<br>2.19.3-1<br>2.20.5-1<br>2.21.5-1<br>2.22.3-1 | 2.8.4-1<br>2.11.4-1<br>2.12.12-1<br>2.13.4-1<br>2.14.3-1<br>2.15.5-1<br>2.16.2-1<br>2.17.1-1<br>2.18.5-1<br>2.19.3-1<br>2.20.5-1<br>2.21.5-1<br>2.22.3-1 |
| | gdrcopy | 2.4.1 | 2.4.1 |
| | UCX | 1.10 | 1.11 |
| | libfabric | 1.7.0-1 | 1.9.0rc1-1 |
21 changes: 21 additions & 0 deletions en/docs/system-updates.md
@@ -1,5 +1,26 @@
# System Updates

## 2024-08-30 {#2024-08-30}

| Add / Update / Delete | Software | Version | Previous version |
|:--|:--|:--|:--|
| Update | openjdk | 1.8.0.422 | 1.8.0.402 |
| Update | openjdk | 11.0.24.0.8 | 11.0.22.0.7 |
| Update | openjdk | 17.0.12.0.7 | 17.0.10.0.7 |

## 2024-08-08

| Add / Update / Delete | Software | Version | Previous version |
|:--|:--|:--|:--|
| Add | nccl | 2.22.3-1 | |

## 2024-07-31

| Add / Update / Delete | Software | Version | Previous version |
|:--|:--|:--|:--|
| Add | cuda | 12.5.1 | |
| Add | cudnn | 9.2.1 | |

## 2024-06-28

* The specific group area (/projects) is no longer available.
4 changes: 4 additions & 0 deletions en/mkdocs.yml
@@ -53,6 +53,10 @@ nav:
- 'Using s3fs-fuse': 'abci-cloudstorage/s3fs-usage.md'
- 'ABCI Datasets': 'abci-datasets.md'
- 'ABCI Singularity Endpoint': 'abci-singularity-endpoint.md'
- 'Open OnDemand':
- 'Using Open OnDemand': 'open-ondemand/index.md'
- 'Interactive Apps': 'open-ondemand/interactive-apps.md'
- 'AI Hub': 'open-ondemand/aihub.md'
- 'FAQ': 'faq.md'
- 'Known Issues': 'known-issues.md'
- 'System Updates': 'system-updates.md'
5 changes: 5 additions & 0 deletions ja/docs/gpu.md
@@ -31,6 +31,7 @@ ABCIシステムでは、NVIDIAが提供する以下のライブラリが利用
| cuda/12.4 | 12.4.0 | Yes | Yes | Yes |
| cuda/12.4 | 12.4.1 | Yes | Yes | Yes |
| cuda/12.5 | 12.5.0 | Yes | Yes | Yes |
| cuda/12.5 | 12.5.1 | Yes | Yes | Yes |

[^1]: Provided only for experimental use. Rocky Linux 8.6 is supported with CUDA 11.7.1 or later.

@@ -49,6 +50,7 @@ ABCIシステムでは、NVIDIAが提供する以下のライブラリが利用
| 8.9.7 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| 9.0.0[^2] | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| 9.1.1 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| 9.2.1 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |

Compute Node (A):

@@ -63,6 +65,7 @@ ABCIシステムでは、NVIDIAが提供する以下のライブラリが利用
| 8.9.7 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| 9.0.0[^2] | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| 9.1.1 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |
| 9.2.1 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes |

[^2]: We have confirmed that when cuDNN 9.0.0 is used with CUDA 11.0 to CUDA 11.3, an error occurs when calling the `cudnnRNNBackwardWeights_v8` function.

@@ -84,6 +87,7 @@ ABCIシステムでは、NVIDIAが提供する以下のライブラリが利用
| 2.19.3-1 | - | - | - | - | - | Yes | Yes | - | - |
| 2.20.5-1 | - | - | - | - | - | Yes | - | Yes | - |
| 2.21.5-1 | - | - | - | - | - | Yes | - | Yes | Yes |
| 2.22.3-1 | - | - | - | - | - | Yes | - | Yes | Yes |

Compute Node (A):

@@ -101,6 +105,7 @@ ABCIシステムでは、NVIDIAが提供する以下のライブラリが利用
| 2.19.3-1 | - | - | - | - | - | Yes | Yes | - | - |
| 2.20.5-1 | - | - | - | - | - | Yes | - | Yes | - |
| 2.21.5-1 | - | - | - | - | - | Yes | - | Yes | Yes |
| 2.22.3-1 | - | - | - | - | - | Yes | - | Yes | Yes |

## GDRCopy
