Gpu accelerates aimodeldevelopmentandanalyticsutilizingelasticsearchandazure ai

GPU で加速する AI モデル作成と解析
〜 Elasticsearch と Azure AI を活⽤〜
鈴⽊章太郎
Elastic テクニカルプロダクトマーケティングマネージャー/エバンジェリスト
内閣官房 IT 総合戦略室政府 CIO 補佐官

Elastic
Technical Product Marketing
Manager/Evangelist
内閣官房 IT 総合戦略室
政府 CIO 補佐官
元 Microsoft Technical Evangelist
Twitter : @shosuz
Shotaro Suzuki

Agenda
• AI ソリューションの課題
• Microsoft AI + Elastic による解決

AI ソリューションの課題

機械学習の知識 + Cloud インフラの知識

Microsoft AI + Elastic による解決

Microsoft Azure の⼈⼯知能
AI 向けのスケーラブルで信頼性の⾼い
クラウドプラットフォーム
• データストレージ
• 計算
• サービス
機械学習モデルのトレーニング、デプロイ、および
管理のためのプラットフォーム
開発者が AI ソリューションの構築に使⽤できる
⼀連のサービス
ボットを開発および管理するためのクラウドベース
のプラットフォーム

Azure Cloud Services
Compute (Container) / Storage
Python & R SDK
üデータの加⼯
üモデルの学習
üモデルの管理
üモデルの展開と追跡
機械学習 / 深層学習を Azure で⾏うための
ベストプラクティス

Notebooks Automated ML UX Designer
Reproducibility Automation Deployment Re-training
CPU, GPU, FPGAs IoT Edge
モデルの構築・展開を、個⼈から企業レベルでも

Elastic テクノロジー概要
Elastic スタックで実現
Kibana
Elasticsearch
Beats Logstash
Elastic エンタープライズサーチ Elastic セキュリティ
Elastic オブザーバビリティ
3つのソリューション
SaaS
(AWS/Azure/GCP)
IaaS
(クラウド & オンプレ）
Elastic Cloud
on Kubernetes
Elastic Cloud Elastic Cloud
Enterprise
豊富なデプロイ選択肢
Kubernetes
(クラウド & オンプレ）

Azure サービスとの連携パターン (例)
Seamless connectivity with Beats, Logstash and Azure Serverless
Elasticsearch Service
Elastic Cloudon Kubernetes
Elastic Cloud Enterprise
Elastic Stack
beats
Azure Functions,
etc
Logstash
Modules
Connectors
Logic App
Modules
SaaS
VM
Azure
Monitor
WebApp
Storage
EventHub
IoTHub
Database
ログファイル
Windows イベントログ
メトリック
Audit
etc.
Azure Blob Storage
Azure Event Hub
Azure IoT Hub
HTTP
etc.
Azure Blob Storage
Azure Table Storage
Azure Service Bus
Azure Event Hub
Azure IoT Hub
Azure SQL Database
Azure Database for MySQL
Azure Database for PostgreSQL
Azure Database for MariaDB
Application Insights

あらゆるスキル
レベルに
モデルの
ライフサイクル
管理 (MLOps)
責任ある ML
オープンで
相互運⽤可能

• あらゆるスキルレベルに対応

Designer SDK for Code
AutoML in GUI

MLOps による全ライフサイクル管理

Model reproducibility Model
retraining
Model deployment
Model validation
Train
model
Validate
model
Deploy
model
Monitor
model
Retrain model
Build app
Collaborate Test app Release
app
Monitor
app
App Developer
CI / CD Tools
Data Scientist
Azure Machine Learning
GitHub Actions or Azure DevOps による⾃動化
監査証跡の管理とモデルの解釈可能性

マイクロサービス
コンテナ
CI / CD
オーケストレーション
サーバレスクラウド
ソフトウェアの開発⽅法とデリバリは常に進化

10
の組織は
種類以上の
監視ツールを使⽤
65%

開発チーム
運⽤:
ログ監視
可動性
応答時間
アップタイム
ツール
運⽤︓
インフラ監視
ウェブログ
アプリログ
データベースログ
コンテナログ
ログツール
リアルユーザー監視
トランザクション
パフォーマンス監視
分散トレーシング
APM ツール
運⽤︓
サービス監視
コンテナ指標
ホスト指標
データベース指標
ネットワーク指標
ストレージ指標
メトリック
ツール
ビジネス KPI
ビジネスツール
ビジネス
チーム
現状ー典型的なオブザーバビリティのツール群

開発、運⽤、ビジネスチーム
Elastic のオブザーバビリティへのアプローチ
APM データ
アップタイム
データ
指標データ
ログデータ
ビジネス
データ
全ての運⽤にまつわるデータを、⼀つの強⼒なデータストアに集約 -
Elasticsearch

オープンで相互運⽤可能、責任ある ML

開発ツール⾔語フレームワーク

モデルの解釈プライバシーの保護展開の制御

• 機械学習のためのデータ管理

https://docs.microsoft.com/ja-jp/azure/machine-learning/service/how-to-track-experiments
メトリック、データ、モデル等の
⼤事な資産の共有と運⽤管理
Experiment
実験
メトリック
データセットモデル
Workspace
バージョン管理
オンプレミスへのデプロイ
パラメータ値
モデル精度の可視化
データ定義の管理
スナップショット

機械学習パイプライン構築、テスト、デプロイするためのビジュアルワークフロー
• 直感的なマウス操作によるパイプライン構築
• 特徴量エンジニアリング
• モデル学習 (回帰、分類、クラスタリング)
• 推論 (リアルタイム & バッチ推論)
• カスタムモデル・スクリプト (Python, R)

Docker
クラウド
⾼性能エッジ
軽量エッジ
Model
アプリケーション
リアルタイム分析
外観検査、顧客分析、
製造プロセス⾃動化 …
⾃動運転 …
Azure Portal
プログラム or GUI で操作可能

Workspace
https://docs.microsoft.com/ja-jp/azure/machine-learning/service/concept-azure-machine-learning-architecture
https://docs.microsoft.com/ja-jp/azure/machine-learning/service/concept-workspace

• 学習作業毎の Compute Resource 論理単位
– 学習
– モデル
• 複数の Workspace
• Azure Resources
– Azure Container Resistor – モデルを Docker Container 化してセキュアに管理
– Azure Storage – 学習データ、テレメトリー、モデルファイルなどのストレージ
– Azure Application Insight – モデルのモニタリング
– Azure Key Vault – 学習・推論 Compute のクレデンシャルなどセキュアな情報管理

• Model
– 機械学習の結果のファイル
• Azure Machine Learning
servicers 以外で作成した
モデルも扱える
– 様々な機械学習 / 深層学習
のフレームワークの Model
– Workspace の中で管理
• Model Registry
– ラベル付けでのバージョン管理
– 追加のメタデータ
– Image 化して使っているもの
は削除できない

• 1: Model Registry へ登録
• 2: Image Registry (Azure Container Registry) へ登録
• 3: Image を展開
• 4: Model の監視

• Python Script の学習ジョブ
– Workspace で管理
– 実⾏ログの保存
• Timestamp, duration など
• 標準⼊出⼒
– 実⾏ Compute 環境をアタッチ
• ジョブ毎に変更可能

• 学習の環境
– Azure Compute の抽象化
• 学習ジョブ単位でアタッチ
コンピューティングターゲット GPU アクセラレーション
⾃動
ハイパーパラメーター調整
⾃動
機械学習
パイプライン親和性
ローカルコンピューター可能性あり ✓
Azure Machine
Learning コンピューティング
✓ ✓ ✓ ✓
リモート VM ✓ ✓ ✓ ✓
Azure Databricks ✓ ✓*
Azure Data Lake
Analytics
✓*
Azure HDInsight ✓

• Image
– Model
– Model の⼊出⼒を抽象化した Script
– Model もしくは Scoring 実⾏の依存関係
• 2 種類
– FPGA Image: Azure 内の FPGAクラスターへ
– Docker Image: 任意の場所の Docker 実⾏環境へ
• Image Registry
– Model から作成された
Image の管理
– メタデータ, 検索

+ + = Increasing storage costs
Difficult to track & audit
Insecure and fragile
Sources Formats
Environments Challenges

– Azure Storage Account の抽象化
• Azure Blob
• Azure File
– Workspace でデフォルトの Data Store を持つ
• 追加可能
– Python SDK もしくは Azure CLI から制御

簡単な統合
- Filebeat Azure Module
- Metricbeat Azure Module

[Metricbeat] Azure からのメトリックの収集
オプション
Azure Monitor metrics
Metricbeat on Azure instances
Metricbeat
Azure module
Any machine
Ingest Node
Elasticsearch
Data Node

サポートされている Azure メトリックス
Metricsets (Azure モニター経由)
- モニター
- compute_vm
- compute_vm_scaleset
- ストレージ (BLOB、テーブル、
キュー、ファイル)
- データベースアカウント
- コンテナー
Azure メトリックスの機能
- 集計
- ディメンション
- タイムグレイン
metrics:
- name: "Requests"
namespace: "Microsoft.ApiManagement/service"
aggregations: ["Maximum"]
timegrain: "PT1M"
dimensions:
- name: "Hostname"
value: "apimanagement.azure-api.net"

[Filebeat] Azure でのログ/イベントの収集
オプション
Filebeat on VM instances
or containers (daemonset)
Ingest Node
Elasticsearch
Data Node
Event Hub Filebeat

Azure ログのモジュール
アクティビティ, サインイン, 監査
• アクティビティログ
‒ サブスクリプション内のリソースに対して
実⾏された操作に関する洞察を提供
• サインインログ
‒ マネージアプリケーションとユーザーの
サインインアクティビティの使⽤状況に
関する情報を提供
• 監査ログ
‒ Azure AD 内の様々な機能によって
⾏われたすべての変更に対して、ログを
通じてトレーサビリティを提供

DATA SCIENTIST
DEVELOPER
DATA ENGINEER

主要な深層学習・機械学習ライブラリの抽象化クラス
from azureml.train.estimator import Estimator
script_params = { ‘--learning-rate’: 0.3, '--regularization': 0.8 }
est = Estimator(source_directory=script_folder,
script_params=script_params,
compute_target=compute_target,
entry_script='train.py’,
conda_packages=['scikit-learn'])

LightGBM Horovod
https://docs.microsoft.com/ja-JP/azure/machine-learning/service/how-to-train-ml-models
参考︓Azure Machine Learning で Estimator を使⽤してモデルをトレーニングする

• Python SDK Training 利⽤ステップ

• Workspace への Configuration 設定
• Compute 設定
• DataStore 設定
– データのアップロード
• Script ランタイム設定 (Estimator)
– Entry Point
– 依存パッケージ
• Job Submit
• 結果の確認とモデル保存
• モデルの展開

• Scikit-learn での MNIST データセットをロジスティック回帰で
処理
– MNIST
• ⼿書きの数字 – 0 から 9
• 70,000 データ
• 28x28 pixels
– 数字の分類

from azureml.core import Workspace
ws = Workspace.create(name='myworkspace',
subscription_id='<azure-subscription-id>',
resource_group='myresourcegroup',
create_resource_group=True,
location='eastus2' # or other supported Azure region
)
# see workspace details
ws.get_details()
Step 2 – Create an Experiment
experiment_name = ‘my-experiment-1'
from azureml.core import Experiment
exp = Experiment(workspace=ws, name=experiment_name)
Step 1 – Create a workspace

Step 3 – Create remote compute target
# choose a name for your cluster, specify min and max nodes
compute_name = os.environ.get("BATCHAI_CLUSTER_NAME", "cpucluster")
compute_min_nodes = os.environ.get("BATCHAI_CLUSTER_MIN_NODES", 0)
compute_max_nodes = os.environ.get("BATCHAI_CLUSTER_MAX_NODES", 4)
# This example uses CPU VM. For using GPU VM, set SKU to STANDARD_NC6
vm_size = os.environ.get("BATCHAI_CLUSTER_SKU", "STANDARD_D2_V2")
provisioning_config = AmlCompute.provisioning_configuration(
vm_size = vm_size,
min_nodes = compute_min_nodes,
max_nodes = compute_max_nodes)
# create the cluster
print(‘ creating a new compute target... ')
compute_target = ComputeTarget.create(ws, compute_name, provisioning_config)
# You can poll for a minimum number of nodes and for a specific timeout.
# if no min node count is provided it will use the scale settings for the cluster
compute_target.wait_for_completion(show_output=True,
min_node_count=None, timeout_in_minutes=20)

# note that while loading, we are shrinking the intensity values (X) from 0-255 to 0-1 so that
the model converge faster.
X_train = load_data('./data/train-images.gz', False) / 255.0
y_train = load_data('./data/train-labels.gz', True).reshape(-1)
X_test = load_data('./data/test-images.gz', False) / 255.0
y_test = load_data('./data/test-labels.gz', True).reshape(-1)
圧縮されたデータを numpy へロード。 ʻload_dataʼ はカスタム関数。
Data Store 設定。これで、どこからでも Azure Storage 上への読み書きが可能に。
ds = ws.get_default_datastore()
print(ds.datastore_type, ds.account_name, ds.container_name)
ds.upload(src_dir='./data', target_path='mnist', overwrite=True, show_progress=True)
これで学習の準備が完了
Step 4 – Upload data to the cloud

%%time from sklearn.linear_model import LogisticRegression
clf = LogisticRegression()
clf.fit(X_train, y_train)
# Next, make predictions using the test set and calculate the accuracy
y_hat = clf.predict(X_test)
print(np.average(y_hat == y_test))
Model の Accuracy の結果が表⽰される [0.915 位]
Scikit-learn の logistic regression の学習ジョブを実⾏。通常は数分で終了。
Step 5 – Train a local model

リモート Computer で実⾏する場合には、以下のステップが必要
• 6.1: Create a directory
• 6.2: Create a training script
• 6.3: Create an estimator object
• 6.4: Submit the job
Step 6.1 – Create a directory
import os
script_folder = './sklearn-mnist' os.makedirs(script_folder, exist_ok=True)
Step 6 – Train model on remote cluster

%%writefile $script_folder/train.py
# load train and test set into numpy arrays
# Note: we scale the pixel intensity values to 0-1 (by dividing it with 255.0) so
# the model can converge faster.
# ‘data_folder’ variable holds the location of the data files (from datastore)
Reg = 0.8 # regularization rate of the logistic regression model.
X_train = load_data(os.path.join(data_folder, 'train-images.gz'), False) / 255.0
X_test = load_data(os.path.join(data_folder, 'test-images.gz'), False) / 255.0
y_train = load_data(os.path.join(data_folder, 'train-labels.gz'), True).reshape(-1)
y_test = load_data(os.path.join(data_folder, 'test-labels.gz'), True).reshape(-1)
print(X_train.shape, y_train.shape, X_test.shape, y_test.shape, sep = '¥n’)
# get hold of the current run
run = Run.get_context()
#Train a logistic regression model with regularizaion rate of’ ‘reg’
clf = LogisticRegression(C=1.0/reg, random_state=42)
clf.fit(X_train, y_train)
Step 6.2 – Create a Training Script (1/2)

print('Predict the test set’)
y_hat = clf.predict(X_test)
# calculate accuracy on the prediction
acc = np.average(y_hat == y_test)
print('Accuracy is', acc)
run.log('regularization rate', np.float(args.reg))
run.log('accuracy', np.float(acc)) os.makedirs('outputs', exist_ok=True)
# The training script saves the model into a directory named ‘outputs’. Note files saved
# in the outputs folder are automatically uploaded into experiment record. Anything written
# in this directory is automatically uploaded into the workspace.
joblib.dump(value=clf, filename='outputs/sklearn_mnist_model.pkl')
Step 6.2 – Create a Training Script (2/2)

• 学習中のメトリックを個別に保存
– Standard Input / Output 以外に
from import
https://docs.microsoft.com/ja-jp/azure/machine-learning/service/how-to-track-experiments

• DataStore のデフォルトの場所への保存
/output/
model_file_name = 'ridge_{0:.2f}.pkl'.format(alpha)
# save model in the outputs folder so it automatically get uploaded
with open(model_file_name, "wb") as file:
joblib.dump(value=reg, filename=os.path.join('./outputs/', model_file_name))

Estimator が学習ジョブを実⾏
from azureml.train.estimator import Estimator
script_params = { '--data-folder': ds.as_mount(), '--regularization': 0.8 }
est = Estimator(source_directory=script_folder,
script_params=script_params,
compute_target=compute_target,
entry_script='train.py’,
conda_packages=['scikit-learn'])
Step 6.4 – Submit the job to the cluster for training
run = exp.submit(config=est)
run
Step 6.3 – Create an Estimator

Post-Processing
./outputs ディレクトリーに実⾏結果を出⼒。
Workspace からそれぞれアクセスできる
Running
Compute Target へ必要なスクリプトなどがコピー。
その後、Data Store がマウントもしくは
データのコピーが⾏われる。その後 entry_script で指定した
script ファイルが実⾏される。
ジョブの実⾏中に stdout は /logs に
ストリーム出⼒。ジョブ実⾏中にも確認が出来る
Image creation
Estimator で指定されたパラメーターを元に Docker
Image のビルド。Workspace へ登録。約5分。
初回のみ。2度⽬からはキャッシュからロードされる。
Docker Build の状況は Docker Build の
ログから確認できる
Scaling
学習⽤の Cluster が更に Compute
Resource が必要になると、⾃動的に追加。
Scale out は通常5分程度

Step 7 – Monitor a run
Jupyter widget でジョブの状態をモニタリング。10-15 秒程度の遅延で⾮同期で表⽰される
from azureml.widgets import RunDetails
RunDetails(run).show()
Azure Machine Learning services の widget の例:

Step 8 – See the results
モデルの学習ジョブは⾮同期で実⾏される。ここでは、それが終わるまで待機するシグナル送信。
wait_for_completion にて
run.wait_for_completion(show_output=False)
# now there is a trained model on the remote cluster
print(run.get_metrics())
{'regularization rate': 0.8, 'accuracy': 0.9204}

Step 9 – Register the model
ファイルを ʻoutputs/sklearn_mnist_model.pklʼ へ出⼒。
ʻoutputsʼ ディレクトリーは、ジョブを実⾏した仮想マシンの中。
• outputs は特別なディレクトリー。この中の全てのファイルは、Workspace のストレージへコピーされる。
• 実⾏ジョブ履歴
• Modelファイル
• など
joblib.dump(value=clf, filename='outputs/sklearn_mnist_model.pkl')
トレーニングジョブの最終ステップを呼び出し:
# register the model in the workspace
model = run.register_model (
model_name='sklearn_mnist’,
model_path='outputs/sklearn_mnist_model.pkl’)
Model が Workspace に登録され、クエリできるようになる

Step 9.1 – Create the scoring script
Scoring ⽤の Script作成。score.py。Web Services として設定される
必須 function: init() と run (input data)
from azureml.core.model import Model
def init():
global model
# retreive the path to the model file using the model name
model_path = Model.get_model_path('sklearn_mnist’)
model = joblib.load(model_path)
def run(raw_data):
data = np.array(json.loads(raw_data)['data’])
# make prediction
y_hat = model.predict(data) return json.dumps(y_hat.tolist())

Step 9.2 – Create environment file
environment file の作成。ここでは myenv.yml。 Script実⾏のための依存関係 Package を指定したもの。
Docker Image 作成時に使⽤される。
このサンプルでは、 scikit-learn と azureml-sdk が指定されている
from azureml.core.conda_dependencies import CondaDependencies
myenv = CondaDependencies()
myenv.add_conda_package("scikit-learn")
with open("myenv.yml","w") as f:
f.write(myenv.serialize_to_string())
Step 9.3 – Create configuration file
展開⽤の configuration ファイルと、CPU数、GB単位のRAM容量など、ACI 作成に必要なパラメーターを設定。
デフォルト値:1 core 1 gigabyte RAM
from azureml.core.webservice import AciWebservice
aciconfig = AciWebservice.deploy_configuration(cpu_cores=1, memory_gb=1,
tags={"data": "MNIST", "method" : "sklearn"},
description='Predict MNIST with sklearn')

Step 9.4 – Deploy the model to ACI
%%time
from azureml.core.webservice import Webservice
from azureml.core.image import ContainerImage
# configure the image
image_config = ContainerImage.image_configuration(
execution_script ="score.py",
runtime ="python",
conda_file ="myenv.yml")
service = Webservice.deploy_from_model(workspace=ws, name='sklearn-mnist-svc’,
deployment_config=aciconfig,
models=[model],
image_config=image_config)
service.wait_for_deployment(show_output=True)
•
•
•

Step 10 – Test the deployed model
using the HTTP end point
import requests
import json
# send a random row from the test set to score
random_index = np.random.randint(0, len(X_test)-1)
input_data = "{¥"data¥": [" + str(list(X_test[random_index])) + "]}"
headers = {'Content-Type':'application/json’}
resp = requests.post(service.scoring_uri, input_data, headers=headers)
print("POST to url", service.scoring_uri)
#print("input data:", input_data)
print("label:", y_test[random_index])
print("prediction:", resp.text)

• Elastic Observability を使⽤した NVIDIA
GPU メトリックの監視

依存関係 (1)
• NVIDIA GPU メトリックを稼働させるには、ソースコード（Go）から NVIDIA
GPU 監視ツールを構築
• NVIDIA GPU は、Microsoft Azure、Google Cloud 、AmazonWeb
Services（AWS）などの多くのクラウドプロバイダーから⼊⼿可能
（図は Genesis Cloud)
• NVIDIA の Ubuntu18.04 ⽤ DCGM スタートガイドのインストールセクションに
従って、NVIDIA Datacenter Manager をインストール
• <architecture>パラメーターを独⾃のパラメーターに置き換えることに特に注意
• unameコマンドを使⽤してアーキテクチャを⾒つける
uname - a
X86_64がアーキテクチャであるとの回答。従ってスタートガイドのステップ1は次の通り
echo "deb
http://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64 /"
| sudo tee /etc/apt/sources.list.d/cuda.list
sudo apt-key adv --fetch-keys
https://developer.download.nvidia.com/compute/cuda/repos/$distribution/x86_64/7
fa2af80.pub

依存関係 (2)
• インストール後、nvidia-smi コマンドを実⾏することで、GPU 詳細を⾒ることができる

• NVIDIA の gpu-monitoring-tools をビルドするには、Golang をインストールする必要あり
• NVIDIA の gpu-monitoring-tools を GitHub からインストールして、NVIDIA のセットアップを終了
cd /tmp
wget https://golang.org/dl/go1.15.7.linux-amd64.tar.gz
sudo mv go1.15.7.linux-amd64.tar.gz /usr/local/
cd /usr/local/
sudo tar -zxf go1.15.7.linux-amd64.tar.gz
sudo rm go1.15.7.linux-amd64.tar.gz
cd /tmp
git clone https://github.com/NVIDIA/gpu-monitoring-tools.git
cd gpu-monitoring-tools/
sudo env "PATH=$PATH:/usr/local/go/bin" make install

• Metricbeat をインストールする準備が整ったので、elastic.co で Metricbeat の最新版を確認
• 以下のコマンドでバージョン番号を調整
• この場合は 7.10.2 がバージョン番号
cd /tmp
wget https://artifacts.elastic.co/downloads/beats/metricbeat/metricbeat-7.10.2-amd64.deb
sudo dpkg -i metricbeat-7.10.2-amd64.deb #

• Elastic Stack を起動して実⾏
• 新しい GPU モニタリングデータ⽤のホームが必要なため、Elastic Cloud に新しいデプロイメントを作成
• Elastic Cloud を初めて使う場合は、14⽇間の無料トライアルにサインアップ
• 独⾃の展開をローカルで設定することも可能
• ElasticCloud に新しい ElasticObservability デプロイメントを作成

• Metricbeat の構成ファイルは /etc/metricbeat/metricbeat.yml
• 前ページのセットアップで取得したパラメーター cloud.id と cloud.auth を編集
• 構成変更例︓
• Metricbeat の⼊⼒構成はモジュール式
• NVIDIA gpu-monitoring-tools は Prometheus を介して GPU メトリックを公開するので、先に
PrometheusMetricbeat モジュールを有効化
• Metricbeat の test コマンドと modules コマンドを使⽤して、Metricbeat の構成が成功したことを確認
cloud.id:
"staging:dXMtY2VudHJhbDEuZ2NwLmNsb3VkLmVzLmlvJDM4ODZkYmUwMWNjODQ2NDM4YjRlNzg5OWEyZD
AwNGM5JDBiMTc0YzYyMTVlYTQwYWQ5M2NmMGY4MjVhNzJmOGRk"
cloud.auth: "elastic:J7KYiDku2wP7DFr62zV4zL4y"
sudo metricbeat modules enable prometheus
sudo metricbeat test config

sudo metricbeat test output
sudo metricbeat modules list
sudo metricbeat setup
• 左記の例のように構成テストがうまくいかない場合は、
Metricbeat トラブルシューティングガイドを参照
• Metricbeat の構成の最後に、setup コマンドを実⾏
• いくつかのデフォルトダッシュボードをロードし、インデックス
マッピングを設定
• セットアップコマンドの実⾏には通常数分かかる

dcgm-exporter --address localhost:9090 #
Output INFO[0000] Starting dcgm-exporter
INFO[0000] DCGM successfully initialized!
INFO[0000] Not collecting DCP metrics: Error
getting supported metrics: This request is
serviced by a module of DCGM that is not
currently loaded INFO[0000] Pipeline starting
INFO[0000] Starting webserver
• NVIDIA の dcgm-exporter を起動
• 注: DCP 警告は無視できる
• dcgm-exporter メトリックの設定は、ファイル
/etc/dcgm-exporter/default-
counters.csv で定義され、デフォルトでは 38 個の
異なるメトリックが定義されている。
• 使⽤可能な値の完全なリストについては、DCGM ライブ
ラリ API リファレンスガイドを確認
• 別のコンソールで、Metric Beat を起動
sudo metricbeat -e
• Kibana で「metricbeat-*」インデックスパターンを更新
• [Stack Management] > [Kibana] > [Index
Patterns] に移動して、リストから metricbeat-* インデ
ックスパターンを選択
• 次に「フィールドリストを更新」をクリック

• GPU メトリクスが Kibana で利⽤できる
• 新しいフィールド名の前には
prometheus.metrics.DCGM_
• Kibana の Discover で確認
• これで、Elastic Observability で GPU
メトリクスを分析する準備が整う
• Metrics Explorer で GPU と CPU のパフォーマンスを⽐較できる
• インベントリビューで GPU 利⽤のホットスポットを⾒つけられる

• これらはほんの⼀部のモニタリング⽅法
• Elastic Observability を使えば、すべての⽬標に取り組むことが可能
• NVIDIA による監視するのに適した GPU の他の例をいくつかご紹介︓
• GPU temperature: GPU の温度。ホットスポットのチェック
• GPU power usage: GPUの電⼒使⽤量。予想以上に電⼒使⽤量が多い⇒HWの問題の可能性
• Current clock speeds: 現在のクロックスピード。想定よりも低い⇒パワーキャッピングやHWの問題
• また、GPUの負荷をシミュレーションする必要がある場合は、dcgmproftester10 コマンドを使⽤
dcgmproftester10 --no-dcgm-validation -t 1004 -d 30

まとめ
• AI ソリューションの課題
• Microsoft AI + Elastic による解決

Open Source Repo Link
Azure ML Notebook
Examples
Azure Machine Learning
公式サンプルコード
https://aka.ms/ml-notebooks
BERT Large ⾃然⾔語モデル BERT のサンプルコード http://aka.ms/azure-bert
Microsoft Recommenders レコメンデーションサンプルコード http://aka.ms/recommenders
LightGBM LightGBM トップページ https://aka.ms/lightgbm
Natural Language Recipe's ⾃然⾔語サンプルコード https://aka.ms/nlp-recipes
ONNX ONNX トップページ https://aka.ms/onnx
ONNX RT ONNX Runtimeトップページ https://aka.ms/onnx-rt
Kubeflow & MLOps
Kubeflow + Azure ML + DevOps サン
プルコード
https://aka.ms/kubeflow-and-mlops
Azure Open Datasets Azure Open Datasets Webページ https://aka.ms/azure-open-datasets
Azure ML Free Trial Azure フリートライアル https://aka.ms/amlfree
Azure ML Docs Azure Machine Learning ドキュメント https://aka.ms/azureml-ja-docs

• Azure ML Studio: https://ml.azure.com
• Demo Notebook:
https://aka.ms/ignite2019brk3303democode
• Documentation
– Datasets
– Creating data labeling project
– Labeling data
• Contact for ML assisted labeling: ml-assisted-
labeling@microsoft.com

• Microsoft Responsible AI Resource Center
• https://aka.ms/RAIresources
• Azure Machine Learning
• https://azure.microsoft.com/en-
us/services/machine-learning/
• https://docs.microsoft.com/en-
us/azure/machine-learning/concept-
responsible-ml
• OpenDP
• http://opendp.io/
• https://twitter.com/opendp_io
• WhiteNoise
• https://github.com/opendifferentialprivacy
• https://docs.microsoft.com/azure/machine-
learning/concept-differential-privacy
learning/how-to-differential-privacy
• https://aka.ms/WhiteNoiseWhitePaper
• SEAL
• https://github.com/Microsoft/SEAL
learning/how-to-homomorphic-encryption-
seal
• https://aka.ms/SEALonAML

Elastic リソース
• 公式ドキュメント
– https://www.elastic.co/guide/index.html
• Elasticsearch.Net & NESTドキュメント
– https://www.elastic.co/guide/en/elasticsearch/client/net-
api/current/index.html
• Elastic 事例
– https://www.elastic.co/jp/customers/

アプリケーション開発オンデマンドウェビナー特集
https://www.microsoft.com/ja-jp/events/top/apps-innovation-webinars.aspx
• Elastic の Search API を Visual Studio
Code でコーディングする (1) - (3)
• Elastic Cloud で Azure Kubernetes
Serviecs の様々な Log/Metrics/APM を
可視化する
• ASP.NET Core 3.x Web アプリのログを
Elastic Cloud で収集・分析してみよう︕

.NET lab 2021.5
https://dotnetlab.connpass.com/event/208867/
セッションタイトル・概要 : TBD

Google Cloud Day Digital 2021
https://cloudonair.withgoogle.com/events/google-cloud-day-digital-21?talk=d2-gl-27
クラウドネイティブへの移行における Elastic APM の概要

Gpu accelerates aimodeldevelopmentandanalyticsutilizingelasticsearchandazure ai

Recomendados

Recomendados

Mais conteúdo relacionado

Mais procurados

Mais procurados (20)

Semelhante a Gpu accelerates aimodeldevelopmentandanalyticsutilizingelasticsearchandazure ai

Semelhante a Gpu accelerates aimodeldevelopmentandanalyticsutilizingelasticsearchandazure ai (20)

Mais de Shotaro Suzuki

Mais de Shotaro Suzuki (20)

Último

Último (11)

Gpu accelerates aimodeldevelopmentandanalyticsutilizingelasticsearchandazure ai