Using MsPASS With EarthScope GeoLab

The MsPASS GeoLab image is built from the same GeoLab/Pangeo-style base image contract as the official EarthScope image, with MsPASS installed into the /srv/conda/envs/notebook environment. Non-JupyterHub commands pass through the entrypoint directly so EarthScope Dask Gateway can still start scheduler and worker pods from the image.

Live GeoLab network tests show that Gateway worker pods cannot reach arbitrary ports on the notebook pod, including a notebook-local MongoDB service. For that reason, the default MsPASS GeoLab startup path uses notebook-local MongoDB and notebook-local Dask in the same pod. Dask Gateway remains available for manual use, pure Dask workflows, and future DB-backed workflows where MongoDB is reachable from Gateway workers.

Default notebook runtime

When the JupyterHub single-user server starts, the GeoLab image starts local MongoDB plus one local Dask scheduler and worker. The scheduler binds to 127.0.0.1 on DASK_SCHEDULER_PORT, and notebook kernels start with these environment defaults:

HOME=/home/jovyan
NB_HOME=/home/jovyan
MSPASS_WORK_DIR=/home/jovyan
MSPASS_WORKDIR=/home/jovyan
MSPASS_DB_DIR=/home/jovyan/db
MSPASS_LOG_DIR=/home/jovyan/logs
MSPASS_WORKER_DIR=/home/jovyan/work
MONGO_DATA_DIR=/home/jovyan/db/data
MONGO_LOG=/home/jovyan/logs/mongo_log
MSPASS_DB_ADDRESS=127.0.0.1
MSPASS_SCHEDULER=dask
MSPASS_SCHEDULER_ADDRESS=127.0.0.1

This means the usual notebook client constructor attaches to the pre-started local scheduler instead of creating a hidden LocalCluster:

from mspasspy.client import Client

mspass_client = Client(scheduler="dask")
print(mspass_client.get_database_client().admin.command({"ping": 1}))
print(mspass_client.get_scheduler().scheduler_info())

The MongoDB worker plugin should also be able to reach the local MongoDB service from the local Dask worker:

scheduler = mspass_client.get_scheduler()

def check_worker_db():
    from dask.distributed import get_worker

    worker = get_worker()
    dbclient = worker.data["dbclient"]
    return dbclient.admin.command({"ping": 1})

print(scheduler.submit(check_worker_db).result())

Resetting MongoDB data is always explicit. Set MSPASS_RESET_MONGO_DB=true only when you intentionally want MONGO_DATA_DIR removed before startup. The startup script does not delete the database directory by default.
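
Because a reset removes everything under MONGO_DATA_DIR, it can be worth checking what is stored there before setting the flag. The following is a minimal sketch and not part of MsPASS or the startup script: directory_size is a hypothetical helper, and the fallback path only repeats the default listed earlier.

```python
import os


def directory_size(path):
    """Return (file_count, total_bytes) for regular files under path.

    Hypothetical helper for illustration; a missing directory yields (0, 0).
    """
    file_count = 0
    total_bytes = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            # Skip symlinks so only data stored in this tree is counted.
            if os.path.islink(full):
                continue
            try:
                total_bytes += os.path.getsize(full)
            except OSError:
                # A running mongod may remove files while we are walking.
                continue
            file_count += 1
    return file_count, total_bytes


if __name__ == "__main__":
    # Fallback value is the documented default, used only if the variable is unset.
    data_dir = os.environ.get("MONGO_DATA_DIR", "/home/jovyan/db/data")
    count, size = directory_size(data_dir)
    print(f"{data_dir}: {count} files, {size / 1e6:.1f} MB")
```

A non-zero result means there is existing database content that a reset would delete.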

Disabling local Dask

Local in-pod Dask can be disabled before the JupyterHub single-user server starts:

MSPASS_ENABLE_LOCAL_DASK=false
MSPASS_SCHEDULER=none

With those settings the startup script still starts notebook-local MongoDB but does not start a local Dask scheduler or worker. Either Client(scheduler="none") or MSPASS_SCHEDULER=none leaves the MsPASS scheduler unset. Setting MSPASS_ENABLE_LOCAL_DASK=false on its own also changes the default scheduler setting from dask to none, unless an external scheduler address is configured.
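
The interaction between these settings can be summarized as a small decision function. This is only a model of the rules described above, not the startup script's actual code. The function name effective_scheduler is hypothetical, and several details are assumptions: that an explicit MSPASS_SCHEDULER always wins, that values are compared case-insensitively, and that an "external scheduler address" means a MSPASS_SCHEDULER_ADDRESS other than a loopback address.

```python
# Assumption: these count as "local" (non-external) scheduler addresses.
LOOPBACK = {"", "127.0.0.1", "localhost"}


def effective_scheduler(env):
    """Model the default scheduler choice described in the text.

    env is any mapping of environment variable names to values.
    Returns "dask" or "none" (or whatever MSPASS_SCHEDULER names explicitly).
    """
    explicit = env.get("MSPASS_SCHEDULER", "").strip().lower()
    if explicit:
        # Assumption: an explicit setting overrides every default.
        return explicit

    local_dask = env.get("MSPASS_ENABLE_LOCAL_DASK", "true").strip().lower()
    if local_dask != "false":
        # Local in-pod Dask is enabled, so the default stays "dask".
        return "dask"

    address = env.get("MSPASS_SCHEDULER_ADDRESS", "").strip().lower()
    if address not in LOOPBACK:
        # Local Dask is off, but an external scheduler is configured.
        return "dask"
    return "none"
```

Under these assumptions, passing `{"MSPASS_ENABLE_LOCAL_DASK": "false"}` yields "none", while adding a non-loopback MSPASS_SCHEDULER_ADDRESS yields "dask".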

Dask Gateway

EarthScope GeoLab provides distributed Dask through Dask Gateway. Installing dask-gateway manually inside a running notebook pod is not sufficient because Gateway-created scheduler and worker pods are started from the image, not from the notebook pod’s mutated runtime filesystem. The MsPASS GeoLab image keeps dask, distributed, dask_gateway, and mspasspy importable in notebook, scheduler, and worker pods.

A Gateway cluster starts with no workers until it is scaled or configured for adaptive scaling:

from dask_gateway import Gateway

gateway = Gateway()
cluster = gateway.new_cluster()
cluster.scale(1)

dask_client = cluster.get_client()
dask_client.wait_for_workers(1, timeout="120s")
print(dask_client.scheduler_info())

To verify that a Gateway worker was created from an image containing the MsPASS runtime packages:

def check_worker():
    import os
    import socket
    import sys

    import dask
    import dask_gateway
    import distributed
    import mspasspy

    return {
        "host": socket.gethostname(),
        "python": sys.executable,
        "home": os.environ.get("HOME"),
        "cwd": os.getcwd(),
        "dask": dask.__version__,
        "distributed": distributed.__version__,
        "dask_gateway": dask_gateway.__version__,
        "mspasspy": mspasspy.__file__,
    }

print(dask_client.submit(check_worker).result())

MsPASS can accept an externally created Dask client, including one returned by Dask Gateway:

from dask_gateway import Gateway
from mspasspy.client import Client

gateway = Gateway()
cluster = gateway.new_cluster()
cluster.scale(2)

dask_client = cluster.get_client()
dask_client.wait_for_workers(2, timeout="120s")

mspass_client = Client(scheduler="dask", dask_client=dask_client)

Keep the cluster object alive while using mspass_client. MsPASS uses the provided Dask client but does not own or shut down the Gateway cluster.
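
Since MsPASS does not shut the cluster down, the notebook code has to. One way to tie the cluster's lifetime to a block of work is a context manager, sketched below. The name gateway_workers is hypothetical and not part of MsPASS or dask-gateway; the sketch only relies on the new_cluster, scale, get_client, wait_for_workers, close, and shutdown calls already used or implied in this section.

```python
from contextlib import contextmanager


@contextmanager
def gateway_workers(gateway, n_workers, timeout="120s"):
    """Yield a Dask client for a new Gateway cluster, then clean up.

    gateway is expected to be a dask_gateway.Gateway instance.
    """
    cluster = gateway.new_cluster()
    client = None
    try:
        cluster.scale(n_workers)
        client = cluster.get_client()
        client.wait_for_workers(n_workers, timeout=timeout)
        # The cluster object stays referenced for the whole with-block.
        yield client
    finally:
        # Runs on normal exit and on errors, so workers are not left behind.
        if client is not None:
            client.close()
        cluster.shutdown()


# Intended use in a notebook (requires a reachable Dask Gateway):
#
#   from dask_gateway import Gateway
#   from mspasspy.client import Client
#
#   with gateway_workers(Gateway(), 2) as dask_client:
#       mspass_client = Client(scheduler="dask", dask_client=dask_client)
#       ...  # run the workflow here
```

The cleanup in the finally clause releases Gateway resources even when the workflow raises, at the cost of making the cluster unavailable once the block ends.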

DB-backed MsPASS workflows through Gateway need a MongoDB endpoint that Gateway workers can reach. Passing a Gateway client to Client still registers the MongoDB worker plugin, so notebook-local 127.0.0.1 MongoDB is not a valid Gateway worker database address.
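
Whether a candidate MongoDB endpoint is reachable from Gateway workers can be probed with a plain TCP connection attempt run on a worker. This is a minimal sketch: can_connect is a hypothetical helper, and the hostname in the usage comment is a placeholder rather than a real GeoLab service. A successful connection shows only that the port is reachable, not that MongoDB accepts the client.

```python
def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds from this process."""
    # Imported inside the function so it also works when shipped to a worker.
    import socket

    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name resolution failures.
        return False


# Run on a Gateway worker rather than in the notebook pod
# (hostname below is a placeholder):
#
#   future = dask_client.submit(can_connect, "mongodb.example.internal", 27017)
#   print(future.result())
```

Calling can_connect("127.0.0.1", 27017) through a Gateway client tests the worker's own loopback interface, not the notebook pod, which is why the notebook-local address cannot serve Gateway workers.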

Rebuilt image smoke checks

In a fresh GeoLab server using the rebuilt MsPASS GeoLab image, verify the workspace and runtime environment:

whoami
pwd
echo $HOME
echo $NB_HOME
echo $MSPASS_WORK_DIR
echo $MSPASS_WORKDIR
echo $MSPASS_DB_ADDRESS
echo $MSPASS_SCHEDULER
echo $MSPASS_SCHEDULER_ADDRESS
echo $MSPASS_DB_DIR
echo $MONGO_DATA_DIR
echo $MSPASS_LOG_DIR
echo $MSPASS_WORKER_DIR
mongosh --host 127.0.0.1 --port 27017 --eval 'db.adminCommand({ping: 1})'
tail -50 "$MONGO_LOG"

Expected values are /home/jovyan for the workspace and home variables, 127.0.0.1 for local MongoDB and local Dask addresses, and runtime directories under /home/jovyan. The expected notebook processes are mongod, dask scheduler, dask worker, jupyterhub-singleuser, and notebook kernel processes.
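
The same comparison can be made from a notebook cell instead of reading the echo output by eye. The sketch below is illustrative: env_mismatches is a hypothetical helper, and the EXPECTED table simply repeats the defaults listed in the Default notebook runtime section.

```python
import os

# Documented defaults for the notebook kernel environment.
EXPECTED = {
    "HOME": "/home/jovyan",
    "NB_HOME": "/home/jovyan",
    "MSPASS_WORK_DIR": "/home/jovyan",
    "MSPASS_WORKDIR": "/home/jovyan",
    "MSPASS_DB_DIR": "/home/jovyan/db",
    "MSPASS_LOG_DIR": "/home/jovyan/logs",
    "MSPASS_WORKER_DIR": "/home/jovyan/work",
    "MONGO_DATA_DIR": "/home/jovyan/db/data",
    "MONGO_LOG": "/home/jovyan/logs/mongo_log",
    "MSPASS_DB_ADDRESS": "127.0.0.1",
    "MSPASS_SCHEDULER": "dask",
    "MSPASS_SCHEDULER_ADDRESS": "127.0.0.1",
}


def env_mismatches(env, expected=EXPECTED):
    """Return {name: (expected, actual)} for variables that differ or are unset."""
    return {
        name: (want, env.get(name))
        for name, want in expected.items()
        if env.get(name) != want
    }


mismatches = env_mismatches(os.environ)
for name, (want, got) in sorted(mismatches.items()):
    print(f"{name}: expected {want!r}, found {got!r}")
if not mismatches:
    print("All checked variables match the documented defaults.")
```

A mismatch is not necessarily an error; for example, MSPASS_SCHEDULER is expected to differ when local Dask has been disabled deliberately.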

To verify installed package versions:

import dask
import distributed
import dask_gateway
import mspasspy

print(dask.__version__)
print(distributed.__version__)
print(dask_gateway.__version__)
print(mspasspy.__file__)