Version: sdf-beta1.1

Deployment/Worker Management

There are two modes to run a dataflow in SDF: ephemeral and worker. In ephemeral mode, the CLI runs the dataflow in the same process as the CLI, and the dataflow is terminated when you exit the CLI. This is useful for developing dataflows as well as testing packages.

A worker is a process that runs dataflows continuously. Unlike the ephemeral CLI, a worker is a long-running process: when you exit the CLI, the dataflow does not terminate. It is primarily used to run dataflows in production environments. It can also be used in development to run a dataflow when there is no need to test the package.

A worker can run anywhere as long as it can connect to the Fluvio cluster: in a data center, in the cloud, or on an edge device. There is no limit on the number of workers that can be run, and each worker can run multiple dataflows. It is recommended to run a single worker per machine.

SDF can target a different worker by selecting the worker profile. The worker profile is associated with the Fluvio profile, so when you switch the Fluvio profile, SDF automatically switches the worker profile. Once you have selected a worker, the same worker is used for all dataflow operations until you select a different worker.
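For example, switching the Fluvio profile also changes which worker subsequent SDF commands target (the profile name below is a hypothetical illustration):

$> fluvio profile switch my-cloud-cluster
$> sdf worker list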

By default, SDF runs the dataflow in the worker. If you are using InfinyOn Cloud, the worker is provisioned and automatically registered in the profile.

Using Host Worker for Local Cluster

If you are using a local cluster, you need to either create a Host worker or register a Remote worker. The easiest option is to create a Host worker, which runs on the same machine as the CLI.

To create a host worker, you can use the following command:

$> sdf worker create <name>

This will spawn the worker process in the background. It will run until you shut down the worker or the machine is rebooted. The name can be anything as long as it is unique on your machine, since profiles are not shared across different machines.

Once you have created the worker, you can list it:

$> sdf worker create main
Worker `main` created for cluster: `local`
$> sdf worker list
    NAME  TYPE  CLUSTER  WORKER ID                            
 *  main  Host  local    7fd7eda3-2738-41ef-8edc-9f04e500b919

The * indicates the currently selected worker. The worker id is an internal unique identifier for the current Fluvio cluster. Unless specified, it is generated by the system.

SDF only supports running a single Host worker per machine, since a single worker can support many dataflows. If you try to create another worker, you will get an error message.

$> sdf worker create main2
Starting worker: main2
There is already a host worker with pid 20686 running.  Please terminate it first

Shutting down a worker terminates all running dataflows and the worker process.

$> sdf worker shutdown main
Shutting down pid: 20688
Shutting down pid: 20686
Host worker: main has been shutdown

Even though the host worker is shut down and removed from the profile, the dataflow files and state are still persisted. You can restart the worker and the dataflows will resume.

For example, if you have the dataflows fraud-detector and car-processor running in the worker and you shut down the worker, the dataflow processes will be terminated. But you can resume them by recreating the Host worker.

$> sdf worker create main

The local worker stores the dataflow state in the local file system, under ~/.sdf/<cluster>/worker/<dataflow>. So for the local cluster, files will be stored in ~/.sdf/local/worker/dataflows.
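To see what has been persisted for the local cluster, you can inspect that directory (a minimal illustration; the exact layout may differ by release):

$> ls ~/.sdf/local/worker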

If you have deleted the Fluvio cluster, the worker needs to be manually shut down and created again. This limitation will be removed in a future release.

Remote Worker

A remote worker is a worker that runs on a different machine from the CLI. It is typically set up by a DevOps team for production environments.

The typical lifecycle for using a remote worker is as follows, with a command sketch after the list:

  1. Start the remote worker on the server.
  2. Register the worker under a name on your machine.
  3. Run the dataflow in the remote worker.
  4. Unregister the worker when it is no longer needed. This doesn't shut down the worker; it only removes it from the profile.
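A minimal sketch of this lifecycle using the commands described on this page (the worker id, worker name, and dataflow are illustrative placeholders):

# on the server
$> sdf worker launch --worker-id edge-worker-1

# on your machine
$> sdf worker register edge1 edge-worker-1
$> sdf worker switch edge1
$> sdf run

# later, when the worker is no longer needed
$> sdf worker unregister edge1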

Note that there are many ways to manage the remote worker. You can use Kubernetes, Docker, systemd, Terraform, Ansible, or any other tool that can manage the server process and ensure it restarts when the server is rebooted.
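For instance, a minimal systemd sketch (the unit name, binary path, base directory, and worker id are assumptions, and it assumes sdf worker launch stays in the foreground; adjust for your environment):

$> cat <<'EOF' | sudo tee /etc/systemd/system/sdf-worker.service
[Unit]
Description=SDF remote worker
After=network-online.target

[Service]
# assumes the sdf binary is installed at this path and launch runs in the foreground
ExecStart=/usr/local/bin/sdf worker launch --base-dir /sdf --worker-id edge-worker-1
Restart=always

[Install]
WantedBy=multi-user.target
EOF
$> sudo systemctl daemon-reload
$> sudo systemctl enable --now sdf-worker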

InfinyOn Cloud is a good example of a remote worker. When you create a cluster in InfinyOn Cloud, it will automatically provision the worker for you. It will also sync the profile when the cluster is created.

Once remote workers are running, you can discover them using the sdf worker list -all command.

$> sdf worker list -all
    NAME  TYPE    CLUSTER  WORKER ID                            
 *  main  Host    local    7fd7eda3-2738-41ef-8edc-9f04e500b919
    N/A   Remote  local    edg2-worker-id

This shows a host worker on your local machine and a remote worker with id edg2-worker-id running somewhere. To register the remote worker, you can use the register command.

$> sdf worker register <name> <worker-id>

For example, registering the remote worker with the name edge2 and worker id edg2-worker-id:

$> sdf worker register edge2 edg2-worker-id
Worker `edge2` is registered for cluster: `local`

You can switch among workers using the switch command.

$> sdf worker switch <worker_profile>

To unregister the worker when you are done with it and no longer need it, you can use the unregister command:

$> sdf worker unregister <name>

Listing and switching workers

To list all known workers, you can use the list command:

$> sdf worker list
    NAME  TYPE    CLUSTER  WORKER ID                            
 *  main  Host    local    7fd7eda3-2738-41ef-8edc-9f04e500b919
    edge2 Remote  local    edg2-worker-id

To switch the worker, you can use the switch command:

$> sdf worker switch <worker-name>

This assumes the worker-name has already been created or registered.
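For example, to switch to the edge2 worker registered above and confirm the selection (the * in the listing should then move to edge2):

$> sdf worker switch edge2
$> sdf worker list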

Managing dataflows in the worker

When you are running a dataflow in the worker, the prompt will indicate the name of the worker:

$> sdf run
[main] >> show state

Listing and selecting dataflows

To list all dataflows running in the worker, you can use the dataflow list command:

$> sdf dataflow list
[jolly-pond]>> show dataflow 
    Dataflow                 Status  
    wordcount-window-simple  running 
 *  user-job-map             running 
[jolly-pond]>> 

Other commands, like show state, require an active dataflow. If there is no active dataflow, an error message is shown.

[jolly-pond]>> show state 
No dataflow selected.  Run `select dataflow`
[jolly-pond]>> 

To select the dataflow, you can use the select dataflow command:

[jolly-pond]>> select dataflow wordcount-window-simple
dataflow switched to: wordcount-window-simple

Deleting a dataflow

To delete a dataflow, you can use the dataflow delete command:

$> sdf dataflow delete user-job-map
[jolly-pond]>> show dataflow 
    Dataflow                 Status  
    wordcount-window-simple  running

Note that since user-job-map is deleted, it is no longer listed in the dataflow list.

Using worker in InfinyOn Cloud

With InfinyOn Cloud, there is no need to manage the worker. It provisions the worker for you and syncs the profile when the cluster is created.

For example, creating a cloud cluster will automatically provision the worker and create an SDF worker profile.

$> fluvio cloud login --use-oauth2
$> fluvio cloud cluster create
Creating cluster...
Done!
Downloading cluster config
Registered sdf worker: jellyfish
Switched to new profile: jellyfish

You can unregister the cloud worker like any other remote worker.

Advanced: Starting remote worker

To start a worker as a remote worker, you can use the launch command:

$> sdf worker launch --base-dir <dir> --worker-id <worker-id>

where base-dir and worker-id are optional parameters. If you don't specify base-dir, the default directory /sdf is used. If you don't specify worker-id, a unique id is generated for you.

This command is typically used by the DevOps team to start the worker on the server.