DOCA SDK Documentation

DOCA Storage Comch to RDMA Zero Copy Application Guide

Introduction

The doca_storage_comch_to_rdma_zero_copy application serves as a bridge between the initiator and a single storage target. Its only role in the data path is to forward IO requests and IO responses between the initiator and the storage target.

System Design

The doca_storage_comch_to_rdma_zero_copy application performs the following functions:

  • Relay of IO requests from the initiator to the storage target

  • Relay of IO responses from the storage target to the initiator

To achieve this, the application first connects to a storage target using TCP connections and then listens for an incoming connection from a single initiator using doca_comch_server.

Architecture

The doca_storage_comch_to_rdma_zero_copy application is split into two functional areas:

  • Control time and shared resources

  • Per thread data path resources

zero_copy - objects.png

The flow of the application similarly executes in two main phases:

  • Control phase

  • Data path phase

Control Phase

The control phase starts by connecting to the storage target, then waiting for a client connection. Once all connections are established, the application waits for the appropriate control commands:

  • Query storage

  • Init storage

  • Start storage

Processing each control command follows a similar pattern of:

  • Relay the command to the storage target

  • Wait for the storage target to respond

  • Do the required post processing and consistency checks on the storage responses

  • Respond to the client
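
The four-step pattern above can be sketched as follows. This is a minimal illustration under stated assumptions, not the application's code: `message`, `control_channel`, and `relay_control_command` are hypothetical names invented for this example, and the real application uses its own control channel classes (see storage_common/control_channel.hpp).

```cpp
#include <cassert>
#include <functional>
#include <optional>
#include <string>

// Hypothetical control message: a type tag plus an opaque payload.
struct message {
    std::string type;
    std::string payload;
};

// Hypothetical channel abstraction used only by this sketch.
struct control_channel {
    std::function<void(message const &)> send;
    std::function<std::optional<message>()> receive; // empty optional models a timeout
};

// Relay one control command: forward, wait, check, respond.
// Returns true when the storage target's response was passed on to the client.
inline bool relay_control_command(message const &request,
                                  std::string const &expected_response_type,
                                  control_channel &storage,
                                  control_channel &client)
{
    assert(storage.send && storage.receive && client.send);

    storage.send(request);                   // 1. relay the command to the storage target
    auto const response = storage.receive(); // 2. wait for the storage target to respond

    // 3. consistency check: a response must arrive and have the expected type
    if (!response || response->type != expected_response_type) {
        client.send(message{"error_response", "storage target did not confirm " + request.type});
        return false;
    }

    client.send(*response);                  // 4. respond to the client
    return true;
}
```

A caller would invoke this once per control command, for example with a request of type `query_storage_request` and the expected response type `query_storage_response`.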

The start storage control command will kick off the data path phase. Data threads will begin executing while the main thread proceeds to wait for the final control messages to complete the application lifecycle:

  • Stop storage

  • Shutdown

Data Path Phase

This phase happens per thread: each thread handles the IO operations requested by the client. Read and write requests are simply forwarded to the storage target; the data threads carry out no processing of their own.

Read Data Flow

The regular read flow consists of the stages detailed in the following subsections.

1. Initiator Request
  1. The initiator sends an I/O request to the zero copy application.

  2. The zero copy application forwards the request verbatim to the storage target

zero_copy - read 01 - IO request.png

2. RDMA Transfer
  1. The storage target performs an RDMA write operation

zero_copy - read 02 - RDMA.png

3. Target Response
  1. The zero copy application receives a response from the storage target

  2. The zero copy application forwards the response verbatim to the initiator

zero_copy - read 03 - IO response.png

Write Data Flow

1. Initiator Request
  1. The initiator sends an I/O request to the zero copy application.

  2. The zero copy application forwards the request verbatim to the storage target

zero_copy - read 01 - IO request.png

2. RDMA Transfer

The storage target performs an RDMA read operation.

zero_copy - write 02 - RDMA.png

3. Target Response
  1. The zero copy application receives a response from the storage target

  2. The zero copy application forwards the response verbatim to the initiator

zero_copy - write 03 - IO Response.png
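
The two flows differ only in the RDMA operation that the storage target performs against the initiator's memory. The sketch below captures that mapping. All names (`io_type`, `rdma_op`, `io_message`, `target_rdma_op`, `forward_verbatim`) are hypothetical and exist only for this illustration; the message layout is an assumption, not the application's wire format.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical tags for the two request kinds issued by the initiator.
enum class io_type { read, write };

// RDMA operation performed by the storage target on initiator memory.
enum class rdma_op { write, read };

// A read request is satisfied by the target writing into the initiator's
// buffer; a write request by the target reading from it.
constexpr rdma_op target_rdma_op(io_type type) noexcept
{
    return type == io_type::read ? rdma_op::write : rdma_op::read;
}

// Hypothetical IO message; the fields are assumptions made for this sketch.
struct io_message {
    io_type type;
    std::uint64_t initiator_address;
    std::uint32_t size;
};

// The relay never inspects or alters the message: it passes through unchanged.
constexpr io_message forward_verbatim(io_message const &msg) noexcept
{
    return msg;
}

static_assert(target_rdma_op(io_type::read) == rdma_op::write, "read requests use RDMA write");
static_assert(target_rdma_op(io_type::write) == rdma_op::read, "write requests use RDMA read");
```

Because the data itself moves directly between the storage target and the initiator's memory, the relay only ever handles the small request and response messages.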

DOCA Libraries

This application leverages the following DOCA libraries:

Compiling the Application

This application is compiled as part of the set of storage applications. For compilation instructions, refer to the DOCA Storage page.

Running the Application

Application Execution

This application can only run within the NVIDIA® BlueField® DPU.

DOCA Storage Comch to RDMA Zero Copy is provided in source form. Therefore, compilation is required before the application can be executed.

  • Application usage instructions:

    Usage: doca_storage_comch_to_rdma_zero_copy [DOCA Flags] [Program Flags]
    
    DOCA Flags:
      -h, --help                        Print a help synopsis
      -v, --version                     Print program version information
      -l, --log-level                   Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
      --sdk-log-level                   Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
      -j, --json <path>                 Parse command line flags from an input json file
    
    Program Flags:
      -d, --device                      Device identifier
      -r, --representor                 Device host side representor identifier
      --cpu                             CPU core to which the process affinity can be set
      --storage-server                  Storage server addresses in <ip_addr>:<port> format
      --command-channel-name            Name of the channel used by the doca_comch_client. Default: "doca_storage_comch"
      --control-timeout                 Time (in seconds) to wait while performing control operations. Default: 5
    
    

    This usage printout can be printed to the command line using the -h (or --help) options: 

    ./doca_storage_comch_to_rdma_zero_copy -h
    

    For additional information, refer to section "DOCA Storage Comch to RDMA Zero Copy Application Guide | Command line Flags".

  • CLI example for running the application on the BlueField:

    ./doca_storage_comch_to_rdma_zero_copy -d 03:00.0 -r 3b:00.0 --storage-server 172.17.0.1:12345 --cpu 0
    

    Both the DOCA Comch device PCIe address (03:00.0) and the DOCA Comch device representor PCIe address (3b:00.0) should match the addresses of the desired PCIe devices.

    Storage target IP address:port tuples should be updated to refer to the running storage target applications.
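
As an illustration of the `<ip_addr>:<port>` format expected by --storage-server, the following sketch splits such a string into its two parts. `server_address` and `parse_server_address` are hypothetical helpers written for this example; the application's own argument handling may differ, and this sketch does not validate the IP address itself.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <optional>
#include <string>

// Hypothetical parsed form of a "<ip_addr>:<port>" argument.
struct server_address {
    std::string ip_addr;
    std::uint16_t port;
};

// Split the text at the last ':' and parse the port as a decimal number.
// Returns an empty optional if either part is missing or the port is invalid.
inline std::optional<server_address> parse_server_address(std::string const &text)
{
    auto const sep = text.rfind(':');
    if (sep == std::string::npos || sep == 0 || sep + 1 == text.size()) {
        return std::nullopt;
    }

    std::uint32_t port = 0;
    for (std::size_t i = sep + 1; i < text.size(); ++i) {
        char const c = text[i];
        if (c < '0' || c > '9') {
            return std::nullopt; // non-digit character in the port
        }
        port = port * 10 + static_cast<std::uint32_t>(c - '0');
        if (port > 65535) {
            return std::nullopt; // port does not fit in 16 bits
        }
    }

    return server_address{text.substr(0, sep), static_cast<std::uint16_t>(port)};
}
```

With the CLI example above, `parse_server_address("172.17.0.1:12345")` yields the address `172.17.0.1` and the port `12345`.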

Command-line Flags

General Flags

Short Flag  Long Flag        Description
-h          --help           Prints a help synopsis and exits
-v          --version        Prints program version information and exits
-l          --log-level      Sets the numeric log level for the application:
                               • 10 – DISABLE
                               • 20 – CRITICAL
                               • 30 – ERROR
                               • 40 – WARNING
                               • 50 – INFO
                               • 60 – DEBUG
                               • 70 – TRACE (requires compilation with TRACE support)
N/A         --sdk-log-level  Sets the SDK numeric log level using the same 10-70 scale as above
N/A         --log-filter     Filters logs from specific modules (comma-separated list)
-j          --json           Parses command-line flags from a specified input JSON file

Refer to DOCA Arg Parser for more information regarding the supported flags and execution modes.

Program Flags

Short Flag  Long Flag               Description
-d          --device                DOCA device identifier. One of:
                                      • PCIe address: 3b:00.0
                                      • InfiniBand name: mlx5_0
                                      • Network interface name: en3f0pf0sf0
                                    This flag is mandatory.
-r          --representor           DOCA Comch device representor PCIe address.
                                    This flag is mandatory.
N/A         --cpu                   Index of the CPU to use. One data path thread is spawned per CPU. Index starts at 0.
                                    The user can specify this argument multiple times to create more threads.
                                    This flag is mandatory.
N/A         --storage-server        IP address and port used to establish the control TCP connection to the target.
                                    This flag is mandatory.
N/A         --command-channel-name  Allows customizing the server name used for this application instance if multiple comch servers exist on the same device.
N/A         --control-timeout       Time, in seconds, to wait while performing control operations

Troubleshooting

Refer to the NVIDIA BlueField Platform Software Troubleshooting Guide for any issue encountered with the compilation, installation, or execution of the DOCA applications.

Application Code Flow

The flow of the application is broken down into key functions / steps:

C
zero_copy_app app{parse_cli_args(argc, argv)};

storage::install_ctrl_c_handler([&app]() {
    app.abort("User requested abort");
});

app.connect_to_storage();
app.wait_for_comch_client_connection();
app.wait_for_and_process_query_storage();
app.wait_for_and_process_init_storage();
app.wait_for_and_process_start_storage();
app.wait_for_and_process_stop_storage();
app.wait_for_and_process_shutdown();
app.display_stats();

Main/Control Thread Flow

  1. zero_copy_app app{parse_cli_args(argc, argv)};
    Parse CLI arguments and use these to create the application instance. Initial resources are also created at this stage:

      • DOCA_LOG_INFO("Open doca_dev: %s", m_cfg.device_id.c_str());
        m_dev = storage::open_device(m_cfg.device_id);
        Open a doca_dev as specified by the CLI argument -d or --device.

      • DOCA_LOG_INFO("Open doca_dev_rep: %s", m_cfg.representor_id.c_str());
        m_dev_rep = storage::open_representor(m_dev, m_cfg.representor_id);
        Open a doca_dev_rep as specified by the CLI argument -r or --representor.

      • m_storage_control_channel = storage::control::make_tcp_client_control_channel(m_cfg.storage_server_address);
        Create the TCP client control channel. Control channel objects provide a unified API so that a TCP client, TCP server, doca_comch_client, and doca_comch_server all have a consistent API. See storage_common/control_channel.hpp for more information about the control channel abstraction.

      • m_client_control_channel = storage::control::make_comch_server_control_channel(m_dev, m_dev_rep, m_cfg.command_channel_name.c_str(), this, new_comch_consumer_callback, expired_comch_consumer_callback);
        Create a Comch server control channel (containing a doca_comch_server instance) using the device, the representor, and the channel name specified by the CLI argument --command-channel-name, or the default value if none was specified.

  2. storage::install_ctrl_c_handler([&app]() {
        app.abort("User requested abort");
    });
    Set a signal handler for Ctrl+C keyboard input so the application can shut down gracefully.

  3. app.connect_to_storage();
    Connect to the TCP server hosted by the storage target as defined by the CLI argument: --storage-server 
    void zero_copy_app::connect_to_storage(void)
    {
        while (!m_storage_control_channel->is_connected()) {
            std::this_thread::sleep_for(std::chrono::milliseconds{100});
            if (m_abort_flag) {
                throw storage::runtime_error{DOCA_ERROR_CONNECTION_ABORTED, "Aborted while connecting to storage"};
            }
        }
    }
    Poll the storage target control channel until either it connects or the user aborts the application.

  4. app.wait_for_comch_client_connection();
    Wait for a doca_comch_client to connect.
    void zero_copy_app::wait_for_comch_client_connection(void)
    {
        while (!m_client_control_channel->is_connected()) {
            std::this_thread::sleep_for(std::chrono::milliseconds{100});
            if (m_abort_flag) {
                throw storage::runtime_error{DOCA_ERROR_CONNECTION_ABORTED, "Aborted while connecting to client"};
            }
        }
    }
    Poll the Comch server control channel until a doca_comch_client has connected or the user aborts the application. Any further Comch client that attempts to connect to the server is automatically rejected by the control channel, which is designed for a 1:1 relationship between clients and servers. A sleep is placed in this loop because it may take the user / operator a few seconds to start the client, so there is nothing to gain from polling any faster.

  5. app.wait_for_and_process_query_storage();
    Wait for the initiator to send a query_storage_request control message and then perform the required actions to fulfill the request:
      • Forward the query storage request to the storage target.
      • Wait for the storage target to respond.
      • Send a response to the initiator: a query_storage_response message upon success, or an error_response message if anything failed.

  6. app.wait_for_and_process_init_storage();
    Wait for the initiator to send an init_storage_request control message and then perform the required actions to fulfill the request:

      • Use the init_storage_payload data to:
          • Set the core count (m_core_count) to the number of cores requested by the initiator (the number of --cpu arguments provided to the initiator), or fail if this is more than the number of --cpu arguments provided to the service.
          • Set the number of transactions per core (m_transaction_count) to double the number of transactions requested by the initiator. Doubling allows for batched task submission and avoids a race condition: the initiator can see the response to a transaction and try to re-submit it before the service has received the associated Comch producer event callback. When that happens, the initiator keeps retrying the send, which degrades performance until the service catches up and re-submits the consumer task. This should be uncommon, but allocating double the transaction count ensures that, even if every transaction on the initiator hits this issue, a full second set of transactions is ready on the service side to receive the tasks and avoid any contention. A user can experiment with reducing this value to save memory if desired.

      • Import then re-export the initiator IO blocks mmap. This allows the storage target to read / write directly to / from the initiator memory.

      • Send the init storage request to the storage target using:
          • The service transaction count (double the initiator value).
          • The initiator core count.
          • The re-exported IO blocks mmap.

      • Send a response to the initiator: an init_storage_response message upon success, or an error_response message if anything failed.

      • Perform the first stages of the worker thread initialization. These steps are carried out for each thread, but only one thread performs the steps at any time. This simplifies the sending and receiving of control messages; the user could modify this flow to execute in parallel if desired.
          • Create a thread bound to the Nth CPU provided to the service via the --cpu CLI arguments.
          • Initialize the thread context (asynchronously):
              m_workers[ii].execute_control_command(
                  worker_create_objects_control_command{m_dev, m_client_control_channel->get_comch_connection(), m_transaction_count});
          • Create the RDMA connection for the io_data role (asynchronously):
              connect_rdma(ii, storage::control::rdma_connection_role::io_data, cid);
            The thread connects to the storage target and creates an RDMA context which is idle from the service's point of view, but is used by the storage target to perform RDMA read / write operations. See the (3.4.0) DOCA Storage Target RDMA Application Guide for an explanation of why there are two RDMA contexts per thread.
          • Create the RDMA connection for the io_control role (asynchronously):
              connect_rdma(ii, storage::control::rdma_connection_role::io_control, cid);
            The thread connects to the storage target and creates an RDMA context which is used to exchange IO requests and responses using RDMA send/recv tasks. See the (3.4.0) DOCA Storage Target RDMA Application Guide for an explanation of why there are two RDMA contexts per thread.

  7. app.wait_for_and_process_start_storage();
    Wait for the initiator to send a start_storage_request control message and then perform the required actions to fulfill the request:

      • Forward the start storage request to the storage target.
      • Wait for the storage target to respond.
      • Signal all worker threads to begin data path operation.
      • Send a response to the initiator: a start_storage_response message upon success, or an error_response message if anything failed.

  8. Data path execution now takes place until either the user aborts the program or a stop message is received.

  9. app.wait_for_and_process_stop_storage();
    Wait for the initiator to send a stop_storage_request control message and then perform the required actions to fulfill the request:

      • Forward the stop storage request to the storage target.
      • Wait for the storage target to respond.
      • Signal all worker threads to stop data path operation.
      • Collect run time stats.
      • Send a response to the initiator: a stop_storage_response message upon success, or an error_response message if anything failed.

  10. app.wait_for_and_process_shutdown();
    Wait for the initiator to send a shutdown_request control message and then perform the required actions to fulfill the request:

      • Forward the shutdown request to the storage target.
      • Wait for the storage target to respond.
      • Destroy worker thread objects.
      • Send a response to the initiator: a shutdown_response message upon success, or an error_response message if anything failed.

  11. app.display_stats();
    Display runtime statistics.

  12. Application destructor is triggered:

      • Destroy control channels.
      • Destroy initiator IO blocks doca_mmap.
      • Close doca_dev_rep.
      • Close doca_dev.

  13. Program exits.
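
The sizing rules applied while processing the init storage request (step 6) can be sketched as follows. This is an illustration under stated assumptions: `init_request`, `service_config`, and `make_service_config` are hypothetical names, the struct layouts are not the real init_storage_payload, and the overflow guard is a defensive check added for this sketch rather than documented application behavior.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>
#include <stdexcept>

// Values requested by the initiator (illustrative layout only).
struct init_request {
    std::uint32_t core_count;        // number of --cpu arguments given to the initiator
    std::uint32_t transaction_count; // transactions per core requested by the initiator
};

// Values used by the service and forwarded to the storage target.
struct service_config {
    std::uint32_t core_count;
    std::uint32_t transaction_count;
};

// Derive the service configuration from the initiator's request.
// available_cpu_count is the number of --cpu arguments given to the service.
inline service_config make_service_config(init_request const &request, std::uint32_t available_cpu_count)
{
    // Fail if the initiator asks for more cores than the service was given.
    if (request.core_count > available_cpu_count) {
        throw std::invalid_argument{"initiator requested more cores than the service has"};
    }

    // Defensive check for this sketch: doubling must not overflow.
    if (request.transaction_count > std::numeric_limits<std::uint32_t>::max() / 2) {
        throw std::invalid_argument{"transaction count too large to double"};
    }

    // Double the transaction count so that a second full set of transactions
    // is available while the first set is still being re-submitted.
    return service_config{request.core_count, request.transaction_count * 2};
}
```

For example, an initiator requesting 2 cores and 64 transactions per core against a service started with four --cpu arguments results in 2 cores and 128 transactions per core on the service side.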

Worker/Data Path Thread Flow

The worker thread procedure executes in two phases: a control / configuration phase, followed by a data path phase where read, write, and recovery operations take place.

Worker Init Process

C
void zero_copy_app_worker::thread_proc(zero_copy_app_worker *self, uint16_t core_idx) noexcept

The worker starts by executing a loop of:

  1. Lock mutex.

  2. If the message pointer is not null:

      • Process the configuration message.
      • Set the operation result.

  3. Unlock the mutex.

  4. Yield.
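
A minimal sketch of this loop, and of the control thread side that feeds it, is shown below. `control_slot`, `worker_control_loop`, and `post_control_command` are hypothetical names created for this example. The sketch also uses an atomic flag to leave the loop, whereas in the application starting the data path is itself one of the configuration operations listed next.

```cpp
#include <atomic>
#include <cassert>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical slot shared between the control thread and one worker.
struct control_slot {
    std::mutex lock;
    std::function<bool()> *pending = nullptr; // command to run; null when idle
    bool result = false;                      // result of the last command
    bool done = false;                        // true once `result` is valid
};

// Worker side: process configuration commands until the data path should start.
inline void worker_control_loop(control_slot &slot, std::atomic<bool> const &start_data_path)
{
    while (!start_data_path.load()) {
        {
            std::lock_guard<std::mutex> guard{slot.lock}; // 1. lock the mutex
            if (slot.pending != nullptr) {                // 2. message pointer not null?
                slot.result = (*slot.pending)();          //    process the message
                slot.done = true;                         //    set the operation result
                slot.pending = nullptr;
            }
        }                                                 // 3. unlock the mutex
        std::this_thread::yield();                        // 4. yield
    }
}

// Control side: post one command and wait until the worker has processed it.
inline bool post_control_command(control_slot &slot, std::function<bool()> command)
{
    {
        std::lock_guard<std::mutex> guard{slot.lock};
        assert(slot.pending == nullptr); // only one command may be outstanding
        slot.done = false;
        slot.pending = &command;
    }
    for (;;) {
        {
            std::lock_guard<std::mutex> guard{slot.lock};
            if (slot.done) {
                return slot.result;
            }
        }
        std::this_thread::yield();
    }
}
```

Yielding between polls keeps the worker responsive without a condition variable, which is a reasonable trade-off for a short configuration phase that runs once before the data path starts.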

The following configuration operations can be performed by the worker thread:

  1. void zero_copy_app_worker::create_worker_objects(worker_create_objects_control_command const &cmd)
    Create general worker objects:

      • Create IO message memory.
      • Create IO message mmap (to allow the messages to be accessed by DOCA Comch and DOCA RDMA).
      • Allocate doca buffer inventory.
      • Create doca_pe to drive the DOCA contexts (doca_rdma, doca_comch_consumer, doca_comch_producer).
      • Create doca_comch_consumer and doca_comch_producer contexts.
      • Initialize and start contexts.

  2. void zero_copy_app_worker::export_local_rdma_connection_blob(worker_export_local_rdma_connection_command &cmd)
    Export RDMA context connection binary blob.

  3. void zero_copy_app_worker::import_remote_rdma_connection_blob(worker_import_local_rdma_connection_command const &cmd)
    Import remote RDMA context connection binary blob.

  4. void zero_copy_app_worker::are_contexts_ready(worker_are_contexts_ready_control_command &cmd) const noexcept
    Poll all contexts to check they are ready to perform data path operations.

  5. void zero_copy_app_worker::prepare_tasks(worker_prepare_tasks_control_command const &cmd)
    Prepare transaction contexts by:

      • Allocating doca_buf objects.
      • Allocating doca_task objects.
      • Setting task user data.

  6. void zero_copy_app_worker::start_data_path(void)
    Break out of the wait for configuration event loop and start the data path loop.

After the configuration phase, the mutex is not used again.

After breaking out of the initial configuration loop, the thread submits receive tasks (comch consumer tasks, and RDMA recv tasks) then enters the data path function: run_data_path_ops.

Worker Data Path Process

C
void zero_copy_app_worker::run_data_path_ops(zero_copy_app_worker::hot_data &hot_data)
{
	DOCA_LOG_INFO("Core: %u running", hot_data.core_idx);

	while (hot_data.run_flag) {
		doca_pe_progress(hot_data.pe) ? ++(hot_data.pe_hit_count) : ++(hot_data.pe_miss_count);
	}

	while (hot_data.error_flag == false && hot_data.in_flight_transaction_count != 0) {
		doca_pe_progress(hot_data.pe) ? ++(hot_data.pe_hit_count) : ++(hot_data.pe_miss_count);
	}

	DOCA_LOG_INFO("Core: %u complete", hot_data.core_idx);
}

During the data path phase the thread simply polls the doca_pe as quickly as possible to check for a task completion from one of the thread's DOCA contexts (doca_rdma, doca_comch_consumer, doca_comch_producer). The interesting work is done in the callbacks of these tasks. The flow always starts with a consumer task completion, which is the reception of an IO message from the initiator. For the zero copy use case, the callbacks simply forward the IO requests and responses:

  • Comch consumer → RDMA send

  • RDMA recv → Comch producer
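
The polling loop and the two forwarding callbacks can be modeled without DOCA as shown below. This is a sketch under stated assumptions: `fake_progress_engine` stands in for doca_pe, the callback names are hypothetical, and the way `in_flight_transaction_count` is incremented and decremented is an assumption made for illustration. In the real application the run flag is changed by another thread, so a production version would need appropriate synchronization.

```cpp
#include <cassert>
#include <cstdint>
#include <deque>
#include <functional>
#include <utility>

// Stand-in for a progress engine: a queue of pending completion callbacks.
struct fake_progress_engine {
    std::deque<std::function<void()>> completions;

    // Process at most one completion; return true if one was processed.
    bool progress()
    {
        if (completions.empty()) {
            return false;
        }
        auto callback = std::move(completions.front());
        completions.pop_front();
        callback(); // the callback may queue further completions
        return true;
    }
};

struct hot_data {
    fake_progress_engine pe;
    bool run_flag = true;
    bool error_flag = false;
    std::uint32_t in_flight_transaction_count = 0;
    std::uint64_t pe_hit_count = 0;
    std::uint64_t pe_miss_count = 0;
};

// RDMA recv completion: the response would be handed to the Comch producer.
inline void on_response_from_target(hot_data &hd)
{
    assert(hd.in_flight_transaction_count > 0);
    --hd.in_flight_transaction_count;
}

// Comch consumer completion: the request would be handed to an RDMA send task.
// The target's response is modeled as a completion queued for later.
inline void on_request_from_initiator(hot_data &hd)
{
    ++hd.in_flight_transaction_count;
    hd.pe.completions.push_back([&hd] { on_response_from_target(hd); });
}

// Poll one completion and record whether any work was found.
inline void poll_once(hot_data &hd)
{
    if (hd.pe.progress()) {
        ++hd.pe_hit_count;
    } else {
        ++hd.pe_miss_count;
    }
}

// Same two-loop shape as run_data_path_ops: poll while running, then drain.
inline void run_data_path_sketch(hot_data &hd)
{
    while (hd.run_flag) {
        poll_once(hd);
    }
    while (!hd.error_flag && hd.in_flight_transaction_count != 0) {
        poll_once(hd);
    }
}
```

The second loop matters because a stop can arrive while requests are still outstanding at the storage target; draining lets those responses reach the initiator before the thread exits.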


References

  • /opt/mellanox/doca/applications/storage/
