DOCA SDK Documentation

DOCA Storage Zero Copy Initiator Comch Application Guide


Introduction

DOCA Storage Zero Copy Initiator Comch application (initiator_comch) plays the following roles:

  1. Demonstrates how to utilize the DOCA Comch API (client-server) to communicate configuration between the x86 host and BlueField.

  2. Demonstrates how to utilize the DOCA Comch API (producer-consumer) and hardware acceleration to offload the efficient transfer of messages between the x86 host and BlueField in the data path.

  3. Provides a benchmark for the performance of such an application/use case.

System Design

DOCA Storage Zero Copy Initiator Comch creates an area of local memory and a set of message buffers, which it uses to instruct the doca_storage_zero_copy_comch_to_rdma (comch_to_rdma) application to perform read and write operations to and from that local memory region. The initiator_comch application is responsible for providing the memory region details and access details to comch_to_rdma. It has no knowledge of the specifics of the doca_storage_zero_copy_target_rdma (target_rdma) application and is not directly involved in the RDMA operations that effect the transfer of data to and from target_rdma.

Data path objects are created per thread. To maintain simplicity, a single memory region is used, and each thread and its IO messages refer to a different segment of that single exported memory region. Ensuring that each thread uses a separate segment of the exported memory removes the complexity of multi-threaded access to the memory. If desired, users may expand the application to support multiple unique memory regions, one per thread.
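
The per-thread segmentation can be sketched as simple offset arithmetic. The helper names below (thread_segment_offset, thread_buffer) are hypothetical and are not part of the application; the sketch only assumes that each thread owns per_cpu_buffer_count contiguous buffers of buffer_size bytes, which matches the pointer expression passed to create_tasks in the code flow later in this guide.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: byte offset of a thread's segment inside the single
// exported memory region, assuming segments are laid out back to back.
inline std::size_t thread_segment_offset(std::uint32_t thread_index,
					 std::uint32_t per_cpu_buffer_count,
					 std::uint32_t buffer_size)
{
	return static_cast<std::size_t>(thread_index) * per_cpu_buffer_count * buffer_size;
}

// Hypothetical helper: address of one buffer inside a thread's segment.
inline char *thread_buffer(char *region_base,
			   std::uint32_t thread_index,
			   std::uint32_t buffer_index,
			   std::uint32_t per_cpu_buffer_count,
			   std::uint32_t buffer_size)
{
	// A thread must never address a buffer outside its own segment.
	assert(buffer_index < per_cpu_buffer_count);
	return region_base + thread_segment_offset(thread_index, per_cpu_buffer_count, buffer_size) +
	       static_cast<std::size_t>(buffer_index) * buffer_size;
}
```

With the default values (64 buffers of 4096 bytes), thread 1 would start 262144 bytes into the region. Because no two threads share a byte range, no locking is needed around the buffers.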

initiator_system_design.png

Application Architecture

DOCA Storage Zero Copy Initiator Comch executes in three stages:

  1. Preparation.

  2. Data path.

  3. Teardown.

Preparation Stage

During this stage the application performs the following:

  1. Allocates the required DOCA objects and memory for the control path.

  2. Creates a DOCA Comch client and connects to comch_to_rdma.

  3. Sends a "configure data path" control message (buffer count, buffer size, doca_mmap export details) to comch_to_rdma.

  4. Waits for a "configure data path" control message response from comch_to_rdma.

  5. Creates data path objects.

  6. Sends a "start data path connections" control message to comch_to_rdma.

  7. Waits for a "start data path connections" control message response from comch_to_rdma.

  8. Populates all IO messages with the necessary data.

  9. Sends a "start storage" control message to comch_to_rdma.

  10. Waits for a "start storage" control message response from comch_to_rdma.
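
The request/response pattern in the steps above can be sketched as an ordered sequence of exchanges, each bounded by a timeout. This is a simplified illustration only: control_step, wait_for_response, and run_preparation are hypothetical names, the transport is abstracted behind a callable, and the sketch reports failure to the caller, whereas the application aborts when a control operation exceeds --control-timeout.

```cpp
#include <cassert>
#include <chrono>
#include <functional>
#include <initializer_list>

// Hypothetical labels for the three control exchanges of the preparation stage.
enum class control_step { configure_data_path, start_data_path_connections, start_storage };

// Poll `response_ready` until it reports success or the timeout elapses.
inline bool wait_for_response(std::function<bool()> const &response_ready, std::chrono::milliseconds timeout)
{
	assert(timeout.count() >= 0);
	auto const deadline = std::chrono::steady_clock::now() + timeout;
	do {
		if (response_ready())
			return true;
	} while (std::chrono::steady_clock::now() < deadline);
	return false;
}

// Run the exchanges strictly in order and stop at the first failure.
// `exchange` stands in for "send the request, then wait for its response".
// Returns the number of exchanges that completed.
inline int run_preparation(std::function<bool(control_step)> const &exchange)
{
	int completed = 0;
	for (auto const step : {control_step::configure_data_path,
				control_step::start_data_path_connections,
				control_step::start_storage}) {
		if (!exchange(step))
			break;
		++completed;
	}
	return completed;
}
```

Each step depends on the previous response, so the sketch never sends a later request before the earlier one has been acknowledged.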

initiator_preperation_stage.png

Data Path Stage

The data path stage serves as both an example and a built-in benchmark and uses only data path objects. No control path objects or code are used during this stage.

The benchmark begins by submitting all tasks as quickly as possible to start all the transactions, then polls the progress engine (PE) as quickly as possible. Each thread executes the same data path function. As each task completes, it decrements the transaction's reference count. Once this value reaches 0, the transaction can start again. The reference count is required because there are no temporal guarantees between DOCA Comch producer and consumer event callbacks: it is possible to be notified of the consumer completion before being notified of the producer send completion. Once a thread has completed its required number of transactions (the total transaction run limit specified by --run-limit-operation-count divided by the number of threads), that thread exits. Once all threads have joined, the application sends a stop IO message and moves on to the teardown stage.
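
The reference counting described above can be sketched as follows. The type and function names are hypothetical, and the sketch assumes exactly two completions per transaction (the producer send completion and the consumer completion that carries the response); the application's real transaction object may hold more state.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for a transaction object.
struct transaction_sketch {
	std::uint32_t refcount = 0;  // completions still outstanding
	std::uint32_t completed = 0; // finished round trips
};

// Arm the transaction: two completions are expected (assumption, see above).
inline void begin_transaction(transaction_sketch &t)
{
	assert(t.refcount == 0);
	t.refcount = 2;
}

// Called from either completion callback, in whichever order they arrive.
// Returns true only when both completions have been seen, which is the
// point at which the transaction may safely be started again.
inline bool on_completion(transaction_sketch &t)
{
	assert(t.refcount > 0);
	if (--t.refcount != 0)
		return false;
	++t.completed;
	return true;
}

// Per-thread share of the total run limit (integer division).
inline std::uint32_t per_thread_run_limit(std::uint32_t run_limit_operation_count, std::uint32_t thread_count)
{
	assert(thread_count != 0);
	return run_limit_operation_count / thread_count;
}
```

Because on_completion does not care which callback fires first, a consumer completion that arrives before the producer send completion cannot cause the transaction to be resubmitted while its send task is still in flight.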

initiator_data_path_stage.png

Teardown Stage

To teardown, the application performs the following:

  1. Displays execution statistics.

  2. Sends a "destroy objects" control message to comch_to_rdma.

  3. Waits for a "destroy objects" control message response from comch_to_rdma.

  4. Destroys data path objects.

  5. Destroys control path objects.

  6. Destroys any other allocated memory/objects.

DOCA Libraries

This application leverages the following DOCA libraries:

Compiling the Application

This application is compiled as part of the set of storage zero copy applications. For compilation instructions, refer to NVIDIA DOCA Storage Zero Copy.

Running the Application

Application Execution

This application can only be run on the host.

DOCA Storage Zero Copy Initiator Comch is provided in source form. Therefore, it must be compiled before it can be executed.

  • Application usage instructions: 

    Usage: doca_storage_zero_copy_initiator_comch [DOCA Flags] [Program Flags]
    
    DOCA Flags:
      -h, --help                        Print a help synopsis
      -v, --version                     Print program version information
      -l, --log-level                   Set the (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
      --sdk-log-level                   Set the SDK (numeric) log level for the program <10=DISABLE, 20=CRITICAL, 30=ERROR, 40=WARNING, 50=INFO, 60=DEBUG, 70=TRACE>
      -j, --json <path>                 Parse all command flags from an input json file
    
    Program Flags:
      -d, --device                      Device identifier
      --operation                       Operation to perform. One of: read|write
      --run-limit-operation-count       Run N operations then stop
      --cpu                             CPU core to which the process affinity can be set
      --per-cpu-buffer-count            Number of memory buffers to create. Default: 64
      --buffer-size                     Size of each created buffer. Default: 4096
      --validate-writes                 Enable validation of writes operations by reading them back afterwards. Default: false
      --command-channel-name            Name of the channel used by the doca_comch_client. Default: storage_zero_copy_comch
      --control-timeout                 Time (in seconds) to wait while performing control operations. Default: 10
      --batch-size                      Batch size: Default: ${per-cpu-buffer-count} / 2
    


    This usage printout can be displayed on the command line using the -h (or --help) option:

    ./doca_storage_zero_copy_initiator_comch -h
    

    For additional information, refer to section "Command Line Flags".


  • CLI example for running the application on the host:

    ./doca_storage_zero_copy_initiator_comch -d 3b:00.0 --operation read --run-limit-operation-count 10000000 --cpu 5
    


    The DOCA device PCIe address, 3b:00.0, should match the address of the desired PCIe device.


  • The application also supports a JSON-based deployment mode, in which all command-line arguments are provided through a JSON file:

    ./doca_storage_zero_copy_initiator_comch --json [json_file]
    

    For example:

    ./doca_storage_zero_copy_initiator_comch --json doca_storage_reference_zero_copy_host_params.json
    


    Before execution, ensure that the JSON file used contains the correct configuration parameters, especially the PCIe addresses necessary for the deployment.


Command Line Flags

| Flag Type | Short Flag | Long Flag/JSON Key | Description | JSON Content |
|---|---|---|---|---|
| General flags | h | help | Print a help synopsis | N/A |
| | v | version | Print program version information | N/A |
| | l | log-level | Set the log level for the application: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70 (requires compilation with TRACE log level support) | "log-level": 60 |
| | N/A | sdk-log-level | Set the SDK log level for the program: DISABLE=10, CRITICAL=20, ERROR=30, WARNING=40, INFO=50, DEBUG=60, TRACE=70 | "sdk-log-level": 40 |
| | j | json | Parse all command flags from an input JSON file | N/A |
| Program flags | d | device | DOCA device identifier. One of: PCIe address (3b:00.0), InfiniBand name (mlx5_0), network interface name (en3f0pf0sf0). This flag is mandatory. | "device": "3b:00.0" |
| | N/A | operation | Operation to perform: either read or write. This flag is mandatory. | "operation": "read" |
| | N/A | run-limit-operation-count | Run N operations (transactions) then stop. This flag is mandatory. | "run-limit-operation-count": 10000000 |
| | N/A | cpu | Index of CPU to use. One data path thread is spawned per CPU. Index starts at 0. The user can specify this argument multiple times to create more threads. This flag is mandatory. | "cpu": 6 |
| | N/A | per-cpu-buffer-count | Number of buffers (all buffers execute in parallel) to use per CPU | "per-cpu-buffer-count": 64 |
| | N/A | buffer-size | Size of buffer to use for data transfers. Should be a value representative of a disk block size. | "buffer-size": 4096 |
| | N/A | validate-writes | Run a functional test instead of a performance test. Only compatible with write operation mode. | "validate-writes": true |
| | N/A | command-channel-name | Allows customizing the server name used for this application instance if multiple comch servers exist on the same device | "command-channel-name": "storage_zero_copy_comch" |
| | N/A | control-timeout | Timeout (in seconds) for a control operation to complete. If any control operation exceeds this time, the application aborts. | "control-timeout": 10 |
| | N/A | batch-size | Batch size to use when submitting tasks using the batched API | "batch-size": 8 |


Troubleshooting

Refer to DOCA Troubleshooting for any issues encountered with the installation or execution of DOCA applications.

Application Code Flow

Control Thread Flow

  1. Parse application arguments:

    C++
    auto const cfg = parse_cli_args(argc, argv);
    
    1. Prepare the parser (doca_argp_init).

    2. Register parameters (doca_argp_param_create).

    3. Parse the arguments (doca_argp_start).

    4. Destroy the parser (doca_argp_destroy).

  2. Display the configuration:

    C++
    print_config(cfg);
    


  3. Create application instance:

    C++
    g_app.reset(storage::zero_copy::make_host_application(cfg));
    


  4. Run the application:

    C++
    g_app->run();
    
    1. Find and open the specified device:

      C++
      m_dev = storage::common::open_device(m_cfg.device_id);
      


    2. Create control path progress engine:

      C++
      doca_pe_create(&m_ctrl_pe);
      


    3. Create comch control objects:

      C++
      create_comch_control();
      


    4. Connect to comch server:

      C++
      connect_comch_control();
      


    5. Configure storage:

      C++
      configure_storage();
      
      1. Allocate local memory region.

      2. Create doca_mmap.

      3. Send configure data path control message to comch_to_rdma.

      4. Wait for a configure data path control message response from comch_to_rdma.

    6. Prepare data path:

      C++
      prepare_data_path();
      
      1. Create per thread data context:

        1. Create IO messages.

        2. Create transaction objects.

        3. Create progress engine.

        4. Create mmap for IO message buffers.

        5. Create Comch producer.

        6. Create Comch consumer.

      2. Send start data path connections control message to comch_to_rdma.

      3. Wait for a start data path connections control message response from comch_to_rdma.

      4. Poll progress engine until:

        1. Remote consumer ID values have been received.

        2. All consumers are running.

        3. All producers are running.

    7. Create tasks:

      C++
      m_thread_contexts[ii].create_tasks(m_raw_io_data + (ii * per_thread_task_count * m_cfg.buffer_size),
      				   m_cfg.buffer_size,
      				   m_remote_consumer_ids[ii],
      				   op_type,
      				   m_cfg.batch_size);
      


    8. Create threads:

      C++
      if (op_type == io_message_type::read) {
      	m_thread_contexts[ii].thread = std::thread{&thread_hot_data::non_validated_test,
      						   std::addressof(m_thread_contexts[ii].hot_context)};
      } else if (op_type == io_message_type::write) {
      	if (m_cfg.validate_writes) {
      		m_thread_contexts[ii].thread =
      			std::thread{&thread_hot_data::validated_test,
      				    std::addressof(m_thread_contexts[ii].hot_context)};
      	} else {
      		m_thread_contexts[ii].thread =
      			std::thread{&thread_hot_data::non_validated_test,
      				    std::addressof(m_thread_contexts[ii].hot_context)};
      	}
      }
      


    9. Start the data path:

      C++
      wait_for_control_response(send_control_message(control_message_type::start_storage));
      


    10. Record the start time.

    11. Submit initial DOCA Comch consumer tasks.

    12. Start data path threads.

    13. Wait for all threads to complete.

    14. Record the end time.

    15. Stop storage.

    16. Shutdown.

  5. Display statistics:

    C++
    printf("+================================================+\n");
    printf("| Stats\n");
    printf("+================================================+\n");
    printf("| Duration (seconds): %2.06lf\n", duration_secs_float);
    printf("| Operation count: %u\n", stats.operation_count);
    printf("| Data rate: %.03lf GiB/s\n", GiBs / duration_secs_float);
    printf("| IO rate: %.03lf MIOP/s\n", miops);
    printf("| PE hit rate: %2.03lf%% (%lu:%lu)\n",
            pe_hit_rate_pct,
            stats.pe_hit_count,
            stats.pe_miss_count);
    printf("| Latency:\n");
    printf("| \tMin: %uus\n", stats.latency_min);
    printf("| \tMax: %uus\n", stats.latency_max);
    printf("| \tMean: %uus\n", stats.latency_mean);
    printf("+================================================+\n");
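
The derived values printed above (data rate, IO rate, PE hit rate) can be computed along the following lines. This is a sketch under stated assumptions, not the application's exact code: the function names are hypothetical, the data rate assumes every operation moves exactly buffer_size bytes, and the hit rate is assumed to be hits divided by total polls.

```cpp
#include <cassert>
#include <cstdint>

// Data rate in GiB/s, assuming each operation transfers `buffer_size` bytes.
inline double data_rate_gib_per_sec(std::uint64_t operation_count, std::uint32_t buffer_size, double duration_secs)
{
	assert(duration_secs > 0.0);
	double const bytes = static_cast<double>(operation_count) * buffer_size;
	return bytes / (1024.0 * 1024.0 * 1024.0) / duration_secs;
}

// IO rate in millions of IO operations per second.
inline double io_rate_miops(std::uint64_t operation_count, double duration_secs)
{
	assert(duration_secs > 0.0);
	return static_cast<double>(operation_count) / duration_secs / 1e6;
}

// Percentage of progress engine polls that found work to do.
inline double pe_hit_rate_pct(std::uint64_t pe_hit_count, std::uint64_t pe_miss_count)
{
	std::uint64_t const total = pe_hit_count + pe_miss_count;
	return total == 0 ? 0.0 : 100.0 * static_cast<double>(pe_hit_count) / static_cast<double>(total);
}
```

For example, 262144 operations of 4096 bytes completed in one second correspond to exactly 1 GiB/s.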
    


Performance Data Path Thread Flow

  1. Start transactions:

    C++
    for (uint32_t ii = 0; ii != transactions_size; ++ii)
    	start_transaction(transactions[ii], std::chrono::steady_clock::now());
    


  2. Run until N operations have been completed:

    C++
    while (run_flag) {
    	doca_pe_progress(data_pe) ? ++(pe_hit_count) : ++(pe_miss_count);
    }
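
The polling loop can be exercised without hardware by substituting a stand-in for the progress engine. In the sketch below, fake_progress_engine and poll_until_done are hypothetical; the stand-in mimics doca_pe_progress only in that it returns 1 when work was processed and 0 otherwise, and it treats each hit as one completed operation, whereas in the application the completion callbacks decide when to clear run_flag.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for a progress engine: one poll in `busy_every` finds work.
struct fake_progress_engine {
	std::uint32_t call_count = 0;
	std::uint32_t busy_every = 4;

	std::uint8_t progress()
	{
		return (++call_count % busy_every) == 0 ? 1 : 0;
	}
};

struct poll_stats {
	std::uint64_t pe_hit_count = 0;
	std::uint64_t pe_miss_count = 0;
};

// Busy-poll until `remaining_ops` operations have completed, counting
// productive polls (hits) and empty polls (misses).
inline poll_stats poll_until_done(fake_progress_engine &pe, std::uint32_t remaining_ops)
{
	poll_stats stats;
	bool run_flag = remaining_ops != 0;
	while (run_flag) {
		if (pe.progress()) {
			++stats.pe_hit_count;
			// Simplification: one hit completes one operation.
			run_flag = (--remaining_ops != 0);
		} else {
			++stats.pe_miss_count;
		}
	}
	return stats;
}
```

A low hit rate in this kind of loop indicates that the thread spends most of its time polling an idle engine, which is the trade-off a busy-polling design accepts in exchange for low latency.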
    


Functional Data Path Thread Flow

  1. Determine the number of iterations to execute (each iteration is up to --per-cpu-buffer-count transactions):

    C++
    uint32_t const iteration_count =
    		(remaining_tx_ops / transactions_size) + ((remaining_tx_ops % transactions_size) == 0 ? 0 : 1);
    


  2. For each iteration:

    1. Set data in local memory region to a fixed pattern.

    2. Set all transactions to write mode:

      C++
      void thread_hot_data::set_operation(io_message_type operation)
      {
      	for (uint32_t ii = 0; ii != transactions_size; ++ii) {
      		auto *io_message = const_cast<char *>(storage::common::get_buffer_bytes(
      			doca_comch_producer_task_send_get_buf(transactions[ii].request)));
      		// The call that writes `operation` into io_message is not shown in this excerpt.
      	}
      }
      


    3. Start all transactions.

    4. Poll the PE until all transactions complete.

    5. Set data in local memory region to an alternative fixed pattern.

    6. Set all transactions to read mode.

    7. Start all transactions.

    8. Poll the PE until all transactions complete.

    9. Validate that all data in the local memory region has been modified and reflects the original data pattern, not the alternative pattern.
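
The functional flow above can be sketched against an in-memory stand-in for the remote storage. Everything here is illustrative: iteration_count restates the rounding-up expression from step 1, while validated_iteration, the callables it takes, and the pattern bytes 'A' and 'B' are hypothetical and replace the real Comch/RDMA round trips.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// Number of iterations needed to cover `remaining_tx_ops` operations in
// groups of `transactions_size`, rounding up for a partial final group.
inline std::uint32_t iteration_count(std::uint32_t remaining_tx_ops, std::uint32_t transactions_size)
{
	assert(transactions_size != 0);
	return (remaining_tx_ops / transactions_size) + ((remaining_tx_ops % transactions_size) == 0 ? 0 : 1);
}

// One write-then-read-back iteration. `write_all` and `read_all` stand in
// for "start all transactions and poll the PE until they complete".
template <typename WriteFn, typename ReadFn>
bool validated_iteration(std::vector<char> &local, WriteFn write_all, ReadFn read_all)
{
	char const original_pattern = 'A';    // hypothetical pattern value
	char const alternative_pattern = 'B'; // hypothetical pattern value

	std::fill(local.begin(), local.end(), original_pattern);    // step 1
	write_all(local);                                            // steps 2-4
	std::fill(local.begin(), local.end(), alternative_pattern); // step 5
	read_all(local);                                             // steps 6-8

	// Step 9: the read must have replaced the alternative pattern with the
	// data that was written earlier.
	return std::all_of(local.begin(), local.end(), [original_pattern](char c) {
		return c == original_pattern;
	});
}
```

Overwriting the local region with the alternative pattern before reading is what makes the check meaningful: if the read silently did nothing, the buffer would still hold the alternative pattern and validation would fail.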

References

  • /opt/mellanox/doca/applications/storage/
