Changes and New Features
| Feature/Change | Description |
|---|---|
| Switches | Added support for NVIDIA Quantum-2 switches with NDR speed. |
| Adapter Cards | Added support for the NVIDIA ConnectX-7 adapter card with 400 Gb/s speed. |
| SHARPD | The sharpd daemon process has been removed. sharpd-related activity is now performed within the user application process. |
| AM | After a restart, AM no longer needs to wait for all concurrent jobs to finish before it can accept new jobs. |
| AM | Added a mechanism that periodically checks for errors in Aggregation Trees and attempts to fix them. |
| General | Added support for the new data types BFLOAT16, INT8 and UINT8 in reduction operations. |
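
For readers unfamiliar with the newly supported data types, the following minimal Python sketch illustrates the BFLOAT16 layout (the upper 16 bits of an IEEE 754 float32) and the value ranges of INT8 and UINT8. It is not SHARP code: the helper names are hypothetical, and conversion by truncation is an assumption made for illustration. These notes do not state how SHARP rounds or handles special values.

```python
import struct

# Value ranges of the 8-bit integer types named in the table.
INT8_RANGE = (-128, 127)
UINT8_RANGE = (0, 255)


def float_to_bfloat16_bits(x: float) -> int:
    """Return a 16-bit BFLOAT16 pattern for x.

    Assumption: plain truncation of the float32 encoding (no rounding,
    no special handling of NaN), chosen only to keep the sketch short.
    """
    (bits32,) = struct.unpack("<I", struct.pack("<f", x))
    return bits32 >> 16  # keep sign, 8 exponent bits, top 7 mantissa bits


def bfloat16_bits_to_float(bits16: int) -> float:
    """Expand a 16-bit BFLOAT16 pattern back to a Python float."""
    (value,) = struct.unpack("<f", struct.pack("<I", (bits16 & 0xFFFF) << 16))
    return value


if __name__ == "__main__":
    bits = float_to_bfloat16_bits(3.14159)
    # The round trip shows the reduced precision of BFLOAT16.
    print(hex(bits), bfloat16_bits_to_float(bits))
```

BFLOAT16 keeps the full float32 exponent range while giving up mantissa precision, which is why the round-tripped value differs slightly from the input.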
Parameter Changes
| Parameter | Component | Description |
|---|---|---|
| recovery_retry_interval | sharp_am | New parameter: Timeout, in seconds, for tree recovery retries. A value of 0 disables tree recovery. Default: 300 |
| enable_seamless_restart | sharp_am | New parameter: A boolean flag. If enabled, AM tries to recover its state from the last AM run and continue operating the current jobs. Default: True |
| seamless_restart_trees_file | sharp_am | New parameter: The SHARP trees file used in seamless restart. Specify only the file name; the full path is constructed using ‘dump_dir’. Default: sharp_am_trees_structure.dump |
| seamless_restart_max_retries | sharp_am | New parameter: The number of consecutive seamless restart retries. If seamless restart fails more times in a row than this value, it is disabled in the next run. Default: 3 |
| max_tree_radix | sharp_am | Update: Default changed to 252. |
| Ib_sat_max_mtu | sharp_am | Update: Default changed to 5, to support the MAD value that represents a 4K MTU. |
| per_prio_default_quota | sharp_am | Update: Default changed from 20 to 3, enabling more SAT jobs to run in parallel on each switch. |
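
The parameter semantics described in the table can be sketched in Python as follows. This is an illustration only, not the sharp_am implementation or its configuration file format: the dictionary, the helper functions, and the example directory `/tmp/sharp_dump` are all hypothetical. The reading that seamless restart stays allowed while the failure count does not exceed `seamless_restart_max_retries` is an assumption based on the wording of the description. The MTU mapping uses the standard InfiniBand MTU encoding.

```python
import os

# Defaults copied from the table; this dict is illustrative, not a SHARP API.
SHARP_AM_DEFAULTS = {
    "recovery_retry_interval": 300,
    "enable_seamless_restart": True,
    "seamless_restart_trees_file": "sharp_am_trees_structure.dump",
    "seamless_restart_max_retries": 3,
    "max_tree_radix": 252,
    "Ib_sat_max_mtu": 5,
    "per_prio_default_quota": 3,
}

# Standard InfiniBand MTU encoding: enum value -> MTU in bytes.
IB_MTU_ENUM_TO_BYTES = {1: 256, 2: 512, 3: 1024, 4: 2048, 5: 4096}


def trees_file_path(dump_dir, params=SHARP_AM_DEFAULTS):
    """Build the trees file path from a dump directory and the file name."""
    return os.path.join(dump_dir, params["seamless_restart_trees_file"])


def tree_recovery_enabled(params=SHARP_AM_DEFAULTS):
    """A recovery_retry_interval of 0 means trees are not recovered."""
    return params["recovery_retry_interval"] != 0


def seamless_restart_allowed(consecutive_failures, params=SHARP_AM_DEFAULTS):
    """Assumed reading: disabled once failures exceed the configured maximum."""
    if not params["enable_seamless_restart"]:
        return False
    return consecutive_failures <= params["seamless_restart_max_retries"]


if __name__ == "__main__":
    print(trees_file_path("/tmp/sharp_dump"))  # hypothetical dump_dir
    print(IB_MTU_ENUM_TO_BYTES[SHARP_AM_DEFAULTS["Ib_sat_max_mtu"]], "byte MTU")
    print(seamless_restart_allowed(consecutive_failures=4))
```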