Abstract: We study the optimal parallelization strategy for large language models (LLMs) and demonstrate that LLM training workloads generate sparse communication patterns in the network. Consequently, ...