Abstract: We study optimal parallelization strategies for large language models (LLMs) and demonstrate that LLM training workloads generate sparse communication patterns in the network. Consequently, ...