Introduction: As the adoption of Machine Learning (ML) continues to grow across industries, concerns regarding its environmental impact and power consumption have emerged. It is projected that over the next five years, the market for AI will grow roughly tenfold, into hundreds of billions of dollars [1].
Objective: The Graph Attention Network (GAT) is a neural network architecture designed for graph-structured data; it uses masked self-attentional layers to address the limitations of earlier graph convolution methods. Unlike CNNs, which work well with grid-like data (e.g., images), GATs are designed for irregular data structures such as social networks, biological networks, and 3D meshes, allowing nodes to focus on different parts of their neighborhoods.
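To make the neighborhood-attention idea concrete, below is a minimal GAT sketch using PyTorch Geometric's GATConv (illustrative only; the benchmark's actual model and framework may differ, and the dimensions here are made up):

    # Minimal two-layer GAT for node classification (illustrative sketch).
    import torch
    import torch.nn.functional as F
    from torch_geometric.nn import GATConv

    class GAT(torch.nn.Module):
        def __init__(self, in_dim, hidden_dim, num_classes, heads=4):
            super().__init__()
            # Each node attends over its neighbors with `heads` attention heads.
            self.conv1 = GATConv(in_dim, hidden_dim, heads=heads)
            self.conv2 = GATConv(hidden_dim * heads, num_classes, heads=1)

        def forward(self, x, edge_index):
            x = F.elu(self.conv1(x, edge_index))
            return self.conv2(x, edge_index)

    # Toy usage: 4 nodes with 16 features each and a few directed edges.
    x = torch.randn(4, 16)
    edge_index = torch.tensor([[0, 1, 2, 3], [1, 0, 3, 2]])
    logits = GAT(16, 8, 3)(x, edge_index)  # shape [4, 3]: class scores per node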
The aim of this report is to analyze the energy consumption of the Graph Neural Network (GNN) benchmark on different hardware, compare executions of the benchmark on two differently sized datasets, investigate the power usage of separate components of the algorithm, and suggest how ML models can be trained both efficiently and sustainably.
Background: The GNN benchmark represents a multi-class node classification task, carried out using a GAT model on a graph-based dataset, namely a citation graph.
For power usage measurement, I chose CodeCarbon, a Python library that enables developers to track carbon dioxide (CO₂) emissions across machine learning experiments or other programs. It is easy to integrate into the workflow and has considerably low overhead.
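As an illustration of how lightweight the integration is, the tracker can simply wrap the training run (a minimal sketch; run_benchmark() is a hypothetical stand-in for the benchmark's training entry point):

    # Minimal sketch: estimate the energy and emissions of one training run.
    from codecarbon import EmissionsTracker

    def run_benchmark():
        ...  # placeholder: launch the GAT training here

    tracker = EmissionsTracker(project_name="gnn-benchmark")  # writes emissions.csv by default
    tracker.start()
    try:
        run_benchmark()
    finally:
        emissions_kg = tracker.stop()  # estimated kg of CO2eq for this run
    print(f"Estimated emissions: {emissions_kg:.4f} kg CO2eq")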
Carbon dioxide emissions (CO₂eq) can be calculated as C * E, where
C = Carbon Intensity of the electricity consumed for computation, quantified as grams of CO₂ emitted per kilowatt-hour of electricity, and
E = Energy Consumed by the computational infrastructure, quantified in kilowatt-hours.
The carbon intensity of the consumed electricity is calculated as a weighted average of the emissions from the different energy sources used to generate that electricity.
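For illustration (a hypothetical grid mix with approximate lifecycle emission factors, not figures from these experiments): if 60% of the electricity came from hydro (about 24 gCO₂/kWh) and 40% from natural gas (about 490 gCO₂/kWh), the carbon intensity would be C = 0.6 * 24 + 0.4 * 490 ≈ 210 gCO₂/kWh, and a job consuming E = 2 kWh would then emit roughly 2 * 210 = 420 g of CO₂eq.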
The experiments were conducted on McGill's gpu-teach-01 node and on Compute Canada's Beluga and Narval clusters. The two datasets used to train the GAT model were the 'tiny' and 'small' sized ones provided by the benchmark.
Beluga and Narval nodes have four GPUs each, Nvidia Tesla V100s and A100s respectively. gpu-teach-01 has Nvidia RTX A2000s (but multi-GPU usage is disabled).
Common Findings
Across all three machines and all GPU setups, batch training accounted for almost all of the energy consumed by the system during execution of the benchmark. The model rapidly gains accuracy within the first 5-10 batches and then gradually converges towards the benchmarked accuracy value of 0.72.
Above, we see that on all three machines the model consumes more power as it becomes better at classification. On gpu-teach-01, the model tends to approach the benchmarked accuracy (0.72) at a lower power usage than on the other machines. We can deduce that there is a positive relationship between validation accuracy and the power usage of the system.
How could we attempt to lower power usage? Three quantities are involved: accuracy, power usage, and the time taken to train the model. We would like our model to be highly accurate and to train at a low energy cost. Making an effort to bring down training times is certainly worth exploring.
Optimizations
We could try using multiple GPUs to train our model in the hope of lowering training times:
Comment on multi-GPU usage: On Beluga and Narval, there does not seem to be any advantage to using more than one GPU on the tiny dataset: both training time and energy used increase on both machines. On the small dataset, although Beluga's training times barely showed a decreasing trend as more GPUs were involved, Narval showed a favorable trend, with training time decreasing by about 50% while energy stayed almost the same. What explains this?
Using more than one GPU does not necessarily result in increased performance if the model or dataset is not large or complex enough (the tiny dataset is only 2.97 GB and the small dataset 12.7 GB). This is mainly because the GPUs must synchronize at each gradient update step during training. This synchronization is relatively expensive in time, and since the gradient update is repeated many times over multiple epochs, it introduces a large overhead. Therefore, the notion that energy or training time decreases proportionally with the number of GPUs involved does not always hold.
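To make the synchronization point concrete, below is a minimal data-parallel training sketch in PyTorch (illustrative only, not the benchmark's actual training loop; the model and data are placeholders). The gradient all-reduce inside every backward() call is the per-step overhead described above:

    # Minimal DistributedDataParallel (DDP) sketch; launch with e.g.
    #   torchrun --nproc_per_node=4 ddp_sketch.py
    import os
    import torch
    import torch.distributed as dist
    import torch.nn as nn
    from torch.nn.parallel import DistributedDataParallel as DDP

    def main():
        dist.init_process_group(backend="nccl")      # one process per GPU
        local_rank = int(os.environ["LOCAL_RANK"])
        torch.cuda.set_device(local_rank)
        device = f"cuda:{local_rank}"

        # Placeholder model and random data; the benchmark trains a GAT on a citation graph.
        model = DDP(nn.Linear(128, 64).to(device), device_ids=[local_rank])
        opt = torch.optim.Adam(model.parameters(), lr=1e-3)
        loss_fn = nn.CrossEntropyLoss()

        for step in range(100):
            x = torch.randn(512, 128, device=device)
            y = torch.randint(0, 64, (512,), device=device)
            opt.zero_grad()
            loss_fn(model(x), y).backward()  # gradients are all-reduced across all GPUs here
            opt.step()                       # each process applies the same averaged gradients

        dist.destroy_process_group()

    if __name__ == "__main__":
        main()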
The 498 GB memory nodes on Narval allowed larger batch sizes to be loaded onto the GPUs, and this may have compensated for the bottlenecks in the small dataset's parallel training.
An important takeaway is to investigate when parallelized training might be beneficial in reducing training times and bringing down energy costs, and when it might be best to avoid it.
Three-way comparison (small dataset, single-GPU use case)
Gpu-teach-01 was the most energy-efficient even though it was the slowest, while Narval was the fastest at training the model. This difference may be attributed to the varying capabilities of the Nvidia RTX A2000, Tesla V100, and A100 on gpu-teach-01, Beluga, and Narval respectively, and to the optimal batch sizes used on each (192, 128, and 512 respectively).
Batch size is also a hyperparameter, and carefully tuning it to the nature of the model and the specifics of the hardware can significantly lower training times and, consequently, energy usage. The following shows the results of training the model with batch sizes 32, 64, 128, and 196 using one GPU:
We see that the optimal batch size configurations vary from machine to machine.
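A sweep like the one above could be scripted as follows (a hedged sketch that again assumes CodeCarbon as the tracker; train_model() is a hypothetical stand-in for the benchmark's training routine):

    # Illustrative batch-size sweep with per-run time and emissions measurement.
    import time
    from codecarbon import EmissionsTracker

    def train_model(batch_size: int) -> None:
        ...  # placeholder: run the GAT training loop with the given batch size

    results = {}
    for bs in (32, 64, 128, 196):
        tracker = EmissionsTracker(project_name=f"gat-batch-{bs}")
        tracker.start()
        t0 = time.time()
        train_model(bs)
        emissions_kg = tracker.stop()  # estimated kg of CO2eq for this run
        results[bs] = {"time_s": time.time() - t0, "kg_co2eq": emissions_kg}

    for bs, r in sorted(results.items()):
        print(f"batch={bs:4d}  time={r['time_s']:.1f} s  CO2eq={r['kg_co2eq']:.4f} kg")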
Limitations
Gpu-teach-01 did not have multi-GPU usage enabled, and I often experienced issues with its filesystem; McGill IT was extremely prompt and supportive in resolving those issues. Beluga's GPUs had 16 GB of VRAM and gpu-teach-01's 12 GB, whereas Narval had 40 GB on each GPU. It would have been interesting to observe how each machine would perform if its GPUs could load larger batch sizes during training.
Conclusion
To address the adverse effects of AI's rising carbon footprint, the efficacy of employing multi-GPU training should be studied and batch size configurations optimized, in an effort to train learning systems quickly, keep operational energy costs minimal, and strive for sustainability.
I would like to thank my supervisor Dr. Bettina Kemme for her constant mentorship throughout the project, and graduate student Lancelot Normand for the Compute Canada onboarding. I would also like to thank the Arts Internship Office for supporting my research.