I was in the process of writing the next blog post on my initial experiments testing libraries for programming language detection when I noticed a minor detail about the environment setup. Inspired by that, I’m writing this post to share some details on how to “control” the hardware environment so that an experiment is as replicable and consistent as possible.
Initially requesting hardware resources
When beginning the experiment, I followed several criteria (which I will share in more detail in a subsequent post) to select the machine(s) I was to use on Grid5000, an HPC platform commonly used by computer science researchers in France. Rather than selecting a particular machine, I made a request for a particular configuration of resources based on my criteria:
For example:
- more than `200 GB` of memory
- 2 x `A40` GPUs available

In my case, I was looking for 4 x `Nvidia Tesla V100 (32 GiB)` to satisfy my speed needs when running machine learning.
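On Grid5000, such a request goes through the OAR scheduler. Below is a minimal sketch, wrapped in Python, of what a property-based reservation could look like. The property names (`gpu_model`, `gpu_count`) and the `run_experiment.sh` script are assumptions for illustration; check the current Grid5000 OAR documentation for the exact property names available on your site.

```python
# Hypothetical sketch of a property-based OAR reservation on Grid5000, built
# and submitted from Python. The property names (gpu_model, gpu_count) and the
# run_experiment.sh script are illustrative assumptions -- check the Grid5000
# OAR properties documentation for the exact names available on your site.
import subprocess

properties = "gpu_model = 'Tesla V100' AND gpu_count >= 4"

cmd = [
    "oarsub",
    "-p", properties,                   # filter candidate nodes by property
    "-l", "host=1,walltime=2:00:00",    # one full node for two hours
    "./run_experiment.sh",              # hypothetical batch script to run
]

subprocess.run(cmd, check=True)
```

The point of requesting by property rather than by name is convenience: any node matching the criteria can be assigned, which is exactly what makes the rest of this post relevant.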
After briefly checking the machines available to me, I saw that there were a handful fitting my criteria; these are the ones I was assigned to work with each time.
This information is available in the Grid5000 hardware descriptions [1].
| Cluster | Access Condition | Date of arrival | Manufacturing date | Nodes | CPUs | CPU Name | CPU Cores | CPU Architecture | Memory | Storage | Network | Accelerators |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| abacus9 | abaca queue | 2023-11-06 | 2018-12-12 | 1 | 2 | Intel Xeon Silver 4114 | 10 cores/CPU | x86_64 | 192 GiB | 239 GB HDD | 10 Gbps | 4 x Nvidia Tesla V100 (32 GiB) |
| abacus14 | abaca queue | 2023-11-08 | 2019-09-26 | 1 | 2 | Intel Xeon Silver 4214 | 12 cores/CPU | x86_64 | 384 GiB | 240 GB SSD + 240 GB SSD | 10 Gbps | 4 x Nvidia Tesla V100 (32 GiB) |
| abacus16 | abaca queue | 2023-10-16 | 2019-09-26 | 1 | 2 | Intel Xeon Silver 4214 | 12 cores/CPU | x86_64 | 384 GiB | 239 GB HDD | 10 Gbps | 4 x Nvidia Tesla V100 (32 GiB) |
| abacus28 | abaca queue | 2025-02-05 | 2020-12-05 | 1 | 2 | Intel Xeon Silver 4216 | 16 cores/CPU | x86_64 | 384 GiB | 480 GB SSD | 10 Gbps | 4 x Nvidia Tesla V100 (32 GiB) |
Hence, knowing that I’d specified a particular number of GPUs, I was confident that my experiment was consistent across several machines with the same apparent configuration, particularly since reserving a similar machine within a reasonable time on Grid5000 is a persistent challenge (another item that I’ll cover in a subsequent post).
Discovering that not all hardware configurations are the same
Even in the table above, extracted from the Grid5000 resource descriptions, it’s notable that the CPU models, core counts, and memory are not exactly the same on each machine. I didn’t notice this at the time, especially since the GPUs were the most important factor and the differences seemed minute.
I’d run my experiments with full confidence, recording my main independent variable, the GPU, as “Nvidia Tesla V100.”
What I noticed later, while browsing another resource explorer, was that these “Nvidia Tesla V100” GPUs come in multiple variants as well! Rereading each machine’s configuration now, it’s apparent that the GPUs, as well as other components, vary slightly [1].
Hence, I had actually run on both `Tesla V100-PCIE-32GB` and `Tesla V100-SXM2-32GB` GPUs.
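One way to catch this earlier is to log the exact GPU product string at the start of every run rather than the generic family name. A minimal sketch, assuming PyTorch with CUDA support is installed on the node:

```python
# Minimal sketch: log the exact GPU product string at the start of a run, so
# that "Tesla V100-PCIE-32GB" vs "Tesla V100-SXM2-32GB" is recorded rather
# than assumed. Assumes PyTorch with CUDA support is installed on the node.
import torch

if torch.cuda.is_available():
    for i in range(torch.cuda.device_count()):
        # get_device_name returns the full product string, which distinguishes
        # the PCIe and SXM2 variants of the "same" V100.
        print(f"GPU {i}: {torch.cuda.get_device_name(i)}")
else:
    print("No CUDA device visible")
```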
Each “machine” is unique
To further reinforce this, abacus14 and abacus16 both have 4 x `Tesla V100-PCIE-32GB` GPUs with `Intel Xeon Silver 4214` CPUs, so they appear quite similar. But let’s be careful.
Here are their configurations from Grid5000’s information [1]:
abacus14
1 node, 2 cpus, 24 cores
| Configuration | value |
|---|---|
| Access condition: | abaca queues |
| Model: | Dell PowerEdge C4140 |
| Manufacturing date: | 2019-09-26 |
| Date of arrival: | 2023-11-08 |
| CPU: | Intel Xeon Silver 4214 (Cascade Lake-SP), x86_64, 2.20GHz, 2 CPUs/node, 12 cores/CPU |
| Memory: | 384 GiB |
| Storage: | - disk0, 240 GB SSD SATA Micron MTFDDAV240TCB (dev: /dev/disk0) (primary disk) - disk1, 240 GB SSD SATA Micron MTFDDAV240TCB (dev: /dev/disk1) |
| Network: | - eth0/eno1, Ethernet, configured rate: 10 Gbps, model: Intel Ethernet Controller X710 for 10GbE SFP+, driver: i40e - eth1/eno2, Ethernet, model: Intel Ethernet Controller X710 for 10GbE SFP+, driver: i40e - unavailable for experiment |
| GPU: | 4 x Nvidia Tesla V100-PCIE-32GB (32 GiB) Compute capability: 7.0 |
abacus16
1 node, 2 cpus, 24 cores
| Configuration | value |
|---|---|
| Access condition: | abaca queues |
| Model: | Dell PowerEdge C4140 |
| Manufacturing date: | 2019-09-26 |
| Date of arrival: | 2023-10-16 |
| CPU: | Intel Xeon Silver 4214 (Cascade Lake-SP), x86_64, 2.20GHz, 2 CPUs/node, 12 cores/CPU |
| Memory: | 384 GiB |
| Storage: | disk0, 239 GB HDD RAID Dell DELLBOSS VD (dev: /dev/disk0) (primary disk) |
| Network: | - eth0/eno1, Ethernet, configured rate: 10 Gbps, model: Intel Ethernet Controller X710 for 10GbE SFP+, driver: i40e - eth1/eno2, Ethernet, model: Intel Ethernet Controller X710 for 10GbE SFP+, driver: i40e - unavailable for experiment |
| GPU: | 4 x Nvidia Tesla V100-PCIE-32GB (32 GiB) Compute capability: 7.0 |
Despite their apparent similarity, their storage configurations are different. In this case it may be a minor detail, but it illustrates that a “similar” configuration/machine is almost never truly equivalent. This matters even more when working with microcontrollers, as most engineers who do so will attest.
In my situation, the machine’s storage may not be detrimental to the experiment’s control quality. On the other hand, I may simply not be aware of its impact at this point, just as I wasn’t aware that the CPUs of the several machines I’d worked with were different.
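For this kind of comparison, the Grid5000 reference API exposes each node’s description as JSON, which makes it possible to diff two “twin” machines programmatically instead of eyeballing the wiki tables. A hedged sketch, in which the URL template, the site name, and the use of the third-party `deepdiff` package are assumptions:

```python
# Hedged sketch: fetch two node descriptions from the Grid5000 reference API
# and diff them programmatically. The URL template, the site name, and the use
# of the third-party deepdiff package are assumptions for illustration.
import requests
from deepdiff import DeepDiff  # pip install deepdiff

API = "https://api.grid5000.fr/stable/sites/{site}/clusters/{cluster}/nodes/{node}"
SITE = "nancy"        # hypothetical; replace with the site hosting the cluster
CLUSTER = "abacus"

def node_description(node: str) -> dict:
    """Fetch the JSON description of a node from the reference API."""
    resp = requests.get(API.format(site=SITE, cluster=CLUSTER, node=node))
    resp.raise_for_status()
    return resp.json()

# Differences such as the SSD vs HDD storage show up explicitly in the diff.
diff = DeepDiff(node_description("abacus14"), node_description("abacus16"))
print(diff.pretty())
```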
Conclusions and best practices
The principle of applying an “appropriate level of specificity” is important in research and elsewhere. But in this case, the experiments’ independent variables can only be kept consistent by either examining each machine’s full configuration in detail or by requesting the exact same machine again.
Principles for ensuring HW environment consistency in throughput research
- read the machine(s)’ configurations scrupulously and in great detail
- record the hardware configurations in every experiment run (a minimal sketch follows this list)
- if possible, request the same individual machine for every run of an experiment (even if there are other “twin” machines which appear or claim to be the same!)
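As a concrete illustration of the second point, here is a minimal sketch of a hardware “fingerprint” that can be dumped next to every result file; the output filename and the reliance on `nvidia-smi` are assumptions for illustration:

```python
# Minimal sketch of the "record the hardware configuration in every run"
# principle: dump a small hardware fingerprint next to each result file.
# The output filename and the reliance on nvidia-smi are assumptions.
import json
import platform
import socket
import subprocess
from datetime import datetime, timezone

def gpu_names() -> list[str]:
    """Return the full product names of visible GPUs via nvidia-smi, if any."""
    try:
        out = subprocess.run(
            ["nvidia-smi", "--query-gpu=name", "--format=csv,noheader"],
            capture_output=True, text=True, check=True,
        )
        return [line.strip() for line in out.stdout.splitlines() if line.strip()]
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []

fingerprint = {
    "timestamp": datetime.now(timezone.utc).isoformat(),
    "hostname": socket.gethostname(),   # e.g. abacus14 vs abacus16
    "cpu": platform.processor(),
    "architecture": platform.machine(),
    "gpus": gpu_names(),
}

with open("hardware_fingerprint.json", "w") as f:
    json.dump(fingerprint, f, indent=2)
```

Recording the hostname alongside the GPU and CPU details is what makes it possible, after the fact, to tell whether two runs really happened on the same individual machine.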
Fortunately for me, many of my runs happened on the same (yes, the same unique/individual) machine. Hence begins the process of rerunning the remaining scenarios on the same configuration/machine…
References
[1] “Grid5000,” 2025. [Online]. Available: https://www.grid5000.fr/w/Grid5000:Home