The Open Cloud Testbed builds on these three projects. The hardware is located in the MGHPCC. A researcher runs the Vitis FPGA development tools in the MOC and then requests hardware from CloudLab to deploy and run their experiments, as shown in Figure 1. OCT currently provides eight Xilinx Alveo U280 FPGAs, each connected to its own host via PCIe and each with two 100 Gbit/s connections directly to the data switch using QSFP28 transceivers; eight more will be deployed in the near future.
What is different about OCT? It gives users direct access to FPGAs in the cloud and allows them to experiment with the entire setup, including the OS. While development tools are provided, the deployed system is bare metal, giving the user complete freedom to conduct their research. The FPGAs are directly connected to the network, enabling direct FPGA-to-FPGA connections and SmartNIC-based experiments. FPGAs can communicate over the network using either the TCP/IP stack provided by ETH Zürich or the UDP stack provided by Xilinx. The OCT model is a combination of PaaS and IaaS: development tools are made available as a platform, coupled with bare-metal deployment of the infrastructure. This FPGA offering is much more flexible than that available from AWS, which does not include network-attached FPGAs, or from the Microsoft Catapult project, which does not allow users to directly program the FPGAs.
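As a rough illustration of how a user kernel attaches to one of these stacks, the sketch below shows a minimal Vitis HLS loopback kernel, assuming the 512-bit AXI4-Stream interface exposed by the Xilinx UDP stack; the kernel and port names are hypothetical, and the exact stream widths and side-channel fields depend on the stack version in use.
\begin{verbatim}
// Minimal Vitis HLS sketch (hypothetical kernel/port names): a free-running
// kernel that forwards packets from the network stack's RX stream straight
// back to its TX stream, assuming 512-bit AXI4-Stream interfaces.
#include "ap_axi_sdata.h"
#include "hls_stream.h"

// One 64-byte stream beat with a 16-bit dest side channel (assumed layout).
typedef ap_axiu<512, 0, 0, 16> pkt_t;

extern "C" void network_loopback(hls::stream<pkt_t> &rx,
                                 hls::stream<pkt_t> &tx) {
#pragma HLS INTERFACE axis port=rx
#pragma HLS INTERFACE axis port=tx
#pragma HLS INTERFACE ap_ctrl_none port=return

    while (true) {
#pragma HLS PIPELINE II=1
        pkt_t beat = rx.read();  // blocking read of one beat from the stack
        tx.write(beat);          // forward unchanged; TLAST marks packet end
    }
}
\end{verbatim}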
Initial results show the advantage of direct FPGA-to-FPGA communication over the network. Using an example benchmark from Xilinx, the measured round-trip time (RTT) is around 1 microsecond. The RTT is measured by starting a counter just before the packet is sent by FPGA 1 and stopping it once the same packet is received back at FPGA 1 after being sent to FPGA 2. We also measured the OpenCL kernel execution time for an application running between two FPGAs, where the communication goes from Host 1 to FPGA 1 over PCIe and then from FPGA 1 to FPGA 2 over the network. The packet is then looped back from FPGA 2 to FPGA 1, which receives the packet and transmits it back to Host 1. The kernel execution time between sending and receiving a packet was observed to be between 200 and 300 microseconds. The high latency in this case is due to OpenCL function calls and the overhead of going through the PCIe connection, whereas the direct FPGA-to-FPGA communication takes about 1 microsecond. These results argue for disaggregating FPGAs and host computers and treating the FPGA as a first-class citizen in the cloud.
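The host side of this path can be sketched as follows. Our measurements used OpenCL calls; the sketch below uses the XRT native C++ API instead, and the xclbin path, kernel name, and argument layout are hypothetical. It only illustrates where the PCIe transfers and the kernel invocation, and hence the additional latency, occur.
\begin{verbatim}
// Hedged sketch of the Host 1 -> FPGA 1 -> network -> FPGA 2 -> FPGA 1 ->
// Host 1 round trip using the XRT native C++ API. The xclbin path, kernel
// name, and argument order are hypothetical.
#include <chrono>
#include <cstdint>
#include <iostream>
#include <vector>

#include <xrt/xrt_bo.h>
#include <xrt/xrt_device.h>
#include <xrt/xrt_kernel.h>

int main() {
    const size_t packet_bytes = 1024;                   // example payload size
    xrt::device device(0);                              // host-attached U280
    auto uuid = device.load_xclbin("loopback.xclbin");  // hypothetical bitstream
    xrt::kernel krnl(device, uuid, "packet_roundtrip"); // hypothetical kernel

    // Buffers for the outgoing and returned packet in the kernel's memory banks.
    xrt::bo tx_bo(device, packet_bytes, krnl.group_id(0));
    xrt::bo rx_bo(device, packet_bytes, krnl.group_id(1));

    std::vector<uint8_t> payload(packet_bytes, 0xAB);
    tx_bo.write(payload.data());

    auto t0 = std::chrono::high_resolution_clock::now();
    tx_bo.sync(XCL_BO_SYNC_BO_TO_DEVICE);               // PCIe: host -> FPGA 1
    auto run = krnl(tx_bo, rx_bo, packet_bytes);        // send, loop back via FPGA 2
    run.wait();                                         // wait for kernel completion
    rx_bo.sync(XCL_BO_SYNC_BO_FROM_DEVICE);             // PCIe: FPGA 1 -> host
    auto t1 = std::chrono::high_resolution_clock::now();

    std::cout << "host-measured round trip: "
              << std::chrono::duration_cast<std::chrono::microseconds>(t1 - t0).count()
              << " us\n";
    return 0;
}
\end{verbatim}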
OCT documentation is available on GitHub, where we host getting-started tutorials for both MOC and CloudLab. These tutorials demonstrate the workflow for stand-alone and network-attached accelerator development and deployment from the ground up.
Security in the Cloud
The bare-metal approach used in OCT gives the research community maximum freedom for system experimentation and evaluation, but it also carries certain security risks. Provisioning bare-metal servers gives users access to all components of the system, which means that they can also compromise (by accident or on purpose) the firmware of the system. Such modifications can impact the security of the system, and subsequent users may work on a compromised system without being aware of it. OCT uses the mechanisms of Elastic Secure Infrastructure (ESI) \cite{others2018} and its attestation service to provide an uncompromised system and make it available to an experimenter. Currently, ESI provides this service only for the servers that house the FPGAs in OCT. To ensure that a new user receives an uncompromised FPGA at the start of every new lease of a bare-metal system, we enforce a procedure that is performed automatically when the host server starts up. This procedure makes sure the FPGA is put into a known state and that no information accidentally or intentionally left behind by an earlier user continues to reside on it. An official Xilinx Runtime (XRT) is installed, as is the hardware shell; the shell is only reinstalled if the new user wishes to change the version, but all previous user logic is removed in either case. In addition, the network that the FPGAs directly connect to is isolated to guarantee that a networked FPGA cannot inject spurious packets into a production network.
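As a rough illustration of the final step of this procedure (removing previous user logic), the FPGA's dynamic region can be overwritten with a known-good design. The sketch below assumes the XRT native C++ API and a hypothetical verify.xclbin shipped alongside the deployment shell; reinstalling XRT and the shell itself is handled by separate provisioning tooling.
\begin{verbatim}
// Hedged sketch of one cleanup step run at host startup: overwrite the FPGA's
// dynamic region with a known-good design so that no user logic from a
// previous lease remains loaded. The verify.xclbin path is hypothetical.
#include <exception>
#include <iostream>

#include <xrt/xrt_device.h>

int main() {
    try {
        xrt::device device(0);                         // the leased U280
        device.load_xclbin("/opt/oct/verify.xclbin");  // hypothetical known-good design
        std::cout << "FPGA dynamic region reset to a known state\n";
    } catch (const std::exception &e) {
        std::cerr << "FPGA cleanup failed: " << e.what() << "\n";
        return 1;
    }
    return 0;
}
\end{verbatim}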
Example Application: Machine Learning on FPGAs in the Cloud
We provide sample applications, including the FINN framework for machine learning. FINN \cite{Blott_2018} is developed and maintained by Xilinx Research Labs to explore deep neural network inference on FPGAs. The FINN compiler creates dataflow architectures that can be parallelized across and within the layers of a neural network and transforms the resulting architecture into a bitfile that can be run on FPGA hardware.
With the resources available in OCT, we are particularly interested in implementing network-attached FINN accelerators split across multiple FPGAs for convolutional neural networks such as MobileNet and ResNet, whose partitioning is discussed by Alonso et al. \cite{alonso2021elastic}. Figure 2 shows such an arrangement, in which MobileNet is implemented with three accelerators mapped to two FPGAs. Two of the accelerators function stand-alone, while the third, which contains all the communication required between the FPGAs, is split between two Xilinx U280s. The two halves of this accelerator are connected using the network infrastructure of the UDP stack, which enables communication between them via the switch.