At this point, it is expected that you will have a simple lab setup with some
virtual devices and some simple test boards, each regularly passing
health checks. It is generally a good idea to allow time
for your lab to settle. Run a number of test jobs and understand the
administrative burden before trying to expand your setup.
Once you are happy that things are working and you know how to run that simple
lab, the following suggestions will help you grow it.
The simplest LAVA instance is a single server with a single worker on the same
machine. Adding more devices to such an instance may quickly cause problems
with load. Test jobs may time out on downloads or decompression and devices
will go offline.
The first step in growing a LAVA lab is to add a remote worker. Remote workers
can be added to any V2 master. To do so, use the Django administration interface
to add new devices and device types, and to allocate some or all of the devices
to the newly created remote worker.
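If you prefer to script this kind of check, the XML-RPC API exposed by the
server at /RPC2 can be used to confirm that the new worker and its devices are
visible to the master. The following is only a minimal sketch; the server URL,
username and token are placeholders, and the exact set of scheduler methods
available depends on the LAVA version you are running.

```python
#!/usr/bin/env python3
"""Quick check that a newly added remote worker and its devices are
visible to the master, using the XML-RPC API at /RPC2.
The server URL, username and token below are placeholders."""
import xmlrpc.client

# Authenticated endpoint of the form https://<user>:<token>@<server>/RPC2
server = xmlrpc.client.ServerProxy(
    "https://admin:EXAMPLE-TOKEN@lava.example.com/RPC2")

# List the API methods exposed by this master; the available set depends
# on the LAVA version in use.
methods = server.system.listMethods()

# Only call scheduler methods that this master actually exposes.
if "scheduler.workers.list" in methods:
    print("Workers:", server.scheduler.workers.list())
if "scheduler.all_devices" in methods:
    for device in server.scheduler.all_devices():
        print(device)
```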
As load increases, the master will typically benefit from having fewer and
fewer devices directly attached to the worker running on the master machine.
Complex labs will typically only have devices attached to remote workers.
Depending on the workload and admin preferences, there are several lab layouts
that can make sense:
A single master with a single worker on the same machine is the starting layout
for a fresh installation. Depending on the
capability of the master, this layout can support a small variety of devices
and a small number of users. This layout does not scale well. Adding too many
devices or users to this setup can lead to the highest overall maintenance
burden, per test job, of all the layouts here.
In all of these example diagrams, Infrastructure represents the extra
equipment that might be used alongside the LAVA master and workers, such as
mirrors, caching proxies etc.
A medium to large lab can operate well with a single master controlling
multiple workers, especially if the master is a dedicated server running only
lava-server.
A custom frontend can use custom result handling to aggregate data from multiple separate masters into
a single data set. The different masters can be geographically separated and
run by different admins. This is the system used to great effect by
KernelCI.org.
When different teams need different sets of device types and
configurations and where there is little overlap between the result sets for
each team, a micro-instance layout may make sense.
The original single lab is split into separate networks, each with a separate
complete instance of a LAVA master and one or more workers. This will give each
team their own dedicated micro-instance, but the administrators of the lab can
use common infrastructure just like a single lab in a single location. Each
micro-instance can be grown in a similar way to any other instance, by adding
more devices and more workers.
The optimum configuration will depend massively on the devices and test jobs
that you expect to run. Use the multiple masters, multiple
workers option where all test jobs feed
into a single data set. Use micro-instances where teams have discrete sets of
results. Any combination of micro-instances can still be aggregated behind one
or more custom frontends to get different overviews of the results.
As an example, the Linaro LAVA lab in Cambridge is a hybrid setup. It operates
using a set of micro-instances, some of which provide results to frontends like
KernelCI.org.
LAVA V2 supports geographically separate masters and workers. Workers can be
protected behind a firewall or even a NAT internet connection, without the
need for dynamic DNS or similar services. Connections are made from the
worker to the server, so the only requirement is that the HTTP/HTTPS ports
of the server are open to the internet.
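Because all connections are initiated by the worker, a simple way to validate a
proposed remote location is to check, from a machine at that location, that the
master's HTTP/HTTPS ports are reachable. A minimal sketch, assuming the
placeholder hostname lava.example.com:

```python
#!/usr/bin/env python3
"""Minimal connectivity check to run from the machine that will host a
remote worker: confirm that the master's HTTP/HTTPS ports are reachable
through the firewall or NAT. The hostname is a placeholder."""
import socket

MASTER = "lava.example.com"   # placeholder: public name of your master
PORTS = (80, 443)             # HTTP and HTTPS

for port in PORTS:
    try:
        with socket.create_connection((MASTER, port), timeout=10):
            print(f"{MASTER}:{port} reachable")
    except OSError as exc:
        print(f"{MASTER}:{port} NOT reachable: {exc}")
```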
Physically separating different workers is also possible but has implications:
- Resources need to be mirrored, cached or proxied to multiple locations.
- The administrative burden of a LAVA lab is largely driven by the
devices themselves. LAVA devices often require a range of support tasks
which are unsuitable for generic hosting locations. It is common that a
trained admin will need physical access to test device hardware to fix
problems. The latency involved in getting someone to the location of the
device to change a microSD card, press buttons on a problematic device,
investigate PDU failures and other admin tasks will have a large
impact on the performance of the LAVA lab itself.
- Physical separation across different sites can mean that test writers see
varying performance according to which worker has idle devices at the time.
If one worker has a slower connection to the build system storage, test
writers will need to allow for this in the job submission timeouts, possibly
causing jobs on faster workers to spend longer waiting for a timeout to
expire. A rough way to budget such timeouts is sketched after this list.
- Each location still needs UPS
support, backup support and other common lab infrastructure as laid out
previously.
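As mentioned in the list above, slower links between a worker and the build
system storage need to be reflected in job timeouts. The sketch below is a
back-of-the-envelope calculation only; the image size, per-worker bandwidth
figures and safety factor are illustrative placeholders, not measurements from
any real lab.

```python
#!/usr/bin/env python3
"""Back-of-the-envelope helper for choosing a download timeout that works
on the slowest worker. All figures are illustrative placeholders."""

IMAGE_SIZE_MB = 1024            # size of the largest artefact to download
WORKER_BANDWIDTH_MBPS = {       # sustained throughput to the build storage
    "worker-local": 800,        # same rack as the storage
    "worker-remote": 50,        # remote site behind a slower link
}
SAFETY_FACTOR = 2               # allow for contention and retries

for worker, mbps in WORKER_BANDWIDTH_MBPS.items():
    seconds = IMAGE_SIZE_MB * 8 / mbps * SAFETY_FACTOR
    print(f"{worker}: allow at least {seconds / 60:.1f} minutes for downloads")
```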
Many labs have a separate master and multiple workers with the physical
machines co-located in the same or adjacent racks. This makes it easier to
administer the lab. Sometimes, admins may choose to have the master and one or
more workers in different geographical locations. There are some additional
considerations with such a layout.
Note
One or more LAVA V2 workers will be required in the
remote location. Each worker will need to be permanently connected to all
devices to be supported by that worker. Devices cannot be used in LAVA
without a worker managing the test jobs.
Before considering installing LAVA workers in remote locations, it is
strongly recommended that you read and apply the guidance in the following sections.
Remember that devices need additional, often highly specialized, infrastructure
support alongside the devices. Some of this hardware is used outside the
expected design limits. For example, a typical PDU may be designed to
switch mains AC once or twice a month on each port. In LAVA, that unit will be
expected to switch the same load dozens, maybe hundreds of times per day for
each port. Monitoring and replacing this infrastructure before it fails can
have a significant impact on the ongoing cost of your proposed layout as well
as your expected scheduled downtime.
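To put numbers on this, a rough calculation of how quickly a PDU port reaches
its rated switching lifetime can help when planning monitoring and replacement
budgets. The figures below are illustrative placeholders; substitute the values
from your own lab and the PDU datasheet.

```python
#!/usr/bin/env python3
"""Rough estimate of how quickly LAVA wears out a PDU port compared with
its rated switching lifetime. All figures are illustrative placeholders;
substitute the numbers from your own lab and the PDU datasheet."""

RATED_CYCLES = 50_000        # switching cycles per port from the datasheet
JOBS_PER_DAY = 100           # test jobs on the device behind this port
SWITCHES_PER_JOB = 2         # e.g. power off then on for each job

cycles_per_day = JOBS_PER_DAY * SWITCHES_PER_JOB
days = RATED_CYCLES / cycles_per_day
print(f"{cycles_per_day} switching cycles per day")
print(f"Rated lifetime reached in roughly {days:.0f} days "
      f"({days / 365:.1f} years)")
```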
Caution
A typical datacenter will not have the infrastructure to handle
LAVA devices and is unlikely to provide the kind of prompt physical access
which will be needed by the admins.
The bootloader types used by the devices attached to a worker can have a major
impact on how many devices that worker can support. Some bootloaders are
comparatively lightweight, as they depend on the device pulling files from
the dispatcher during boot via a protocol like TFTP. This type of protocol
tends to be quite forgiving on timing while transferring files. Other
bootloaders (e.g. fastboot) work by pushing files to the device, which is
often much more demanding. Sometimes the data needs to be modified as it is
pushed and it is common that the device receiving the data cares about the
timing of the incoming data. A small delay at an inconvenient point may cause
an unexpected failure. When running multiple tests in parallel, the software
pushing the files may cause problems - it is designed to maximize the speed of
the first transfer at the expense of anything else. This “greedy” model means
that later requests running concurrently may block, thereby causing test jobs
to fail.
For this reason, we recommend that fastboot type devices are restricted to
one CPU core per device (not a hyperthread, a real silicon core). This may
well apply to other bootloaders which require files to be pushed to devices but
has been most clearly shown with fastboot.
Take particular care if the worker is a virtual machine and ensure that the
VM has as many cores as it has fastboot devices.
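A quick sanity check along these lines can be scripted on the worker itself.
The sketch below assumes a placeholder device count and that hyperthreading
doubles the logical CPU count reported by the operating system; adjust both to
match your hardware.

```python
#!/usr/bin/env python3
"""Sanity check for a worker (physical or virtual) that drives fastboot
devices: warn if there are more fastboot devices than real CPU cores.
The device count is a placeholder; os.cpu_count() reports logical CPUs,
so halve it if hyperthreading is enabled."""
import os

FASTBOOT_DEVICES = 4            # placeholder: devices attached to this worker
HYPERTHREADING = True           # set according to the worker hardware

logical = os.cpu_count() or 1
physical = logical // 2 if HYPERTHREADING else logical

if FASTBOOT_DEVICES > physical:
    print(f"Warning: {FASTBOOT_DEVICES} fastboot devices but only "
          f"{physical} physical cores - expect intermittent failures")
else:
    print(f"OK: {physical} physical cores for {FASTBOOT_DEVICES} fastboot devices")
```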
Also be careful if running the master and worker(s) on the same physical
hardware (e.g. running as VMs on the same server). The master also has CPU
requirements: users pulling results over the API or viewing test jobs in a
browser will cause load on the master, and the database can also add more load
as the number of test jobs increases. Try to avoid putting all the workers and
the master onto the same physical hardware. Even if this setup works initially,
unexpected failures can occur later as load increases.
Pay attention to the types of failures observed. If a previously working device
starts to fail in intermittent and unexpected ways, this could be a sign that
the infrastructure supporting that worker is suffering from excess load.
All labs will need scheduled downtime. The layout of your lab will have a
direct impact on how those windows are managed across remote locations.
Maintenance will need to be announced in advance with enough time to allow test
jobs to finish running on the affected worker(s). Individual workers can have
all devices on that worker taken offline without affecting jobs on other
workers or the master. Adding a frontend adds further granularity,
allowing maintenance to occur with less visible interruption.
Use HTTPS for the connections between the server and the workers.
The worker initiates the HTTP connection to the server, so a worker will work
when behind a NAT connection. Only the address of the master needs to be
resolvable using public DNS. There is no need for the master or any other
service to be able to initiate a connection to the worker from outside the
firewall. This means that a public master can work with DUTs in a
remote location by connecting the boards to one or more worker(s) in the same
location.
If the master is behind a firewall, the HTTP/HTTPS ports will need to be open.
It is also worth considering whether it would be easier to administer the
various devices by having a master alongside the worker(s) and then collating
the results from a number of different masters using a frontend.