You need to be familiar with these sections:
LAVA is complex and administering a LAVA instance can be an open-ended task covering a wide range of skills.
At a simple level, LAVA requires a variety of Debian system administration tasks, including:
installing, upgrading and maintaining the installed packages and apt sources
configuring services outside LAVA, including:
apache - LAVA provides an example apache configuration but many instances will need to adapt this for their own hosting requirements.
DHCP - Most devices will need networking support using DHCP.
configuration management - LAVA has a variety of configuration files and a number of other services and tools will also need to be configured, for example serial console services, TFTP services and authentication services.
email - LAVA can use email for notifications, if test writers include appropriate requests in the test job submissions. To send email, LAVA relies on the basic Django email support using a standard sendmail interface. Only the master needs to be configured to send email, notifications from workers are handled via the master.
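One quick way to confirm that outgoing mail works from the master is Django's standard sendtestemail management command, assuming your installed Django version provides it (the recipient address below is only an example):
$ sudo lava-server manage sendtestemail admin@example.org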
The rest of the system needs updates to be applied, especially security updates. If you are upgrading a Python package on an instance already running LAVA, especially if that package is directly listed as a dependency of LAVA, then all LAVA daemons should be restarted. All LAVA daemons are safe to restart without affecting running test jobs; there will be a brief pause in the UI but that is all:
service lava-server-gunicorn restart
service lava-publisher restart
service lava-scheduler restart
service lava-worker restart
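To confirm that the daemons came back up cleanly, a quick status check of each service is usually enough (the same service names as above):
service lava-scheduler status
service lava-worker status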
Note
This applies to workers as well as masters, but the lava-worker daemon has only minimal dependencies. Most of the work is done by lava-run, which gets a new process at the start of each test job. It is NOT possible to restart lava-run - any affected test jobs will need to be resubmitted, but this is considered unlikely.
LAVA instances will need some level of supporting infrastructure, including reliable networking, remote power control and serial console services for the devices under test.
Many instances will also require specialized hardware to assist with the automation of specific devices, including switchable USB hubs or specialized relay boards.
These rules may seem harsh, obvious or tedious. However, many people have skipped one or more of these requirements and learned the hard way that following these steps can dramatically improve your experience of LAVA. Everyone setting up LAVA is strongly advised to follow all of these rules.
Start with a minimal LAVA install with at most one or two devices - at this stage only QEMU devices should be considered. This provides the best platform for learning LAVA, before learning how to administer a LAVA instance.
Use the worked examples in the documentation which refer back to standard builds and proven test jobs. There will be enough to do in becoming familiar with how to fix problems and issues local to your own instance without adding the complexity of devices or kernel builds to which only you have access.
Avoid rushing to your custom device - device integration into any automated system is hard. It does not become any easier if you are trying to learn how to use the automation as well.
Plan how to test deploy actions and boot actions to be able to produce reliable results.
Have at least one test instance. A single instance of LAVA is never sufficient for any important testing. Everyone needs at least one test instance in a VM or on another machine to have confidence that administrative changes will not interfere with test jobs.
Control your changes - configuration, test job definitions, test shell definitions, device dictionaries, template changes and any code changes - all need to be in version control.
Control access to the dispatcher and devices - device configuration details like the connection command and remote power commands can be viewed by all users who are able to submit to that device. In many cases, these details are sufficient to allow anyone with the necessary access to administer those devices, including modifying bootloader configuration. Only administrators should have access to any machine which itself has access to the serial console server and/or remote power control services. Typically, this will be controlled using SSH keys.
Subscribe to the Mailing lists where you will find others who have set up their own LAVA instances. IRC is fine for quick queries but it is trivial to lose track of previous comments, examples and links when the channel gets busy. Mailing lists have public archives which are fully indexed by search engines. The archives will help you solve your current issue and help many others find answers for their own issues later.
There are a number of common fallacies relating to automation. Check your test ideas against these before starting to make your plans:
Seems simple enough - it doesn’t seem as if you need to deploy a new kernel or rootfs every time, no need to power off or reboot between tests. Just connect and run stuff. After all, you already have a way to manually deploy stuff to the board.
This is an over-simplification which will lead to new and unusual bugs and is only a short step on from connect & test, with many of the same problems. A core strength of LAVA is demonstrating differences between types of devices by controlling the boot process. By the time the system has booted to the point where sshd is running, many of those differences have been swallowed up in the boot process.
ssh can be useful within LAVA tests but using ssh to the exclusion of serial means that the boot process is hidden from the logs, including any errors and warnings. If the boot process results in a system which cannot start sshd or cannot expose ssh over the network, the admin has no way to determine the cause of the failure. If the userspace tests fail, the test writer cannot be sure that the boot process was not a partial cause of the failure, as the boot process messages are not visible. This leads to test writers repeatedly submitting the same jobs and wasting a lot of time in triage because critical information is hidden by the choice of using ssh instead of serial.
Using ssh without a boot process at all has all the same problems as Connect and test.
Limiting all your tests to userspace without changing the running kernel is not making the best use of LAVA. LAVA has a steep learning curve, but trying to cut corners won't help you in the long run. If you see ssh as a shortcut, your use case is probably better served by a different tool which does not control the boot process, for example tools based on containers and virtual machines.
Note
Using serial also requires some level of automated power control. The connection is made first, then power is applied and there is no allowance for manual intervention in applying power. LAVA is designed as a fully automated system where test jobs can run reliably without any manual operations.
You’ve built an entire system and now you put the entire thing onto the device and do all the tests at the same time. There are numerous problems with this approach:
This may be true; however, automation puts extra demands on what those builds are capable of supporting. When testing manually, there are any number of times when a human will decide that something needs to be entered, tweaked, modified, removed or ignored - decisions which the automated system cannot make for itself. Examples include:
/etc/resolv.conf - it is common for many build tools to generate or copy a working /etc/resolv.conf based on the system within which the build tool is executed. This is a frequent cause of test jobs failing because they are unable to look up web addresses using DNS. It is also common for an automated system to be in a different network subnet to the build tool, again causing the test job to be unable to use DNS due to the wrong data in /etc/resolv.conf. (A couple of quick diagnostic checks are sketched after this list.)
Make use of the standard files for known working device types. These files come with details of how to rebuild the files, logs of each build and checksums to verify that the download is correct.
It is not possible to automate every test method. Some kinds of tests and some kinds of devices lack critical elements that do not work well with automation. These are not problems in LAVA, these are design limitations of the kind of test and the device itself. Your preferred test plan may be infeasible to automate and some level of compromise will be required.
This will come back to bite! However, there are other ways in which this can occur even after administrators have restricted users to limited access. Test jobs (including hacking sessions) have full access to the device as root. Users, therefore, can modify the device during a test job and it depends on the device hardware support and device configuration as to what may happen next. Some devices store bootloader configuration in files which are accessible from userspace after boot. Some devices lack a management interface that can intervene when a device fails to boot. Put these two together and admins can face a situation where a test job has corrupted, overridden or modified the bootloader configuration such that the device no longer boots without intervention. Some operating systems require a debug setting to be enabled before the device will be visible to the automation (e.g. the Android Debug Bridge). It is trivial for a user to mistakenly deploy a default or production system which does not have this modification.
Administrators need to be mindful of the situations from which users can (mistakenly or otherwise) modify the device configuration such that the device is unable to boot without intervention when the next job starts. This is one of the key reasons for health checks to run sufficiently often that the impact on other users is minimized.
The ongoing roles of administrators include:
monitor the number of devices which are online
identify the reasons for health check failures
communicate with users when a test job has made the device unbootable (i.e. bricked)
recover devices which have gone offline
restrict command line access to the dispatcher(s) and device(s) to only other administrators. This includes access to the serial console server and the remote power control service. Ideally, users must not have any access to the same subnet as the dispatchers and devices, except for the purposes of accessing devices during LAVA Hacking Sessions. This may involve port forwarding or firewall configuration and is not part of the LAVA software support.
keep the instance at a sufficiently high level of reliability that Continuous Integration produces results which are themselves reliable and useful to the developers. To deliver this reliability, administrators sometimes need to prevent users from making mistakes which are likely to take devices offline.
prepare and routinely test backups and disaster recovery support. Many lab admin teams use salt or ansible or other configuration management software. Always ensure you have a fast way of deploying a replacement worker or master in case of hardware failure.
See also
Creating Backups for details of what to back up and test.
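A minimal sketch of the idea, assuming the default lavaserver PostgreSQL database name and the standard configuration paths (adapt to your instance and always test the restore as well as the backup):
$ sudo -u postgres pg_dump lavaserver > lavaserver-$(date +%F).sql
$ sudo tar czf lava-etc-$(date +%F).tar.gz /etc/lava-server /etc/lava-dispatcher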
When you come across problems with your LAVA instance, there are some basic information sources, methods and tools which will help you identify the problem(s).
Administrators may be asked to help with debugging test jobs or may need to use test jobs to investigate some administration problems, especially health checks.
Some failure comments in test jobs are directly related to administrative problems.
If the device dictionary contains errors, it is possible that the test job is trying to turn on power to or read serial input from the wrong ports. This will show up as a timeout when trying to connect to the device.
Note
Either the PDU command or the connection command could be wrong. If the device previously operated normally, check the details of the power on and connection commands in previous jobs. Also, try running the power on command followed by the connection command manually (as root) on the relevant worker, as sketched below.
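For example, on a worker using lavapdu and ser2net, the manual check might look like this (the daemon host, PDU hostname, PDU port and telnet port are hypothetical - use the values from the device dictionary for the device in question):
$ pduclient --daemon localhost --hostname pdu05 --port 07 --command reboot
$ telnet localhost 7007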
Check that the PDU is actually applying power when reporting ON and switching off power when reporting OFF. It is possible for individual relays in a PDU to fail, reporting a certain state but failing to switch the relay when the state is reported as changing. Once a PDU starts to fail in this way, the PDU should be replaced as other ports may soon fail in the same manner. (Checking the light or LED on the PDU port may be insufficient. Try connecting a fail safe device to the port, like a desk light. This may also indicate whether the board itself has a hardware problem.)
If the connection is refused, it is possible that the device node does not (yet) exist on the worker, e.g. check the ser2net configuration and the specified device node for the port being used.
Check whether the device needs specialized support to avoid issues with power reset buttons or other hardware modes where the device does not start to boot as soon as power is applied. Check that any such support is actually working.
Dispatcher unable to meet job compatibility requirement.
The master uses the lava-dispatcher code on the server to calculate a compatibility number - the highest integer in the strategy classes used for that job. The worker also calculates the number and, unless the two match, the job is failed.
The compatibility check allows the master to detect a worker running older software so that the job can fail early. Compatibility is changed when existing support is removed, rather than when new code is added. Admins remain responsible for ensuring that, if a new device needs new functionality, the worker is running updated code.
See also
Missing methods and Python traceback messages. See also the developer documentation for more information on how developers set the compatibility for test jobs.
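One quick check when a compatibility failure appears, assuming the software was installed from Debian packages, is to compare the installed lava-dispatcher version on the master and on the affected worker:
$ dpkg-query -W lava-dispatcher    # run on the master
$ dpkg-query -W lava-dispatcher    # run on the worker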
Check the contents of /etc/lava-coordinator/lava-coordinator.conf on the worker. If you have multiple workers, all workers must have coordinator configuration pointing at a single lava-coordinator which serves all workers on that instance (you can also have one coordinator for multiple instances).
Check the output of the lava-coordinator logs in /var/log/lava-coordinator.log.
Run the status check script provided by lava-coordinator:
$ /usr/share/lava-coordinator/status.py
status check complete. No errors
Use the example test jobs to distinguish between administration errors and test job errors. Simplify and make your test conditions portable. MultiNode is necessarily complex and can be hard to debug.
LAVA uses Jinja2 to allow devices to be configured using common data blocks, inheritance and the device-specific device dictionary. Templates are developed as part of lava-server with supporting unit tests:
lava-server/lava_scheduler_app/tests/device-types/
Building a new package using the developer scripts will cause the updated templates to be installed into:
/etc/lava-server/dispatcher-config/device-types/
The Jinja2 templates support conditional logic, iteration and default arguments and are considered part of the codebase of lava-server. Changing the templates can adversely affect other test jobs on the instance. All changes should be made first as a developer. New templates should be accompanied by new unit tests for that template.
Note
Although these are configuration files and package updates will respect any changes you make, please talk to us about changes to existing templates maintained within the lava-server package.
lava-scheduler - controls how all devices are assigned:
/var/log/lava-server/lava-scheduler.log
lava-worker - controls the operation of the test job on the worker. Includes details of the test results recorded and job exit codes. Logs are created on the worker:
/var/log/lava-dispatcher/lava-worker.log
apache - includes XML-RPC logs:
/var/log/apache2/lava-server.log
gunicorn - details of the WSGI operation for django:
/var/log/lava-server/gunicorn.log
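To watch scheduling decisions or worker activity while a job is running, following the relevant logs is often the quickest approach, for example:
$ sudo tail -f /var/log/lava-server/lava-scheduler.log
$ sudo tail -f /var/log/lava-dispatcher/lava-worker.log    # on the worker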
slave logs are transmitted to the master; temporary files used by the test job are deleted when the test job ends.
job validation - the master retains the output from the validation of the test job performed by the slave. The logs are stored on the master as the lavaserver user - so for job ID 4321:
$ sudo su lavaserver
$ ls /var/lib/lava-server/default/media/job-output/job-4321/description.yaml
other test job data - also stored in the same location on the master are the complete log file (output.yaml) and the logs for each specific action within the job, in a directory tree below the pipeline directory.
lava-coordinator.conf - /etc/lava-coordinator/lava-coordinator.conf contains the lookup information for workers to find the lava-coordinator for MultiNode test jobs. Each worker must share a single lava-coordinator with all other workers attached to the same instance. Instances may share a lava-coordinator with other instances or can choose to have one each, depending on expected load and maintenance priorities. The lava-coordinator daemon itself does not need to be installed on a master but that is the typical way to use the coordinator.
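The file is a small JSON document. The values below are only an illustration (check the default shipped with the package), but the coordinator_hostname on every worker must point at the shared coordinator:
$ cat /etc/lava-coordinator/lava-coordinator.conf
{
  "port": 3079,
  "blocksize": 4096,
  "poll_delay": 3,
  "coordinator_hostname": "lava-master.example.org"
}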
Caution
Restarting lava-coordinator will cause errors for any running MultiNode test job. However, changes to /etc/lava-coordinator/lava-coordinator.conf on a worker can be made without needing to restart the lava-coordinator daemon itself.
Files and directories in /etc/lava-dispatcher/
:
Files and directories in /etc/lava-server/
:
dispatcher.d - worker-specific configuration. Files in this directory need to be created by the admin and have a filename which matches the reported hostname of the worker in /var/log/lava-server/lava-master.log. (A hypothetical example is sketched after this list.)
dispatcher-config - contains V2 device configuration, including Device type templates and V2 health checks.
env.yaml - Configures the environment that will be used by the server and the dispatcher. This can be used to modify environment variables to support a proxy or other lab-specific requirements. The file is part of the lava-server package and contains comments on how changes can be made.
instance.conf - Local database configuration for the master. This file is managed by the package installation process.
lava-server-gunicorn.service - example file for a systemd service to run lava-server-gunicorn instead of letting systemd generate a service file from the sysvinit support included in the package.
secret_key.conf - This key is used by Django to ensure the security of various cookies and one-time values. To learn more please visit: https://docs.djangoproject.com/en/1.8/ref/settings/#secret-key.
settings.conf - Instance-specific settings used by Django and lava-server including authentication backends, branding support and event notifications.
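As a purely hypothetical sketch of a worker-specific file in dispatcher.d (the filename worker01.yaml and the dispatcher_ip setting are assumptions for illustration - dispatcher_ip tells devices which IP address to use to reach that worker; check the documentation of your version for the settings it actually supports):
$ cat /etc/lava-server/dispatcher.d/worker01.yaml
dispatcher_ip: 10.0.0.2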
Some device configuration can be overridden without making changes to the Jinja2 Templates. This does require some understanding of how template engines like jinja2 operate.
To identify which variables can be overridden, check the template for placeholders. A commonly set value for QEMU device types is the amount of memory (on the dispatcher) which QEMU will be allowed to use for each test job:
- -m {{ memory|default(512) }}
Most administrators will need to set the memory constraint in the device dictionary so that test jobs cannot allocate all the available memory and cause the dispatcher to struggle to provide services to other test jobs. An example device dictionary to override the default (and also prevent test jobs from setting a different value) would be:
{% extends 'qemu.jinja2' %}
{% set memory = 1024 %}
Admins need to balance the memory constraint against the number of other devices on the same dispatcher. There are occasions when multiple test jobs can start at the same time, so admins may also want to limit the number of emulated devices on any one dispatcher to the number of cores on that dispatcher, and set the amount of memory so that some memory remains available for the system itself even with all devices in use.
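For example (purely illustrative figures): a dispatcher with 8 cores and 16 GB of RAM might be limited to 6 QEMU devices with memory set to 2048, so that roughly 4 GB remains available for the host system even when all 6 devices are busy at once.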
Most administrators will not set the arch variable of a QEMU device, so that test writers can use the one device to run test jobs for a variety of architectures by setting the architecture in the job context. The QEMU template has conditional logic for this support:
{% if arch == 'arm64' or arch == 'aarch64' %}
qemu-system-aarch64
{% elif arch == 'arm' %}
qemu-system-arm
{% elif arch == 'amd64' %}
qemu-system-x86_64
{% elif arch == 'i386' %}
qemu-system-x86
{% endif %}
Note
Limiting QEMU to specific architectures on dispatchers which are not able to safely emulate an x86_64 machine due to limited memory or number of cores is an advanced admin task. Device tags will be needed to ensure that test jobs are properly scheduled.
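If a dispatcher does need to be restricted, the device dictionary can pin the architecture in the same way as the memory example above. This is only a sketch; the variable names follow the qemu.jinja2 template shown earlier:
{% extends 'qemu.jinja2' %}
{% set arch = 'amd64' %}
{% set memory = 1024 %}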
The dispatcher uses a variety of constants and some of these can be overridden in the device configuration.
A common override, used when operating devices on your desk or when a PDU is not available, allows the dispatcher to recognize a soft reboot. Another example is setting the kernel start message that LAVA will recognize during boot.
This uses the shutdown_message and boot_message keys in the constants section of the device config:
{% extends 'my-device.jinja2' %}
{% set shutdown_message = "reboot: Restarting system" %}
{% set boot_message = "Booting Linux" %}
Some of the constants can also be overridden in the test job definition. For example, the same shutdown-message parameter is supported in the u-boot boot action:
- boot:
method: u-boot
commands: ramdisk
parameters:
shutdown-message: "reboot: Restarting system"
prompts:
- 'linaro-test'
timeout:
minutes: 2
Note
If you are considering using MultiNode in your Test Plan, now is the time to ensure that MultiNode jobs can run successfully on your instance.
Once you have a couple of QEMU devices running and you are happy with how to maintain, debug and test using those devices, start adding known working devices. These are devices which already have templates in:
/etc/lava-server/dispatcher-config/device-types/
The majority of the known device types are low-cost ARM developer boards which are readily available. Even if you are not going to use these boards for your main testing, you are recommended to obtain a couple of these devices as these will make it substantially easier to learn how to administer LAVA for any devices other than emulators.
Physical hardware like these dev boards brings extra requirements, such as remote power control, serial console connections and supporting network infrastructure.
Understanding how all of those bits fit together to make a functioning LAVA instance is much easier when you use devices which are known to work in LAVA.
Early administration tasks:
Users and groups can be added and modified in the Django administration interface or from the command line.
Newly created users will need permission to submit test jobs. This can be done by adding the user to a group which already has the Can cancel or resubmit test jobs permission or by adding this permission for each individual user.
Local Django user accounts can be created with the manage users command:
$ sudo lava-server manage users add <username> --passwd <password>
If --passwd is omitted, a random password is generated and output by the script.
See $ sudo lava-server manage users add --help for more information and available options.
If LDAP support is configured (see Configuring user authentication), users can be added directly from LDAP, retaining the configured LDAP password and email address:
$ sudo lava-server manage addldapuser --username {username}
Local Django groups can be created with the manage groups command:
$ sudo lava-server manage groups add <name>
See $ sudo lava-server manage groups add --help or $ sudo lava-server manage groups update --help for more information and available options.
A device can be linked to two kinds of users or groups: the device owner, and the user or group with physical access to the device. Only one user or one group can be set as the owner or for physical access at any one time.
Devices can be modified in the Django administration interface or from the command line. An existing user can be listed as the owner or the user with physical access to a specified device which must already exist:
$ sudo lava-server manage devices update {hostname} --owner {username}
$ sudo lava-server manage devices update {hostname} --physical-user {username}
Once at least one group has been created, the owner and physical access details can also be set as groups:
$ sudo lava-server manage devices update {hostname} --group {group_name}
$ sudo lava-server manage devices update {hostname} --physical-group {group_name}