This is the design document for a revamped automatic testing framework. The revamp aims at replacing the current tinderbox based testing by a new system that is written from scratch.
The old system is not easy to work with and was never meant to be used for managing tests; after all, it is just a simple build manager tailored for continuous building. Modifying the existing tinderbox system to do what we want would require fundamental changes that would render it useless as a build manager, so it would end up as a fork. The amount of work required would probably be about the same as writing a new system from scratch. Other considerations, such as the license of the tinderbox system (MPL) and the language it is realized in (Perl), are also in favor of doing it from scratch.
The language envisioned for the new automatic testing framework is Python. This is for several reasons:
- The VirtualBox API has Python bindings.
- Python is used quite a bit inside Sun (dunno about Oracle).
- Works relatively well with Apache for the server side bits.
- It is more difficult to produce write-only code in Python (alias the we-don't-like-perl argument).
- You don't need to compile stuff.
Note that the author of this document has no special training as a test engineer and may therefore be using the wrong terms here and there. The primary focus is to express what we need to do in order to improve testing.
This document is written in reStructuredText (rst) which just happens to be used by Python, the primary language for this revamp. For more information on reStructuredText: http://docutils.sourceforge.net/rst.html
See also http://encyclopedia2.thefreedictionary.com/testing%20types and http://www.aptest.com/glossary.html .
- A scalable test manager (>200 testboxes).
- Optimize the web user interface (WUI) for typical workflows and analysis.
- Efficient and flexible test configuration.
- Import test results from other test systems (logo testing, VDI, ++).
- Easy to add lots of new testscripts.
- Run tests locally without a manager.
- Revamp a bit at a time.
Each testbox has a unique name corresponding to its DNS zone entry. When booted a testbox script is started automatically. This script will query the test manager for orders and execute them. The core order downloads and executes a test driver with parameters (configuration) from the server. The test driver does all the necessary work for executing the test. In a typical VirtualBox test this means picking a build, installing it, configuring VMs, running the test VMs, collecting the results, submitting them to the server, and finally cleaning up afterwards.
The testbox environment which the test drivers are executed in will have a number of environment variables for determining location of the source images and other test data, scratch space, test set id, server URL, and so on and so forth.
On startup, the testbox script will look for crash dumps and similar on systems where this is possible. If any sign of a crash is found, it will put any dumps and reports in the upload directory and inform the test manager before reporting for duty. In order to generate the proper file names and report the crash in the right test set as well as prevent reporting crashes unrelated to automatic testing, the testbox script will keep information (test set id, ++) in a separate scratch directory (${TESTBOX_PATH_SCRATCH}/../testbox) and make sure it is synced to the disk (both files and directories).
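A minimal sketch of how such bookkeeping could be written so that both the file and its directory entry reach the disk. The file name testbox-state.json, the JSON format and the function name are assumptions made for illustration; the design only requires that the information is synced.

```python
import json
import os


def write_state_durably(state_dir, state):
    """Write `state` (a dict) into state_dir and sync file and directory.

    The file name below is hypothetical; the design does not name it.
    """
    os.makedirs(state_dir, exist_ok=True)
    path = os.path.join(state_dir, "testbox-state.json")
    tmp_path = path + ".tmp"
    with open(tmp_path, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())       # force the file contents to disk
    os.replace(tmp_path, path)     # rename over any older state file
    if os.name == "posix":
        # Sync the directory so the rename itself survives a crash.
        # Directories cannot be opened this way on Windows, so skip it there.
        fd = os.open(state_dir, os.O_RDONLY)
        try:
            os.fsync(fd)
        finally:
            os.close(fd)
    return path
```

Writing to a temporary name and renaming means a crash in the middle of the write leaves either the old state or the new one, never a half-written file.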
After checking for crashes, the testbox script will clean up any previous test which might be around. This involves first invoking the test script in cleanup mode and then wiping the scratch space.
When reporting for duty the script will submit information about the host: OS name, OS version, OS bitness, CPU vendor, total number of cores, VT-x support, AMD-V support, amount of memory, amount of scratch space, and anything else that can be found useful for scheduling tests or filtering test configurations.
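As a rough illustration, the portable part of this information can be gathered with the Python standard library alone. The dictionary keys and the function name are made up for this sketch. CPU vendor, VT-x/AMD-V support and memory size need platform specific code and are left out.

```python
import os
import platform
import shutil


def collect_host_info(scratch_path):
    """Collect the host facts that the standard library can provide."""
    return {
        "os_name": platform.system(),       # e.g. "Linux", "Windows", "Darwin"
        "os_version": platform.release(),
        # Note: this is the bitness of the Python interpreter, which is
        # only an approximation of the OS bitness.
        "os_bitness": platform.architecture()[0],
        "cpu_arch": platform.machine(),
        "cpu_count": os.cpu_count(),
        "scratch_free_mb": shutil.disk_usage(scratch_path).free // (1024 * 1024),
    }
```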
The orders are kept in a queue on the server and the testbox script will fetch them one by one. Orders that cannot be executed at the moment will be masked in the query from the testbox.
The testbox will not provide the typical unix /bin and /usr/bin utilities. In other words, cygwin will not be used on Windows!
The testbox will provide the unixy utilities that ship with kBuild and possibly some additional ones from tools/./bin in the VirtualBox tree (wget, unzip, zip, and so on). The test drivers will avoid invoking any of these utilities directly and instead rely on generic utility methods in the test driver framework. That way we can more easily reimplement the functionality of the core utilities and drop the dependency on them. It also allows us to quickly work around platform specific oddities and bugs.
The test drivers are programs that will do the actual testing. In addition to running under the testbox script, they can be executed in the VirtualBox development environment. This is important for bug analysis and for simplifying local testing by the developers before committing changes. It also means the test drivers can be developed locally in the VirtualBox development environment.
The main difference between executing a driver under the testbox script and running it manually is that there is no test manager in the latter case. The test result reporter will not talk to the server, but report things to a local log file and/or standard out/err. When invoked manually, all the necessary arguments will need to be specified by hand of course - it should be possible to extract them from a test set as well.
For the early implementation stages, an implementation of the reporter interface that talks to the tinderbox based test manager will be needed. This will be dropped later on when a new test manager is ready.
As hinted at in other sections, there will be a common framework (libraries/packages/classes) for taking care of the tedious bits that every test driver needs to do. Sharing code is essential to easing test driver development as well as reducing their complexity. The framework will contain:
- A generic way of submitting output. This will be a generic interface with multiple implementations; the TESTBOX_REPORTER environment variable will decide which of them to use. The interface will have very specific methods to allow the reporter to do the best possible job of reporting the results to the test manager.
- Helpers for typical tasks, like:
- Copying files.
- Deleting files, directory trees and scratch space.
- Unzipping files.
- Creating ISOs
- And such things.
- Helpers for installing and uninstalling VirtualBox.
- Helpers for defining VMs. (The VBox API where available.)
- Helpers for controlling VMs. (The VBox API where available.)
The VirtualBox bits will be separate from the more generic ones, simply because this is cleaner and it will allow us to reuse the system for testing other products.
The framework will be packaged in a zip file other than the test driver so we don't waste time and space downloading the same common code.
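The reporter selection described in the framework list could look something like the sketch below. Only the TESTBOX_REPORTER variable comes from this document. The class names, the method names and the values "local" and "remote" are assumptions, and the remote reporter here just records calls instead of talking to a server.

```python
import os
import sys
from abc import ABC, abstractmethod


class ReporterBase(ABC):
    """Hypothetical reporter interface with a few specific methods."""

    @abstractmethod
    def test_start(self, name):
        """Report that the (sub)test called name has started."""

    @abstractmethod
    def test_done(self, name, passed):
        """Report the outcome of the (sub)test called name."""


class LocalReporter(ReporterBase):
    """Reports to a local stream, for running without a test manager."""

    def __init__(self, stream=None):
        self._stream = stream if stream is not None else sys.stdout

    def test_start(self, name):
        self._stream.write("START %s\n" % name)

    def test_done(self, name, passed):
        self._stream.write("%s %s\n" % ("PASSED" if passed else "FAILED", name))


class RemoteReporter(ReporterBase):
    """Placeholder: a real one would submit records to the test manager."""

    def __init__(self):
        self.records = []

    def test_start(self, name):
        self.records.append(("start", name))

    def test_done(self, name, passed):
        self.records.append(("done", name, passed))


_REPORTERS = {"local": LocalReporter, "remote": RemoteReporter}


def create_reporter(environ=None):
    """Pick the reporter implementation named by TESTBOX_REPORTER."""
    environ = os.environ if environ is None else environ
    kind = environ.get("TESTBOX_REPORTER", "local")
    if kind not in _REPORTERS:
        raise ValueError("unknown TESTBOX_REPORTER value: %r" % kind)
    return _REPORTERS[kind]()
```

Defaulting to the local reporter keeps manual runs in the development environment working without any environment setup.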
The test driver will poll for the file ${TESTBOX_PATH_SCRIPTS}/test-driver-abort and abort all testing when it sees it.
The test driver can be invoked in three modes: execute, help and cleanup. The default is execute mode, help shows a configuration summary, and cleanup is for cleaning up after a reboot or an aborted run. The latter is done by the testbox script on startup and after abort; the driver is expected to clean up by itself after a normal run.
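A small sketch of the mode selection and the abort check. How the mode is passed on the command line is not specified in this document, so the positional argument used here is an assumption, as are the function names. The abort file name and the TESTBOX_PATH_SCRIPTS variable are taken from the text above.

```python
import argparse
import os


def parse_mode(argv):
    """Return 'execute', 'help' or 'cleanup'; execute is the default."""
    # add_help=False because 'help' is one of the driver's own modes.
    parser = argparse.ArgumentParser(prog="testdriver", add_help=False)
    parser.add_argument("mode", nargs="?", default="execute",
                        choices=("execute", "help", "cleanup"))
    return parser.parse_args(argv).mode


def abort_requested(environ=None):
    """Check whether the testbox script has asked the driver to abort."""
    environ = os.environ if environ is None else environ
    scripts_dir = environ.get("TESTBOX_PATH_SCRIPTS")
    if not scripts_dir:
        return False  # running manually, no testbox script around
    return os.path.exists(os.path.join(scripts_dir, "test-driver-abort"))
```

A driver would call abort_requested() between test steps and stop testing as soon as it returns True.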
The server side will be implemented using a webserver (apache), a database (postgres) and cgi scripts (Python). In addition, a cron job (Python) running once a minute will generate static html for frequently used pages and maybe execute some other tasks for driving the testing forwards. The order queries from the testbox script are the primary driving force in the system. Together these make up the test manager.
The test manager can be split up into three rough parts:
- Configuration (of tests, testgroups and testboxes).
- Execution (of tests, collecting and organizing the output).
- Analysis (of test output, mostly about presentation).
List of requirements:
- Two level testing - L1 quick smoke tests and L2 longer tests performed on builds passing L1. (Klaus (IIRC) meant this could be realized using test dependency.)
- Black listing builds (by revision or similar) known to be bad.
- Distinguish between build types so we can do a portion of the testing with strict builds.
- Easy to re-configure build source for testing different branch or for testing a release candidate. (Directory based is fine.)
- Useful to be able to partition testboxes (run specific builds on some boxes, let an engineer have a few boxes for a while).
- Interaction with ILOM/...: reset systems.
- Be able to suspend testing on selected testboxes when doing maintenance (where automatically resuming testing on reboot is undesired) or similar activity.
- Abort testing on selected testboxes.
- Scheduling of tests requiring more than one testbox.
- Scheduling of tests that cannot be executed concurrently on several machines because of some global resource like an iSCSI target.
- Jump the scheduling queue. Scheduling of specified test the next time a testbox is available (optionally specifying which testbox to schedule it on).
- Configure tests with variable configuration to get better coverage. Two modes:
- TM generates the permutations based on one or more sets of test script arguments.
- Each configuration permutation is specified manually.
- Test specification needs to be flexible (select tests, disable test, test scheduling (run certain tests nightly), ... ).
- Test scheduling by hour+weekday and by priority.
- Test dependencies (test A depends on test B being successful).
- Historize all configuration data, in particular test configs (permutations included) and testboxes.
- Test sets have at a minimum a build reference, a testbox reference and a primary log associated with them.
- Test sets store further results as a recursive collection of:
- hierarchical subtest name (slash sep)
- test parameters / config
- bool fail/succ
- attributes (typed?)
- test time
- e.g. throughput
- subresults
- log
- screenshots, video,...
- The test sets database structure needs to be designed such that data mining can be done in an efficient manner.
- Presentation/analysis: graphs!, categorize bugs, columns reorganizing grouped by test (hierarchical), overviews, result for last day.
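The recursive result collection listed above (hierarchical subtest names, pass/fail flag, values such as throughput, logs, subresults) could be modelled roughly as follows. The class and field names are hypothetical; the eventual database layout may look quite different.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional


@dataclass
class ResultNode:
    """One node in the result tree of a test set (illustrative only)."""
    name: str
    passed: Optional[bool] = None      # None: no verdict of its own
    elapsed_sec: float = 0.0           # test time
    values: Dict[str, float] = field(default_factory=dict)   # e.g. throughput
    log: List[str] = field(default_factory=list)
    subresults: List["ResultNode"] = field(default_factory=list)

    def succeeded(self):
        """A node fails if it or any of its subresults failed."""
        own = self.passed is not False
        return own and all(sub.succeeded() for sub in self.subresults)

    def walk(self, prefix=""):
        """Yield (slash separated name, node) pairs, parents first."""
        path = prefix + "/" + self.name if prefix else self.name
        yield path, self
        for sub in self.subresults:
            yield from sub.walk(path)
```

For example, a node "tcp" under "net" under "install" is reported by walk() under the name "install/net/tcp", and a failure there makes succeeded() return False for both of its ancestors.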
Configuration of testboxes doesn't normally involve much work. A testbox is added manually to the test manager by entering the DNS entry and/or IP address (the test manager resolves the missing one when necessary) as well as the system UUID (when obtainable - should be displayed by the testbox script installer). Queries from unregistered testboxes will be declined as a kind of security measure; the incident should be logged in the webserver log if possible. In later dealings with the client the system UUID will be the key identifier. It's permissible for the IP address to change when the testbox isn't online, but not while testing (just imagine live migration tests and network tests). Ideally, the testboxes should not change IP address.
The testbox edit function must allow changing the name and system UUID.
One further idea for the testbox configuration is indicating what the testboxes are capable of, so that tests and test configurations that won't work on a testbox can be filtered out. To exemplify this, take the ACP2 installation test. If the test manager does not make sure the testbox has VT-x or AMD-V capabilities, the test is surely going to fail. Other testbox capabilities would be total number of CPU cores, memory size and scratch space. These testbox capabilities should be collected automatically on bootup by the testbox script together with OS name, OS version and OS bitness.
A final thought, instead of outright declining all requests from new testboxes, we could record the unregistered testboxes with ip, UUID, name, os info and capabilities but mark them as inactive. The test operator can then activate them on an activation page or edit the testbox or something.
We use the term testcase for a test.
Testcases are organized into groups. A testcase can be member of more than one group. The testcase gets a priority assigned to it in connection with the group membership.
Testgroups are picked up by a testbox partition (aka scheduling group), and a priority, scheduling time restriction and dependencies on other test groups are associated with the assignment. A testgroup can be used by several testbox partitions.
(This used to be called 'testsuites' but was renamed to avoid confusion with the VBox Test Suite.)
The initial scheduler will be modelled after what we're already doing in the tinderbox driven testing. It's best described as a best effort continuous integration scheduler, meaning it will always use the latest build suitable for a testcase. It will schedule on a testcase level, using the combined priority of the testcase in the test group and the test group with the testbox partition, trying to spread the test case argument variation out accordingly over the whole scheduling queue. Which argument variation to start with is not defined (random would be best).
Later, we may add other schedulers as needed.
First a general warning:
The guys working on this design are not database experts, web programming experts or similar; rather, we are low level guys whose main job is x86 & AMD64 virtualization. So, please don't be too hard on us. :-)
A logical table layout can be found in TestManagerDatabaseMap.png (created by Oracle SQL Data Modeler, stored in TestManagerDatabase.dmd). The physical database layout can be found in the TestManagerDatabaseInit.pgsql PostgreSQL script. The script is commented.
We need to somehow track configuration changes over time. We also need to be able to query the exact configuration a test set was run with so we can understand and make better use of the results.
There are different techniques for achieving this, one is tuple-versioning ( http://en.wikipedia.org/wiki/Tuple-versioning ), another is log trigger ( http://en.wikipedia.org/wiki/Log_trigger ). We use tuple-versioning in this database, with 'effective' as start date field name and 'expire' as the end (exclusive).
Tuple-versioning has a shortcoming with respect to keys, both primary and foreign. The primary key of a table employing tuple-versioning is really 'id' + 'valid_period', where the latter is expressed using two fields ([effective...expire-1]). Only, how do you tell the database engine that it should not allow overlapping valid_periods? Useful suggestions are welcomed. :-)
Foreign key references to a table using tuple-versioning run into trouble because of the time axis and because, to our knowledge, foreign keys must reference exactly one row in the other table. When time is involved, what we wish to tell the database is that at any given time there is exactly one row we want to match in the other table, only we've no idea how to express this. So, many foreign keys are not expressed in the SQL of this database.
In some cases, we extend the tuple-versioning with a generation ID so that normal foreign key referencing can be used. We only use this for recording (references in testset) and scheduling (schedqueue), as using it more widely would force updates (gen_id changes) to propagate into all related tables.
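To make the tuple-versioning scheme concrete, here is a toy version using Python's sqlite3 module instead of PostgreSQL. The column names are borrowed from the queries in this document. Integer timestamps and the INFINITY constant are simplifications standing in for real timestamps and PostgreSQL's 'infinity', and the helper function names are made up.

```python
import sqlite3

INFINITY = 2 ** 62  # stand-in for the 'infinity' timestamp


def open_db():
    """Create an in-memory database with one tuple-versioned table."""
    db = sqlite3.connect(":memory:")
    db.execute("""
        CREATE TABLE TestBoxes (
            idGenTestBox INTEGER PRIMARY KEY AUTOINCREMENT, -- generation ID
            idTestBox    INTEGER NOT NULL,
            tsEffective  INTEGER NOT NULL,  -- start of validity (inclusive)
            tsExpire     INTEGER NOT NULL,  -- end of validity (exclusive)
            sName        TEXT NOT NULL
        )""")
    return db


def set_name(db, id_testbox, name, now):
    """Historize the current row (if any) and insert the new version."""
    with db:  # both statements go into one transaction
        db.execute(
            "UPDATE TestBoxes SET tsExpire = ? "
            "WHERE idTestBox = ? AND tsExpire = ?",
            (now, id_testbox, INFINITY))
        db.execute(
            "INSERT INTO TestBoxes (idTestBox, tsEffective, tsExpire, sName) "
            "VALUES (?, ?, ?, ?)",
            (id_testbox, now, INFINITY, name))


def name_at(db, id_testbox, when):
    """Return the name that was valid at time `when`, or None."""
    row = db.execute(
        "SELECT sName FROM TestBoxes "
        "WHERE idTestBox = ? AND tsEffective <= ? AND tsExpire > ?",
        (id_testbox, when, when)).fetchone()
    return row[0] if row else None
```

Nothing in this toy schema stops two rows for the same idTestBox from having overlapping periods; it relies on set_name() being the only writer. On the PostgreSQL side, exclusion constraints over range types may be one way to have the engine enforce this, but that has not been evaluated for this design.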
After receiving an ACK the testbox will ask for work to do, i.e. continue with scenario #2. In the NACK case, it will sleep for 60 seconds and try again.
Actions:
Validate the testbox by looking the UUID up in the TestBoxes table. If not found, NACK the request. SQL:
SELECT idTestBox, sName FROM TestBoxes WHERE uuidSystem = :sUuid AND tsExpire = 'infinity'::timestamp;
Check if any of the information reported by the testbox script has changed. The two sizes are normalized first: memory size is rounded to the nearest 4 MB and scratch space is rounded down to the nearest 64 MB. If anything changed, insert a new row in the testbox table and historize the current one, i.e. set OLD.tsExpire to NEW.tsEffective and get a new value for NEW.idGenTestBox.
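The size normalization amounts to the following arithmetic, assuming both sizes are given in whole megabytes. The function names are ours, and rounding an exact tie upwards is an arbitrary choice of this sketch.

```python
def normalize_memory_mb(mem_mb):
    """Round the memory size to the nearest multiple of 4 MB (ties go up)."""
    return ((mem_mb + 2) // 4) * 4


def normalize_scratch_mb(scratch_mb):
    """Round the scratch space down to a multiple of 64 MB."""
    return (scratch_mb // 64) * 64
```

For instance, 8189 MB of memory becomes 8188 MB and 1000 MB of scratch space becomes 960 MB, so small fluctuations between sign-ons do not create a new testbox row each time.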
ACK the request and pass back the idTestBox.
Actions:
Validate the ID and IP by selecting the currently valid testbox row:
SELECT idGenTestBox, fEnabled, idSchedGroup, enmPendingCmd FROM TestBoxes WHERE id = :id AND uuidSystem = :sUuid AND ip = :ip AND tsExpire = 'infinity'::timestamp;
If NOT found return DEAD to the testbox client (it will go back to sign on mode and retry every 60 seconds or so - see scenario #1).
Contrary to the initial plans, we don't need to do anything more for the DEAD status.
Check with TestBoxStatuses (maybe joined with query from 1).
If enmState is 'gang-gathering': Goto scenario #6 on timeout or pending 'abort' or 'reboot' command. Otherwise, tell the testbox to WAIT [done].
If enmState is 'gang-testing': The gang has been gathered and execution has been triggered. Goto 5.
If enmState is not 'idle', change it to 'idle'.
If idTestSet is not NULL, CALL scenario #9 to clean it up.
If there is a pending abort command, remove it.
If there is a pending command and the old state doesn't indicate that it was being executed, GOTO scenario #3.
However, should none be found for some funky reason, returning DEAD will fix the problem (see above).
If the testbox was marked as disabled, respond with an IDLE command to the testbox [done]. (Note! Must do this after TestBoxStatuses maintenance from point 2, or abandoned tests won't be cleaned up after a testbox is disabled.)
Consider testcases in the scheduling queue, pick the first one which the testbox can execute. There is a concurrency issue here, so we put an exclusive lock on the SchedQueues table while considering its content.
The cursor we open looks something like this:
SELECT idItem, idGenTestCaseArgs, idTestSetGangLeader, cMissingGangMembers
FROM SchedQueues
WHERE idSchedGroup = :idSchedGroup
  AND (   bmHourlySchedule is NULL
       OR get_bit(bmHourlySchedule, :iHourOfWeek) = 1 ) --< does this work?
ORDER BY idItem ASC;
If no rows are returned (this can happen because no testgroups are associated with this scheduling group, the scheduling group is disabled, or because the queue is being regenerated), we will tell the testbox to IDLE [done].
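The hourly schedule test in the cursor above can be mirrored in Python as shown below. Two things are assumptions of this sketch and not taken from the design: hour 0 is Monday 00:00, and bit n lives in byte n // 8 at position n % 8 counted from the least significant bit. The real bit order has to match whatever get_bit does for the column type that ends up being used.

```python
from datetime import datetime

HOURS_PER_WEEK = 7 * 24  # 168 bits, i.e. 21 bytes


def hour_of_week(when):
    """Map a datetime to 0..167, with Monday 00:00 as hour 0 (assumed)."""
    return when.weekday() * 24 + when.hour


def make_bitmap(hours):
    """Build a 21 byte schedule bitmap with the given hours enabled."""
    bitmap = bytearray(HOURS_PER_WEEK // 8)
    for hour in hours:
        bitmap[hour // 8] |= 1 << (hour % 8)
    return bytes(bitmap)


def is_scheduled(bitmap, i_hour):
    """Mirror of: bmHourlySchedule is NULL OR get_bit(...) = 1."""
    if bitmap is None:
        return True  # no restriction, always eligible
    byte_index, bit_index = divmod(i_hour, 8)
    return bool((bitmap[byte_index] >> bit_index) & 1)
```

A testgroup restricted to run nightly would then carry a bitmap with only the night hours of each day set.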
- For each returned row we will:
Check testcase/group dependencies.
Select a build (and default testsuite) satisfying the dependencies.
Check the testcase requirements with that build in mind.
If idTestSetGangLeader is NULL, try allocate the necessary resources.
If it didn't check out, fetch the next row and redo from (a).
Tentatively create a new test set row.
- If not gang scheduling:
- Next state: 'testing'
- ElIf we're the last gang participant:
- Set idTestSetGangLeader to NULL.
- Set cMissingGangMembers to 0.
- Next state: 'gang-testing'
- ElIf we're the first gang member:
- Set cMissingGangMembers to TestCaseArgs.cGangMembers - 1.
- Set idTestSetGangLeader to our idTestSet.
- Next state: 'gang-gathering'
- Else:
- Decrement cMissingGangMembers.
- Next state: 'gang-gathering'
- If we're not gang scheduling OR cMissingGangMembers is 0:
Move the scheduler queue entry to the end of the queue.
Update our TestBoxStatuses row with the new state and test set. COMMIT;
EXEC response.
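The gang bookkeeping in the steps above can be condensed into one small function. This is only a reading of those steps: the function name and the dict based queue item are invented, and so are the tests used to detect the first member (no leader recorded yet) and the last member (exactly one member missing).

```python
def next_gang_state(c_gang_members, queue_item, id_test_set):
    """Update the queue item in place; return (next state, move to end?).

    queue_item is a dict with the keys 'idTestSetGangLeader' and
    'cMissingGangMembers', standing in for a SchedQueues row.
    """
    gang = c_gang_members > 1
    leader = queue_item["idTestSetGangLeader"]
    if not gang:
        state = "testing"
    elif leader is not None and queue_item["cMissingGangMembers"] == 1:
        # We are the last participant, the gang is complete.
        queue_item["idTestSetGangLeader"] = None
        queue_item["cMissingGangMembers"] = 0
        state = "gang-testing"
    elif leader is None:
        # We are the first member and become the gang leader.
        queue_item["cMissingGangMembers"] = c_gang_members - 1
        queue_item["idTestSetGangLeader"] = id_test_set
        state = "gang-gathering"
    else:
        queue_item["cMissingGangMembers"] -= 1
        state = "gang-gathering"
    move_to_end = (not gang) or queue_item["cMissingGangMembers"] == 0
    return state, move_to_end
```

With a gang of three, the first two callers get 'gang-gathering' and leave the queue entry where it is, while the third gets 'gang-testing' and the entry moves to the end of the queue.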
The EXEC response for a gang scheduled testcase includes a number of extra arguments so that the script knows the position of the testbox it is running on and of the other members. This means that the TestSet.iGangMemberNo is passed using --gang-member-no and the IP addresses of all the gang members using --gang-ipv4-<memb-no> <ip>.
WAIT
This is a subfunction of scenario #2 and #5.
As seen in scenario #2, the testbox will send 'abort' commands to /dev/null when it finds one when not executing a test. This includes when it reports that the test has completed (no need to abort a completed test, wasting a lot of effort when standing at the finish line).
The other commands, though, are passed back to the testbox. The testbox script will respond with an ACK or NACK as it sees fit. If NACKed, the pending command will be removed (pending_cmd set to none) and that's it. If ACKed, the state of the testbox will change to that appropriate for the command and the pending_cmd set to none. Should the testbox script fail to respond, the command will be repeated the next time it asks for work.
TODO
This is very similar to scenario #2
TODO
This is a subfunction of scenario #2.
When gathering a gang of testboxes for a testcase, we do not want to wait forever and have testboxes doing nothing for hours while waiting for partners. So, the gathering has a reasonable timeout (imagine something like 20-30 mins).
Also, we need some way of dealing with 'abort' and 'reboot' commands being issued while waiting. The easy way out is to pretend it's a timeout.
When changing the status to 'gang-timeout' we have to be careful. First of all, we need to exclusively lock the SchedQueues and TestBoxStatuses (in that order) and re-query our status. If it changed redo the checks in scenario #2 point 2.
If we still want to timeout/abort, change the state from 'gang-gathering' to 'gang-gathering-timedout' on all the gang members that have gathered so far. Then reset the scheduling queue record and move it to the end of the queue.
When acting on 'gang-timeout' the TM will fail the testset in a manner similar to scenario #9. No need to repeat that.
When a testbox completes a gang scheduled test, we will have to serialize resource cleanup (both globally and on testboxes) as they stop. More details can be found in the documentation of 'gang-cleanup'.
So, the transition from 'gang-testing' is always to 'gang-cleanup'. When we can safely leave 'gang-cleanup' is decided by the query:
SELECT COUNT(*) FROM TestBoxStatuses, TestSets WHERE TestSets.idTestSetGangLeader = :idTestSetGangLeader AND TestSets.idTestBox = TestBoxStatuses.idTestBox AND TestBoxStatuses.enmState = 'gang-running'::TestBoxState_T;
As long as there are testboxes still running, we stay in the 'gang-cleanup' state. Once there are none, we continue closing the testset and such.
TODO
This is a subfunction of scenario #1 and #2. The actions taken are the same in both situations. The precondition for taking this path is that the row in the testboxstatus table is referring to a testset (i.e. testset_id is not NULL).
Actions:
The UI needs to be able to clean up the remains of a testbox which for some reason is out of action. Normal cleaning up of abandoned testcases requires that the testbox signs on or asks for work, but if the testbox is dead or in some way indisposed, it won't be doing any of that. So, the testbox sheriff needs to have a way of cleaning up after it.
It's basically a manual scenario #9 but with some safe guards, like checking that the box hasn't been active for the last 1-2 mins (max idle/wait time * 2).
One of the testbox sheriff's tasks is to try to figure out the reason why something failed. The test manager will provide facilities for doing so from very early in its implementation.
We need to work out some useful status reports for the early implementation. Later there will be more advanced analysis tools, where for instance we can create graphs from selected test result values or test execution times.
This has changed for various reasons. The current plan is to implement the infrastructure (TM & testbox script) first and do a small deployment with the 2-5 test drivers in the Testsuite as basis. Once the bugs are worked out, we will convert the rest of the tests and start adding new ones.
We just need to finally get this done, no point in doing it piecemeal by now!
The implementation of the test manager and adjusting/completing of the testbox script and the test drivers are tasks which can be done by more than one person. Splitting up the TM implementation into smaller tasks should allow parallel development of different tasks and get us working code sooner.
The goal is getting the fundamental testmanager engine implemented, debugged and working. With the exception of testboxes, the configuration will be done via SQL inserts.
Tasks in somewhat prioritized order:
- Kick off test manager. It will live in testmanager/. Salvage as much as possible from att/testserv. Create basic source and file layout.
- Adjust the testbox script, part one. There currently is a testbox script in att/testbox, this shall be moved up into testboxscript/. The script needs to be adjusted according to the specification laid down earlier in this document. Installers or installation scripts for all relevant host OSes are required. Left for part two is result reporting beyond the primary log. This task must be 100% feature complete on all host OSes; there is no room for FIXME, XXX or @todo here.
- Implement the schedule queue generator.
- Implement the testbox dispatcher in TM. Support all the testbox script responses implemented above, including upgrading the testbox script.
- Implement simple testbox management page.
- Implement some basic activity and result reports so that we can see what's going on.
- Create a testmanager / testbox test setup. This lives in selftest/.
- Set up something that runs, no fiddly bits. Debug till it works.
- Create a setup that tests testgroup dependencies, i.e. real tests depending on smoke tests.
- Create a setup that exercises testcase dependency.
- Create a setup that exercises global resource allocation.
- Create a setup that exercises gang scheduling.
- Check that all features work.
The goal is getting to VBox testing.
Tasks in somewhat prioritized order:
- Implement full result reporting in the testbox script and testbox driver. A testbox script specific reporter needs to be implemented for the testdriver framework. The testbox script needs to forward the results to the test manager, or alternatively the testdriver report can talk directly to the TM.
- Implement the test manager side of the test result reporting.
- Extend the selftest with some setup that reports all kinds of test results.
- Implement script/whatever feeding builds to the test manager from the tinderboxes.
- The toplevel test driver is a VBox thing that must be derived from the base TestDriver class or maybe the VBox one. It should move from toptestdriver to testdriver and be renamed to vboxtltd or smth.
- Create a vbox testdriver that boots the t-xppro VM once and that's it.
- Create a selftest setup which tests booting t-xppro taking builds from the tinderbox.
The goal for this milestone is configuration and converting current testcases; the result will be a minimal test deployment (4-5 new testboxes).
Tasks in somewhat prioritized order:
- Implement testcase configuration.
- Implement testgroup configuration.
- Implement build source configuration.
- Implement scheduling group configuration.
- Implement global resource configuration.
- Re-visit the testbox configuration.
- Black listing of builds.
- Implement simple failure analysis and reporting.
- Implement the initial smoke tests modelled on the current smoke tests.
- Implement installation tests for Windows guests.
- Implement installation tests for Linux guests.
- Implement installation tests for Solaris guest.
- Implement installation tests for OS/2 guest.
- Set up a small test deployment.
After milestone #3 has been reached and issues found by the other team members have been addressed, we will probably go for full deployment.
Beyond this point we will need to improve reporting and analysis. There may be configuration aspects needing reporting as well.
Once deployed, a golden rule will be that all new features shall have test coverage. Preferably, implemented by someone else and prior to the feature implementation.
Status: $Id: AutomaticTestingRevamp.html 96564 2022-09-01 09:06:13Z vboxsync $
Copyright: Copyright (C) 2010-2020 Oracle Corporation.