Test

static/images/Archimede_bain.jpg

Greek mathematician Archimedes_ in his bath - 16th-century carving.

Archimedes' insight into water displacement led to the solution of a problem posed by Hiero of Syracuse: how to assess the purity of an irregular golden votive crown. Hiero had given his goldsmith the pure gold to be used, and correctly suspected he had been cheated, the goldsmith having removed gold and added the same weight of silver.
static/images/first_actual_case_of_bug_being_found.jpg

The First "Computer Bug" Moth found trapped between points at Relay # 70, Panel F, of the Mark II Aiken Relay Calculator while it was being tested at Harvard University, 9 September 1947. The operators affixed the moth to the computer log, with the entry: "First actual case of bug being found". (The term "debugging" already existed; thus, finding an actual bug was an amusing occurrence.)

A test (= trial = experiment) is a procedure to determine if a proposition is true_. (See decision problem.)

Contents

1   Functions

static/images/boehm_curve.png

Boehm's Curve, illustrating that bugs become more expensive to fix the later they are discovered.

Further:

Anything that tells you about a mistake earlier, not only makes things more reliable because you find the bugs, but the time you don't spend hunting bugs is time you can spend doing something else.

`James Gosling`_

Testing is a form of dynamic analysis.

People use tests for many purposes:

Writing tests has several side benefits, which may be just as important. Writing tests:

Tests generate feedback that informs developers about qualities of the system, such as how easily we can cut a version and deploy, how well the design works, and how good the code is.

1.1   Error localization

As tests get closer to what they test, it becomes easier to determine what a test failure means.

Often it takes considerable work to pinpoint the source of a test failure.

2   Behavior

2.1   White-box testing

White-box testing is a method of testing that tests internal structures or workings of an application, as opposed to its functionality.

This seems the least useful, as these tests are the most likely to be affected by change.

3   Substance

3.1   Diagnostics

static/images/tdd_cycle_improving_diagnostics.png

When tests fail, they give diagnostics. These diagnostics should be helpful. If they are not, the test code should be rewritten until the error messages guide us to the problem with the code. [2]

The easiest way to improve diagnostics is to keep each test small and focused. [2]

3.2   System-under-test

The system-under-test is the thing that is actually being tested.

3.3   Test requirement

A test requirement is a specific element of a software artifact that a test case must satisfy or cover.

Test requirements that cannot be satisfied are called infeasible, and this is sometimes the case. For instance, dead code is infeasible to test. The detection of infeasible test requirements is formally undecidable.

Test requirements usually come in sets.

4   Properties

4.2   Economy

4.2.1   Execution time

Large tests tend to take longer to execute.

Tests should run fast. Tests that take too long to run end up not being run.

In practice, this means tests should avoid communicating to databases, across networks, or with files.

4.3   Clarity

Testers should write tests so that their intention is clear to others.

4.4   Feedback

When writing unit and integration tests, we stay alert for areas of the code that are difficult to test. When we find a feature that's difficult to test, we don't just ask ourselves how to test it, but also why it's difficult to test. When code is difficult to test, the most likely cause is that the design needs improvement.

4.6   Validity

Validity is the extent to which a test answers the questions it is intended to answer.

Large tests, like integration tests, have low validity for many questions, because they do not help much to diagnose issues. On the other hand, unit tests, which have high isolation and few variables, give us useful information when trying to answer questions. In science, experiments are extremely isolated; scientists carefully ensure the control group matches the experimental group in every detail except the experimental one. Highly isolated tests also are more resistant to breaking when the system changes.

4.6.1   Content validity

In psychometrics, content validity (also known as logical validity) refers to the extent to which a measure represents all facets of a given social construct.

4.6.2   Criterion validity

In psychometrics, criterion validity is a measure of how well one variable or set of variables predicts an outcome based on information from other variables; it is achieved if, for example, a set of measures from a personality test relates to a behavioral criterion on which psychologists agree.

4.6.3   Construct validity

Construct validity is the degree to which a test measures what it claims, or purports, to be measuring (contrast with face validity).

4.6.3.1   Convergent validity

Convergent validity refers to the degree to which two measures of constructs that theoretically are related, are in fact related.

4.6.3.2   Discriminant validity

Discriminant validity refers to the degree to which two measures of constructs that are theoretically unrelated are, in fact, unrelated.

Take, for example, a construct of general happiness. If a measure of general happiness has convergent validity, then constructs similar to happiness (satisfaction, contentment, cheerfulness, etc.) should relate closely to the measure of general happiness. If this measure has discriminant validity, then constructs that are not supposed to be related to general happiness (sadness, depression, despair, etc.) should not relate to the measure of general happiness.

4.7   Internal Validity

Internal validity is the extent to which causal inferences of a study are warranted.

Such warrant is constituted by the extent to which a study minimizes systematic error (or 'bias').

Inferences are said to possess internal validity if a causal relation between two variables is properly demonstrated.

Indicators of high internal validity include:

  • Respondent validation ("member checks")

Indicators of low internal validity include:

  • Confounding variables
  • Selection bias
  • History
  • Maturation
  • Repeated testing
  • Instrument change
  • Regression toward the mean
  • Mortality
  • Selection-maturation interaction
  • Diffusion
  • Compensatory rivalry/resentful demoralization
  • Experimenter bias

4.8   External Validity (Generalizability)

External validity is the extent to which generalizability of the results of a study is warranted.

There are two kinds of generalizability:

  • Across situations: The extent to which we can generalize from the situation constructed by an experimenter to real-life situations.
  • Across people: The extent to which we can generalize from the people who participated in the experiment to people in general.

Indicators of high external validity include:

  • Mundane realism: The extent to which an experiment is similar to a real-life situation. Field experiments possess high mundane realism.
  • Psychological realism: The extent to which the psychological processes triggered in an experiment are similar to psychological processes that occur in real-life situations. Self-reported behavioral predictions possess low psychological realism.
  • Random population: Subjects in experiments are randomly chosen.
  • Replication: conducting the study over again, generally with different subject populations or in different settings.
  • Meta-analysis: The extent to which the results of similar but unique studies corroborate.

Indicators of low external validity include:

  • Aptitude-treatment interaction:
  • Situation: That an experiment is conducted in a specific environment limits generalizability.
  • Pre-test effects:
  • Post-test effects
  • Reactivity: Difference in subject behavior may be a result of the subject being aware of the observer, the subject's expectation of the treatment, or novelty, rather than because of treatment.
  • Rosenthal effects: Interviewer expectations for the candidate risk self-fulfilling prophecies.

5   Concepts

5.1   Seam

A seam is a place in a program_ where you can alter behavior without editing in that place.

Every seam has an enabling point, a place where you can make the decision to use one behavior or another.

The types of seams available to us vary among programming languages.

5.1.1   Preprocessing seams

Only relevant for languages with a preprocessor, e.g., C.

5.1.3   Object seam

An object seam is created when we refactor code into its own method and then subclass that object to change the behavior.

Object seams rely on dynamic dispatch; method calls in an object-oriented program do not define which method will actually be executed.

For example in the following code, a reference to cell.update() is ambiguous:
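A minimal Python sketch of such a call (the Cell and FormulaCell class names and the recalculate function are hypothetical, chosen only for illustration):

class Cell:
    def update(self):
        print("recalculating a plain cell")

class FormulaCell(Cell):
    def update(self):
        print("re-evaluating a formula")

def recalculate(cell):
    # Which update() runs is not fixed here; it depends on the runtime
    # class of cell (Cell, FormulaCell, or a subclass written by a test).
    cell.update()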

However, method calls are not always seams. Consider the following:
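For instance, continuing the hypothetical sketch above, if the cell is constructed inside the function, the call site is not a seam:

def recalculate():
    cell = FormulaCell()   # the class of the cell is decided right here
    cell.update()          # not a seam: we cannot change which update()
                           # runs without editing this function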

Since the choice of update depends on the class of the cell and the class of the cell is decided when the object is created, we cannot change this code without modifying it. We could refactor to make cell.update a seam:
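One possible refactoring in Python, keeping the same hypothetical names, is to let callers supply the cell, so the parameter becomes the enabling point:

def recalculate(cell=None):
    if cell is None:
        cell = FormulaCell()   # production behavior is unchanged
    cell.update()              # now a seam

class SensingCell(Cell):
    def update(self):
        self.updated = True    # sense the call instead of recalculating

def test_recalculate_updates_the_cell():
    cell = SensingCell()
    recalculate(cell)          # enabling point: the cell argument
    assert cell.updated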

5.1.4   Coverage criterion

A coverage criterion is a rule or collection of rules that impose test requirements on a test set. Coverage criteria describe the test requirements in a complete and unambiguous manner.

Thus we turn to coverage criteria, which decide which test inputs to use. Coverage criteria are defined in terms of test requirements.

We measure test sets against a criterion in terms of coverage. Given a set of test requirements TR for a coverage criterion C, a test set T satisfies C if and only if for every test requirement tr in TR, at least one test t in T exists such that t satisfies tr. The coverage level is the ratio of the number of test requirements satisfied by the test set to the total number of test requirements.
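As a toy illustration of the definition (everything here is made up): model each test requirement as a predicate over a test, so the coverage level is the fraction of requirements satisfied by at least one test.

def coverage_level(test_requirements, test_set):
    satisfied = [tr for tr in test_requirements
                 if any(tr(t) for t in test_set)]
    return len(satisfied) / len(test_requirements)

# e.g. three requirements: exercise a negative, a zero, and a positive input
requirements = [lambda t: t < 0, lambda t: t == 0, lambda t: t > 0]
print(coverage_level(requirements, [-5, 7]))   # 2 of 3 satisfied: ~0.67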

5.1.5   Coverage

Only trivial programs can be tested exhaustively.

Units of software often can take an infinite number of inputs. Therefore, exhaustive testing is not possible.

Complete coverage does not mean there are no errors. For one, executing bad code does not necessarily lead to an error. But more likely, coverage does not actually check the execution path: coverage checks physical lines rather than logical lines. Therefore a line like x = 1 if y else 2 will be "covered" even if y always had the same value.
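A small sketch of that point (pick and its test are hypothetical; any line-based coverage tool behaves this way):

def pick(y):
    x = 1 if y else 2   # one physical line, two logical branches
    return x

def test_pick_truthy():
    assert pick(True) == 1
    # Line-based coverage reports this function as fully covered even
    # though the else branch (x = 2) never executed.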

One assumption Nagappan examined was the relationship between code coverage and software quality. Code coverage measures how comprehensively a piece of code has been tested; if a program contains 100 lines of code and the quality-assurance process tests 95 lines, the effective code coverage is 95 percent. The ESM group measures quality in terms of post-release fixes, because those bugs are what reach the customer and are expensive to fix. The logical assumption would be that more code coverage results in higher-quality code. But what Nagappan and his colleagues saw was that, contrary to what is taught in academia, higher code coverage was not the best measure of post-release failures in the field. [5]

Code coverage is not indicative of usage. If 99 percent of the code has been tested, but the 1 percent that did not get tested is what customers use the most, then there is a clear mismatch between usage and testing. There is also the issue of code complexity; the more complex the code, the harder it is to test. So it becomes a matter of leveraging complexity versus testing. After looking closely at code coverage, Nagappan now can state that, given constraints, it is more beneficial to achieve higher code coverage of more complex code than to test less complex code at an equivalent level. [5]

6   Test-Driven Development Cycle

static/images/tdd_cycle_start.png

Source: Freeman Pryce 2010. Move to section on TDD cycle

TDD teams produced code that was 60 to 90 percent better in terms of defect density than non-TDD teams. They also discovered that TDD teams took longer to complete their projects—15 to 35 percent longer. [5]

7   Classification

static/images/software_testing_v_model.png

The "V model" of software development activities and testing levels. [4]

Testers derive information for each kind of test from the associated development activity. [4] Standard advice is to design the tests concurrently with each development activity, even though the software will not be in an executable form until the implementation phase, since the process of articulating tests can identify defects in design decisions. [4]

Although most of the literature emphasizes these levels in terms of when they are applied, a more important distinction is on the types of faults that we are looking for. The faults are based on the software artifact that we are testing, and the software artifact that we derive the tests from. For example, unit and module tests are derived to test units and modules, and we usually try to find faults that can be found when executing the units and modules individually. [4]

One of the best examples of the differences between unit testing and system testing can be illustrated in the context of the infamous Pentium bug. In 1994, Intel introduced its Pentium microprocessor, and a few months later, Thomas Nicely, a mathematician at Lynchburg College in Virginia, found that the chip gave incorrect answers to certain floating-point division calculations. The chip was slightly inaccurate for a few pairs of numbers; Intel claimed (probably correctly) that only one in nine billion division operations would exhibit reduced precision. The fault was the omission of five entries in a table of 1,066 values (part of the chip’s circuitry) used by a division algorithm. The five entries should have contained the constant +2, but the entries were not initialized and contained zero instead. The MIT mathematician Edelman claimed that “the bug in the Pentium was an easy mistake to make, and a difficult one to catch,” an analysis that misses one of the essential points. This was a very difficult mistake to find during system testing, and indeed, Intel claimed to have run millions of tests using this table. But the table entries were left empty because a loop termination condition was incorrect; that is, the loop stopped storing numbers before it was finished. This turns out to be a very simple fault to find during unit testing; indeed analysis showed that almost any unit level coverage criterion would have found this multimillion dollar mistake. [4]

Some faults can only be found at the system level. One dramatic example was the launch failure of the first Ariane 5 rocket, which exploded 37 seconds after liftoff on June 4, 1996. The low-level cause was an unhandled floating-point conversion exception in an internal guidance system function. It turned out that the guidance system could never encounter the unhandled exception when used on the Ariane 4 rocket. In other words, the guidance system function is correct for Ariane 4. The developers of the Ariane 5 quite reasonably wanted to reuse the successful inertial guidance system from the Ariane 4, but no one reanalyzed the software in light of the substantially different flight trajectory of Ariane 5. Furthermore, the system tests that would have found the problem were technically difficult to execute, and so were not performed. The result was spectacular, and expensive! [4]

Another public failure was the Mars lander of September 1999, which crashed due to a misunderstanding in the units of measure used by two modules created by separate software groups. One module computed thruster data in English units and forwarded the data to a module that expected data in metric units. This is a very typical integration fault (but in this case enormously expensive, both in terms of money and prestige). [4]

One final note is that object-oriented (OO) software changes the testing levels. OO software blurs the distinction between units and modules, so the OO software testing literature has developed a slight variation of these levels. Intramethod testing is when tests are constructed for individual methods. Intermethod testing is when pairs of methods within the same class are tested in concert. Intraclass testing is when tests are constructed for a single entire class, usually as sequences of calls to methods within the class. Finally, interclass testing is when more than one class is tested at the same time. The first three are variations of unit and module testing, whereas interclass testing is a type of integration testing. [4]

7.1   Acceptance test

An acceptance test is a test that determines whether the completed software meets a need captured during requirements analysis; whether the software does what the users want. [4] Acceptance testing must involve users or other individuals who have strong domain knowledge. [4]

Acceptance tests should be written using only terminology from the application's domain, not from the underlying technologies. [2] This helps us understand what the system should do, without tying us to any of our initial assumptions about the implementation or complicating the test with technological details. This also shields our acceptance test suite from changes to the system's technical infrastructure.

An acceptance test (end-to-end test) is a test that exercises the system end-to-end. [2] Acceptance tests interact with the system only from the outside: through its user interface, or by sending messages as if from third-party systems. [2]

For example, developers writing acceptance tests for interfaces to web applications typically use Selenium to simulate how the user interacts with a web browser.

Ideally, this will also test the build system.

Running acceptance tests tells us about the external quality of our system. [2]

Writing them tells us something about how well the team understands the domain; they help us clarify what we want to achieve. They do not tell us anything about how well we've written the code. [2]

7.2   System test

static/images/eiffel_tower_dimensions.JPG

A system test is a test that determines if an assembled system meets the technical specification produced during architecture (which is intended to meet the requirements). [4] It assumes components work individually and checks that they work as a whole. This level of testing usually looks for design and specification problems. [4]

7.3   Integration test

An integration test is a test that verifies components integrate together -- whether interfaces between modules have consistent assumptions and communicate correctly -- typically with external services that cannot be changed. [2]

This is important when working with libraries, because you may be reading documentation that is newer than the version actually on your system. If you only write unit tests with mocks, then you will not discover such an issue until it hits production.

  • Subsystem design specifies structure and behavior of subsystems, each of which is intended to satisfy some function in the overall architecture [4]

Integration testing is designed to determine whether the interfaces between modules in a given subsystem have consistent assumptions and communicate correctly. [4] Integration testing must assume that modules work correctly. [4]

7.4   Module test

Detailed design determines the structure and behavior of individual modules. [4]

A program unit, or procedure, is one or more contiguous program statements with a name that other parts of the software use to call it. [4] Units are called functions in C and C++, procedures or functions in Ada, methods in Java, and subroutines in Fortran. [4] A module is a collection of related units that are assembled in a file, package, or class. [4] This corresponds to a file in C, a package in Ada, and a class in C++ and Java. [4] Module testing is designed to assess individual modules in isolation, including how the component units interact with each other and their associated data structures. [4] Most software development organizations make module testing the responsibility of the programmer. [4]

7.5   Functional test

Functional tests ensure that the higher-level code you have written functions in the way that users of your application would expect. For example, a functional test might ensure that the correct form was displayed when a user visited a particular URL or that when a user clicked a link, a particular entry was added to the database. [9]

7.6   Unit test

A unit test tests an isolated piece of software; typically an atomic behavioral unit of a system. In procedural code, the units are often functions, and in OO code, the units are classes.

Unit testing can be difficult since top-level functions call other functions, which call other functions, and so on.

Use test doubles.
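For example, a sketch using Python's unittest.mock to stand in for a collaborator (the convert and fetch_rate names are hypothetical):

from unittest import mock

def convert(amount, fetch_rate):
    # fetch_rate would normally call out to a remote exchange-rate service
    return amount * fetch_rate("USD", "EUR")

def test_convert_uses_the_fetched_rate():
    fake_rate = mock.Mock(return_value=0.5)    # test double for the service
    assert convert(10, fake_rate) == 5.0
    fake_rate.assert_called_once_with("USD", "EUR")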

Writing unit tests tells us about the internal quality of our system. Running them tells us we haven't broken anything, but gives us little confidence that the whole system works. [2]

Thorough unit testing helps us improve the internal quality because, to be tested, a unit has to be structured to run outside the system in a test fixture. A unit test for an object needs to create the object, provide its dependencies, interact with it, and check that it behaved as expected. So, for a class to be easy to unit-test, the class must have explicit dependencies that can easily be substituted and clear responsibilities that can easily be invoked and verified. In software engineering terms, that means that the code must be loosely coupled and highly cohesive; in other words, well-designed. [2]

Unit tests help us design classes and give us confidence that they work.

Implementation is the phase of software development that actually produces code. Unit testing is designed to assess the units produced by the implementation phase and is the “lowest” level of testing. In some cases, such as when building general-purpose library modules, unit testing is done without knowledge of the encapsulating software application. As with module testing, most software development organizations make unit testing the responsibility of the programmer. It is straightforward to package unit tests together with the corresponding code through the use of tools such as JUnit for Java classes. [4]

7.7   Regression test

A software regression is a software bug which makes a feature stop functioning as intended after a certain event (for example, a system upgrade).

The intent of regression testing is to ensure that a change has not introduced new faults.


Tangentially related is regression testing, which is testing done after changes are made to software; its purpose is to help ensure that the updated software still possesses the functionality it had before the updates. [4]

8   Running tests

Tests should be first divided into unit and integration tests, and then by component, rather than vice-versa. This is because integration tests often have very different needs than unit tests. [7]

Unit tests and integration tests should run sequentially, not in parallel, since they may interact in surprising ways. Therefore, if you are using tox, you should generally organize it like this [7]:

[tox]
envlist = unit, integration

[testenv]
deps =
    -r{toxinidir}/requirements.txt
    pytest

[testenv:unit]
commands = py.test -m "not integration" {posargs:tests/}

[testenv:integration]
commands = py.test -m integration {posargs:tests/}
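The -m filters above assume that integration tests carry a pytest marker, roughly as sketched below (the test names are hypothetical, and the integration marker should also be registered under markers = in pytest.ini or setup.cfg to avoid warnings):

import pytest

def test_parse_line_handles_missing_query():
    ...   # ordinary unit test: no marker, selected by -m "not integration"

@pytest.mark.integration
def test_writes_rows_to_a_real_database():
    ...   # selected only by -m integration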

9   Techniques

9.1   Assertions

There is systematic use of assertions in some Microsoft components, as well as synchronization between the bug-tracking system and source-code versions; this made it relatively easy to link faults against lines of code and source-code files. The research team managed to find assertion-dense code in which assertions had been used in a uniform manner; they collected the assertion data and correlated assertion density to code defects. [5] The team observed a definite negative correlation: more assertions and code verifications mean fewer bugs. Looking behind the straight statistical evidence, they also found a contextual variable: experience. Software engineers who were able to make productive use of assertions in their code base tended to be well-trained and experienced, a factor that contributed to the end results. [5]

9.1.1   Compound assertions

Use compound assertions to avoid duplication. For example, consider the following:

assertEqual('Joe', person.getFirstName())
assertEqual('Bloggs', person.getLastName())
assertEqual(23, person.getAge())

If these assertions are frequently used, then it may make sense to create a custom assertPersonEqual assertion.
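A sketch of what that could look like as a unittest mixin (assertPersonEqual is the name suggested above; make_person is a hypothetical factory for the person object from the example):

import unittest

class PersonAssertionsMixin:
    def assertPersonEqual(self, first, last, age, person):
        self.assertEqual(first, person.getFirstName())
        self.assertEqual(last, person.getLastName())
        self.assertEqual(age, person.getAge())

class PersonTest(PersonAssertionsMixin, unittest.TestCase):
    def test_new_person(self):
        person = make_person('Joe', 'Bloggs', 23)   # hypothetical factory
        self.assertPersonEqual('Joe', 'Bloggs', 23, person)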

Rails ships with a number of custom assertions. [3]

9.2   Custom matchers

A matcher is an object that reports whether a given object matches some criteria. For example:

assertThat(s, not(containsString("bananas")))

They are used by Hamcrest in Java. In Python, we could use predicate functions.

They also report error messages.
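A rough Python equivalent uses predicate functions that also carry a description for failure messages (the names here are made up; PyHamcrest provides a much richer version of the same idea):

def contains_string(needle):
    def matcher(actual):
        return needle in actual
    matcher.description = "a string containing %r" % needle
    return matcher

def assert_that(actual, matcher):
    if not matcher(actual):
        raise AssertionError(
            "Expected %s, but got %r" % (matcher.description, actual))

assert_that("a bunch of apples", contains_string("apples"))    # passes
assert_that("a bunch of apples", contains_string("bananas"))
# AssertionError: Expected a string containing 'bananas', but got 'a bunch of apples'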

9.3   Decorators

You can test decorators by refactoring the code into another non-decorator function and then making the decorator only a convenient wrapper.

For example:

# before: the decorator is applied at definition time, so the undecorated
# function is never directly accessible
from flask import Flask

app = Flask(__name__)

@app.route("/")
def hello():
    return "Hello World!"

# after: the route registration is just a thin wrapper around a plain,
# directly testable function
def _hello():
    return "Hello World!"

hello = app.route("/")(_hello)
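With the decorator reduced to a thin wrapper, a test can exercise the undecorated function directly (a sketch):

def test_hello_returns_greeting():
    assert _hello() == "Hello World!"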

9.4   Testing Abstract Classes

The right way to test abstract classes is mostly just to get rid of them.

  • If the subclasses rely on shared functionality, then make the abstract class concrete and compose the subclasses with their former parent.
  • If the subclasses do not rely on shared functionality, then they are probably there to implement abstract methods, in which case what we really want are things that implement an interface. So create an interface, make the subclasses implement it, make the abstract class concrete, and then compose it with the implementers of the interface.

Exactly under what circumstances should an abstract class exist? The set of such circumstances seems narrow.

9.5   Differences across Operating Systems

Systems which call out to utilities provided by the operating system may not be portable across different machines. For example, a system that invokes the fuser utility to check if any computer processes are using a file will experience different results on Darwin and Linux. For instance, if no processes are using a file, fuser exits with 0 on Darwin (which uses BSD fuser) but 1 on Linux (which uses GNU fuser). Another example might be compiling C programs with gcc.

These differences may be discovered before pushing to production by QA on staging, or by running tests on continuous integration servers that use the same machines as production. Another solution is to use production settings (usually GNU) everywhere by using Vagrant_ locally (the issue with Vagrant is that there is no Amazon Linux Vagrant image and never will be, so it still doesn't match production). Another solution, when possible, is to write or find a library which replicates the behavior of the utility in a portable way (such as Python's os library).

9.6   Tips for CI servers

  • When configuring continuous integration servers with Python, make sure to remove .pyc files, otherwise they can cause subtle bugs if the workspace is reused (e.g. if one PR turns a module into a package and another doesn't, the second one will fail because the package/__init__.pyc will persist and be read first). A simple way to do this is via find . -name "*.pyc" -delete.

10   Grading

The level of testing should match the cost of failure.

A good study ranks highly on each of the following criteria:

  1. Morality: the degree to which a study respects ethical standards
  2. Economy: the cost of a study
  3. Validity: the degree to which a study answers the questions it intends to answer (for example, an educational test)
  4. Reliability: the degree to which a study gives consistent results
  5. Generalizability: the degree to which a study can generalize its findings to larger populations

In practice, researchers routinely make tradeoffs: for instance, compromising rigor to reduce costs, or trading control over the environment (which rules out confounding variables) against the ability to generalize results to everyday life.

11   Layout

Rails lays out its tests like this:

tests/
    controllers/
    functional/
    helpers/
    integration/
    mailers/
    models/
    unit/

12   Production

Tests can be derived from requirements and specifications, design artifacts, or source code.

13   Tools

14   See also

15   Further reading

16   References

[1] `Myers 2005`_
[2] Freeman Pryce 2010
[3] http://api.rubyonrails.org/classes/ActionDispatch/Assertions.html
[4] Paul Ammann & Jeff Offutt. 2008. Introduction to Software Testing. Ammann Offutt 2008
[5] Janie Chang. October 7, 2009. Microsoft Research. Exploding Software-Engineering Myths.
[6] James A. Whittaker; Jason Arbon; Jeff Carollo. March 23, 2012. How Google Tests Software.
[7] Eric Allen. Conversation.
[8] James Whittaker. Aug 27, 2012. EuroSTAR Software Testing Video: Ten Minute Test Plan with James Whittaker. https://www.youtube.com/watch?v=QEu3wmgTLqo (watch from 5:00 onward)
[9] James Gardner. http://pylonsbook.com/en/1.0/testing.html

The legacy code dilemma: when we change code, we should have tests in place. To put tests in place, we often have to change code.

When you break dependencies in legacy code, you often have to do so conservatively and in ugly ways. Not all dependencies break cleanly. If you can get the code under test later, you can repair the ugly changes then.


Roughly, making a change in a legacy code base consists of five steps:

  1. Identify change points
  2. Find test points
  3. Break dependencies
  4. Write tests
  5. Make changes and refactor

Sensing and separation

Generally, when we want to get tests in place, there are two reasons to break dependencies:

  1. Sensing: Break dependencies to sense when we can't access values our code computes
  2. Separation: Break dependencies to separate when we can't get a piece of code into a test harness to run

We can't sense the effect of our calls to methods on this class, and we can't run it separately from the rest of the application.


Having a monolithic, coincidentally cohesive code base makes running tests slower, because you have to run all the tests.


You don't take a test to determine your level. Rather, you take a test designed for a particular level. You either pass or fail.

I wonder if some standardized tests can be, in principle, constructed to measure programming abilities.

The tests must be crafted such that, even if all the questions are known to the public, studying for the test is the same thing as building up your skill level.


(from myself)

Tests are used instead of proofs of correctness ("formal verification") because it is often impossible or impractical to prove a non-trivial program correct. The next best option is to show that a program is probably not incorrect. Analysis, then, is all about reducing the probability that bugs exist in a program.

There are two types of analysis: static and dynamic. Static analysis tools include type checking, contracts, syntax checkers, and style checkers. Dynamic analysis tools include unit tests, integration tests, regression tests, and assertions.

Fuzz testing involves calling a function with random inputs and checking if the program crashes or if any assertions fail.
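A minimal sketch, reusing the hypothetical parse_line log parser that appears later in these notes and checking only that it never raises:

import random
import string

def test_parse_line_survives_random_input():
    for _ in range(1000):
        line = "".join(random.choice(string.printable) for _ in range(80))
        parse_line(line)   # a crash or assertion failure here is a finding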

Boundary testing involves designing test cases around the extremes of the input domain, e.g. maximum, minimum, and just inside/outside boundaries.
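For a function that, say, accepts ages from 0 to 120 (is_valid_age is hypothetical), boundary tests look something like this:

def test_age_boundaries():
    assert not is_valid_age(-1)    # just outside the minimum
    assert is_valid_age(0)         # minimum
    assert is_valid_age(120)       # maximum
    assert not is_valid_age(121)   # just outside the maximum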

Field testing involves releasing code, relying on users to report bugs, and then writing a test case for each bug to ensure it doesn't happen again.

Equivalence partitioning maximizes code coverage while minimizing the number of test cases written. It's often impossible or impractical to test every possible input (e.g. consider a function which accepts any integer). Writing tons of tests to test similar inputs is a waste of time. Figure out which inputs will cause your code to run differently and write a test for each of them.
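For example, with a hypothetical sign function over the integers, the input domain partitions into negatives, zero, and positives, and one representative per class is enough:

def test_sign_of_negative_numbers():
    assert sign(-42) == -1    # stands in for every negative integer

def test_sign_of_zero():
    assert sign(0) == 0       # zero is its own partition

def test_sign_of_positive_numbers():
    assert sign(7) == 1       # stands in for every positive integer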


Instead of distinguishing between code, integration, and system testing, Google uses the language of "small", "medium", and "large" tests, emphasizing scope over form. [6] The original purpose of using small, medium, and large was to standardize terms that so many testers brought in from other employers, where smoke tests, BVTs, integration tests, and so on had multiple and conflicting meanings. Those terms had so much baggage that it was felt that new ones were needed. [6]

Small tests cover a single unit of code in a completely faked environment. Medium tests cover multiple and interacting units of code in a faked or real environment. Large tests cover any number of units of code in the actual production environment with real and not faked resources. [6]


Unit tests should aim for high coverage. According to Eric, 60% is good and 80% is a solid target.

Integration tests should disregard code coverage. Instead they should cover the major behaviors you expect from the system. What lines they cover is beside the point.


Too many assertions, and especially the wrong kinds of assertions, create a maintenance hazard in your test suite. Let's illustrate the point with an example: a function which parses a log line from a web server in a hypothetical pipe-separated format, returning a dictionary:

def test_it_parses_log_lines(self):
    line = "2015-03-11T20:09:25|GET /foo?bar=baz|..."
    parsed_dict = parse_line(line)

It is tempting to use the familiar assertEqual with a dictionary whose contents are identical to the parsed result we expect:

def test_it_parses_log_lines(self):
    line = "2015-03-11T20:09:25|GET /foo?bar=baz|..."
    parsed_dict = parse_line(line)
    self.assertEqual({
        "date": datetime(2015, 3, 11, 20, 9, 25),
        "method": "GET",
        "path": "/foo",
        "query": "bar=baz",
    }, parsed_dict)

This will certainly work, assuming the parse function is implemented correctly. But what kind of error messages will we get from this test? Take a look:

AssertionError: {'date': datetime.datetime(2015, 3, 11, 20, 9, 25), 'path': '/foo', 'method': 'G [truncated]... != {'date': datetime.datetime(2015, 3, 11, 20, 9, 25), 'path': '/foo?', 'method': ' [truncated]...

Can you see what the error is? No? Neither can I. Fortunately, we can fix this fairly easily:

def test_it_parses_log_lines(self):
    line = "2015-03-11T20:09:25|GET /foo?bar=baz|..."
    parsed_dict = parse_line(line)
    self.assertEqual(datetime(2015, 3, 11, 20, 9, 25), parsed_dict["date"])
    self.assertEqual("GET", parsed_dict["method"])
    self.assertEqual("/foo", parsed_dict["path"])
    self.assertEqual("bar=baz", parsed_dict["query"])

Now when our test fails, we can see exactly what's going on:

AssertionError: '/foo' != '/foo?'

We fix the bug, and move on with our lives. Eventually, we'll expand the functionality of the parser, and add more test cases. Inevitably, these assertions will be copied and pasted, with some modification:

def test_it_parses_get_request_log_lines(self):
    # ...
    self.assertEqual("GET", parsed["method"])
    self.assertEqual("/foo", parsed["path"])
    self.assertEqual("bar=baz", parsed["query"])

def test_it_parses_post_request_log_lines(self):
    # ...
    self.assertEqual("POST", parsed["method"])
    self.assertEqual("/foo", parsed["path"])
    self.assertEqual("bar=baz", parsed["query"])

def test_it_parses_log_lines_with_oof_requests(self):
    # ...
    self.assertEqual("GET", parsed["method"])
    self.assertEqual("/oof", parsed["path"])
    self.assertEqual("bar=baz", parsed["query"])

Notice that only one assertion is actually changing in each test, but we're repeating most of the others. This means that if we break one feature -- say request path parsing -- we'll get failures from most of our test cases. In order to figure out what's actually broken, we'll have to dig through each failure and think hard about each test case to figure out what to fix.

You should ask yourself, for each test case, "What would I do when this test case fails?" We can minimize the effort required to translate a test failure into a fix by following one simple rule: make only one assertion per test case. Yup, only one.

Let's rewrite this test suite using the new rule:

def test_it_parses_request_method(self):
    # ...
    self.assertEqual("GET", parsed["method"])

def test_it_parses_request_path(self):
    # ...
    self.assertEqual("/foo", parsed["path"])

def test_it_parses_query_string(self):
    # ...
    self.assertEqual("bar=baz", parsed["query"])

def test_it_gives_none_when_no_query_string(self):
    # ...
    self.assertIsNone(parsed["query"])

Now, when we break request path parsing, we'll only get one test case failure, and the test case name and failure traceback will point us directly at what we need to fix: the request path parsing portion of our code.

When you adopt this style of test design, you'll also find that the way you write your test cases changes. Before, we had test cases representing some special cases we expect in the real world: an "oof" request, a POST request, etc.

Now, our test cases reflect the orthogonal behaviors of the code under test: it parses the request method, path, query string, etc. Not only are the test failures more instructive, but our test cases now serve as better documentation: scanning the names tells us what the parse_line function actually does (and, by extension, what it doesn't do).


It seems like a very reasonable idea to me that when mocking out external services, you always create a mixin class for the service. For example, if you are writing a test which communicates with a Facebook SDK, you should write a class, MockFacebookSDK, which can be mixed into TestCases to mock Facebook. This way you have a reusable component. The alternative is that each test takes care of its own mocking, which is very redundant.
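A sketch of the idea with unittest.mock (MockFacebookSDK and TestCases are the names used above; the myapp.facebook_sdk.Client import path, the share_article function, and the post method are all hypothetical):

import unittest
from unittest import mock

class MockFacebookSDK:
    """Mixin that replaces the Facebook client for the duration of each test."""

    def setUp(self):
        super().setUp()
        patcher = mock.patch("myapp.facebook_sdk.Client")   # hypothetical path
        self.facebook = patcher.start()
        self.addCleanup(patcher.stop)

class ShareTest(MockFacebookSDK, unittest.TestCase):
    def test_share_posts_to_facebook(self):
        share_article("http://example.com")   # hypothetical code under test
        self.facebook.return_value.post.assert_called_once()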


"Code that is hard to test in isolation is poorly designed" is a non sequitur. Tests may induce design damage; change to your code that either facilitate either test-first, speedy test, or unit tests by harming the clarity of code. [1]_

[1] http://david.heinemeierhansson.com/2014/test-induced-design-damage.html

Mutation testing (or mutation analysis or program mutation) is used to design new software tests and evaluate the quality of existing software tests. Mutation testing involves modifying a program in small ways. Each mutated version is called a mutant, and tests detect and reject mutants by causing the behavior of the original version to differ from the mutant. This is called killing the mutant.

New tests can be designed to kill additional mutants. Mutants are based on well-defined mutation operators that either mimic typical programming errors (such as using the wrong operator or variable name) or force the creation of valuable tests (such as dividing each expression by zero).