Testing Methods

The importance of testing has long been recognized. In 1879, N.W. Ayer & Sons used ad testing to land a major account. By 1920, ad testing was one of the standard services offered by advertising agencies and researchers.

In times of great competition for consumers, marketers rely more heavily on ad testing. During the Great Depression, for example, U.S. and Canadian manufacturers, ad agencies and advertising research firms developed new testing methods and expanded existing techniques.

The earliest testing methods were behavioral measures. The ad agency Lord & Thomas began an ongoing testing operation in the U.S. in 1900. It requested that its clients provide mail-order and sales-fluctuation records for all advertised products. By 1906 its Record of Results Department was analyzing data from more than 600 clients.

In 1933, A.C. Nielsen Co. (later ACNielsen Corp.) began to audit product sales in food stores and drugstores, providing the first widely available measure of share of sales. Package-goods marketers subscribing to the service could compare sales response with the geographical coverage of the media vehicles in which their ads appeared. Later store audits used scanner data to record sales.

Panel data combine TV viewing and purchase behavior. For example, ScanAmerica, a joint venture of Control Data's Arbitron Ratings Co. and Selling Areas-Marketing Inc. (SAMI), began collecting panel data in Denver in 1985. Participants record purchases at home by passing a penlike wand over bar codes. The information is transmitted daily through Arbitron's people meter device attached to participants' TV sets; telephone lines carry people meter data to Arbitron's electronic data centers for compilation and analysis.

With split viewing, households in a community are divided into equivalent demographic groups, their purchase behavior is monitored and test spots are inserted into the TV or cable signal of a subset of households. The method requires at least six months and is very costly ($200,000 to $300,000 per test). Scanner and panel data are less expensive because subscribers share the costs.

Information Resources Inc. (IRI) and ACNielsen have services that combine split viewing and scanner data. IRI monitors TV viewing in more than 3,000 households and can test commercials over cable systems. When a panel household shops at the supermarket, an identification card is swiped and all sales data are sent to IRI. ACNielsen's system has the advantage of testing on-air commercials; thus, its sample is not skewed toward higher-income cable subscribers.

In the 1920s, the primary ad testing method was ad inquiry testing, using coupons incorporated into ads. The coupons could be cut out and returned for product samples, information booklets or special premiums. Coupons were keyed by the post-office box or room number to which they were addressed, letting the advertiser know which magazines or newspapers produced which responses. Researchers tabulated the returned coupons to evaluate the effectiveness of the ads. This allowed comparisons of identical ads in different magazines or newspapers and comparisons of ads with different copy or artwork in the same vehicle.
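
The keyed-coupon tabulation described above can be illustrated with a minimal sketch. The key labels, vehicle names and returned coupons below are hypothetical and are not drawn from any historical campaign; the function name `tabulate_inquiries` is likewise an assumption made for illustration.

```python
from collections import Counter

# Hypothetical key table: each ad placement carries its own return-address key.
KEY_TO_VEHICLE = {
    "Dept. 11": "Magazine A",
    "Dept. 12": "Magazine B",
    "Dept. 13": "Newspaper C",
}


def tabulate_inquiries(returned_keys, key_to_vehicle):
    """Count returned coupons per media vehicle.

    Coupons whose key is not in the table are counted under 'unkeyed'.
    """
    return Counter(key_to_vehicle.get(key, "unkeyed") for key in returned_keys)


# Hypothetical batch of returned coupons, identified by the key on each one.
returns = ["Dept. 11", "Dept. 12", "Dept. 11", "Dept. 13", "Dept. 11", "Dept. 99"]

tally = tabulate_inquiries(returns, KEY_TO_VEHICLE)
for vehicle, count in tally.most_common():
    print(f"{vehicle}: {count}")
```

Running the same identical ad under a different key in each vehicle compares vehicles; running different copy under different keys in one vehicle compares ads.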

Recognition and recall

Recognition tests show an ad to people, then ask if they remember having seen it before. Recognition measures memory traces left under typically low-involvement processing. Recognition tests produce results similar to those of other measures of advertising effectiveness (e.g., ad inquiries). Problems include the expense of individual interviews and the tendency for people to report seeing ads they did not see.

A well-known print-ad recognition test was developed by pioneering advertising researcher Daniel Starch in 1922 and it has been used by his organization, Starch Continuing Readership Research Program, since February 1932 in the U.S. and since 1949 in Canada.

In a subject's home, the interviewer goes through a magazine issue page by page, asking the subject about each ad being studied. To determine whether the subject noted the ad while reading, the interviewer asks, "Did you see any part of this advertisement?"

If the answer is yes, the subject is asked to indicate which parts of the ad were processed. For each ad, three scores are calculated: (1) noted (the percentage of readers who recognize the advertisement as one they previously saw in that magazine issue), which measures an ad's attention-getting ability; (2) associated (the percentage of readers who saw or read any part of the advertisement that clearly indicated the brand advertised), which indicates the level of brand processing; and (3) read most (the percentage of readers who read half or more of the ad's written material), which indicates reader involvement.
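
As a rough illustration of how the three scores are derived, the sketch below computes each one as a percentage of all readers interviewed. The `ReaderInterview` record, the function name and the sample data are hypothetical; the actual Starch procedure is not specified here beyond the definitions above.

```python
from dataclasses import dataclass


@dataclass
class ReaderInterview:
    """One reader's answers for a single ad (hypothetical record layout)."""
    noted: bool       # recognized the ad as seen in the issue
    associated: bool  # saw or read a part that clearly indicated the brand
    read_most: bool   # read half or more of the written material


def starch_scores(interviews):
    """Return the noted, associated and read-most percentages for one ad."""
    total = len(interviews)
    if total == 0:
        raise ValueError("at least one interview is required")

    def pct(field):
        return 100.0 * sum(getattr(i, field) for i in interviews) / total

    return {
        "noted": pct("noted"),
        "associated": pct("associated"),
        "read_most": pct("read_most"),
    }


# Hypothetical sample of four readers.
sample = [
    ReaderInterview(noted=True, associated=True, read_most=True),
    ReaderInterview(noted=True, associated=True, read_most=False),
    ReaderInterview(noted=True, associated=False, read_most=False),
    ReaderInterview(noted=False, associated=False, read_most=False),
]
print(starch_scores(sample))
```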

Despite the expense of personal interviews, the cost of Starch's recognition test is kept down by syndicating results to many corporate and ad agency subscribers.

Unlike Roper Starch, which focuses on the extent to which an ad is noticed, the other most commonly used ad post-testing service, Gallup & Robinson, measures recall. Recall requires mental reproduction of the ad, while recognition is awareness of having previously seen it. Recall measures ad message penetration and the correctness of the impressions communicated. However, studies have repeatedly failed to link recall with measures of persuasion or sales.

Day-after recall (DAR) is the percentage of people who recall something specific about an ad (e.g., sales message or a visual) the day following exposure. DAR was developed by George Gallup in the early 1940s.

In one TV spot DAR test, the Burke Day-After-Recall Test, interviewers make thousands of random phone calls the evening after a commercial appears on a prime-time network program, continuing until they have contacted about 200 people who were watching the program when the commercial appeared. Interviewers ask the subjects if they remember any commercials for the product category in question (i.e., unaided recall). If the subjects remember the category but do not identify the brand in question, the interviewers ask if they remember seeing a commercial for that brand (i.e., aided recall). Subjects are then asked what the commercial said about the brand, what it showed, what it looked like and what the main ideas were.
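
The arithmetic behind a recall score can be sketched as follows. This is a simplified illustration, not the Burke scoring procedure: it assumes each contacted viewer has already been coded as "unaided", "aided" or "none", and it ignores the follow-up questions about what the commercial said and showed. The counts are hypothetical.

```python
from collections import Counter


def recall_percentages(outcomes):
    """Return unaided, aided and combined recall as percentages of viewers.

    `outcomes` holds one label per contacted program viewer:
    'unaided', 'aided' or 'none' (a hypothetical coding scheme).
    """
    total = len(outcomes)
    if total == 0:
        raise ValueError("no viewers contacted")
    counts = Counter(outcomes)
    unaided = 100.0 * counts["unaided"] / total
    aided = 100.0 * counts["aided"] / total
    return {"unaided": unaided, "aided": aided, "combined": unaided + aided}


# Hypothetical results for 200 viewers of the program.
viewers = ["unaided"] * 30 + ["aided"] * 20 + ["none"] * 150
print(recall_percentages(viewers))
```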

Physiological measures

Physiological pretest measures record consumers' physical reactions to ad messages. Since the reactions are involuntary, they are unlikely to be biased by subjects attempting to behave or answer in socially acceptable ways.

One disadvantage is that physiological measures can determine neither whether a response was positive or negative nor whether the consumer learned any brand information. Another disadvantage is the unnatural environment of ad exposure; many tests involve the attachment of instruments to subjects in a laboratory.

Eye-tracking systems can be used to monitor eye movements across spots or print ads. This is done either with a beam of infrared light that reflects off the subject's eye or with goggles connected to a computer that records the wearer's eye movements, pupil dilation and the amount of time spent viewing different parts of an ad. The resulting data indicate whether a subject is processing the elements of the ad in the order the advertiser intended. The data can be misleading, however, as a subject's eye may linger because of comprehension difficulty on the one hand or rapt attention on the other.

Galvanic skin response (GSR) is a tool that was in vogue during the 1940s and '50s. GSR measures minute changes in perspiration or electrical resistance of the skin, which indicate arousal while a subject views advertisements.

Pupil dilation response (PDR), a method popular in the 1960s, tracks changes in pupil size, an indication of the amount of information processed while viewing an ad or commercial. More recent research has discounted the notion that PDR measures emotional response (i.e., the greater the dilation, the more positive the response).

Voice response analysis measures subjects' vocal inflections as they discuss an ad. Subjects are asked to respond to a set of ads. Responses are recorded and analyzed by computer. Deviations from a flat response indicate arousal or excitement.

With conjugately programmed analysis of advertising, subjects operate a foot or hand device controlling audio and video TV signal intensity. Subjects must exert effort to sustain the signals, which decay in a preprogrammed pattern. Exertion indicates attention and interest.

Electroencephalographic (EEG) data can be collected by placing electrodes on the front, back, left and right of a subject's scalp. During ad exposure, EEG data from each location are recorded. Analysis of the frequency and amplitude of the recorded impulses is used to determine the ability of the whole ad and its components to attract attention.

The tachistoscope, basically a slide projector with controlled presentation time and illumination, assesses an advertisement's communication speed. Faster ad recognition is correlated with higher readership.

Binocular rivalry testing presents competing advertising stimuli (e.g., adjacent billboards, packages or ads) simultaneously, one to each eye. Illumination and presentation time can be controlled. When two stimuli are given an equal chance to dominate awareness, the one with more impact should predominate.

Persuasion tests

Persuasion tests typically use before/after designs. People from the target market are recruited and their pre-exposure brand attitudes are measured. They are then exposed to the test ad. Following exposure, their attitudes are measured again to gauge the effect of the ad on brand attitudes.
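
The before/after comparison can be sketched as a mean shift in paired attitude ratings. The 1-to-7 rating scale, the function name and the sample ratings are assumptions for illustration; commercial persuasion tests may use different scales and scoring rules.

```python
from statistics import mean


def attitude_shift(pre, post):
    """Return the mean change in brand attitude across paired subjects.

    `pre` and `post` hold each subject's rating before and after exposure,
    in the same order. A positive result means attitudes improved.
    """
    if len(pre) != len(post):
        raise ValueError("pre and post ratings must be paired")
    if not pre:
        raise ValueError("at least one subject is required")
    return mean(after - before for before, after in zip(pre, post))


# Hypothetical ratings on a 1-to-7 scale for four subjects.
pre_ratings = [3, 4, 2, 5]
post_ratings = [4, 4, 3, 6]
print(attitude_shift(pre_ratings, post_ratings))
```

Pairing each subject's two ratings, rather than comparing two separate group averages, keeps individual differences in baseline attitude from obscuring the effect of the ad.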

Video Storyboard Tests evaluate rough or finished print ads in a mock magazine, Looking at Us. Subjects are asked to preview a pilot issue of this new magazine. Individual interviews are conducted in shopping malls. Subjects rate ads on persuasion, product uniqueness, believability, competitive strength and likability.

Persuasion tests are fairly expensive, averaging $11,000 to $15,000. Hard-to-find samples and more realistic exposure increase costs. To control costs, subjects are usually asked to evaluate four to six ads in different product categories during the same test.

