This is the abstract of a book that will be accorded multiple book review in Behavioral and Brain Sciences (Copyright 1997: Cambridge University Press) and is shortly to be circulated for Multiple Peer Review. This preprint of the Precis is for inspection only, to help prospective reviewers decide whether or not they wish to review the book. Please do not prepare a review unless you have received the invitation, instructions and deadline information. It would be helpful if you let us know whether you already have the book or would require a copy.
For information on becoming a reviewer or commentator on this or other BBS target articles, write to: bbs@soton.ac.uk
For information about subscribing or purchasing offprints of the published version, with commentaries and author's response, write to: journals_subscriptions@cup.org (North America) or journals_marketing@cup.cam.ac.uk (All other countries).


Précis of "STATISTICAL SIGNIFICANCE: RATIONALE, VALIDITY AND UTILITY" London: Sage 1996

Siu L. Chow

Department of Psychology
University of Regina
Regina
Saskatchewan
CANADA S4S 0A2
chowsl@leroy.cc.uregina.ca

Keywords

Bayes' rule; conditional probability; confidence interval; deduction; effect size; experimental design; hypothesis testing; induction; likelihood ratio; power analysis; statistical inference

Abstract

The null-hypothesis significance-test procedure (NHSTP) is defended in the context of the theory-corroboration experiment, as well as the following contrasts: (a) substantive hypotheses versus statistical hypotheses, (b) theory corroboration versus statistical hypothesis testing, (c) theoretical inference versus statistical decision, (d) experiments versus nonexperimental studies, and (e) theory corroboration versus treatment assessment. The null hypothesis can be true because it is the hypothesis that errors are randomly distributed in data. Moreover, the null hypothesis is never used as a categorical proposition. Statistical significance means only that chance influences can be excluded as an explanation of data; it does not identify the nonchance factor responsible. The experimental conclusion is drawn with the inductive principle underlying the experimental design. A chain of deductive arguments gives rise to the theoretical conclusion via the experimental conclusion. The anomalous relationship between statistical significance and the effect size often used to criticize NHSTP is more apparent than real. The absolute size of the effect is not an index of evidential support for the substantive hypothesis. Nor is the effect size, by itself, informative as to the practical importance of the research result. Being a conditional probability, statistical power cannot be the a priori probability of statistical significance. The validity of statistical power is debatable because statistical significance is determined with a single sampling distribution of the test statistic based on H0, whereas it takes two distributions to represent statistical power or effect size. Sample size should not be determined in the mechanical manner envisaged in power analysis. It is inappropriate to criticize NHSTP for nonstatistical reasons. 
At the same time, neither effect size nor confidence interval estimate nor posterior probability can be used to exclude chance as an explanation of data. Nor can any of them fulfill the nonstatistical functions expected of them by critics.


Précis of 'Statistical Significance: Rationale, Validity and Utility'

This précis of Statistical Significance: Rationale, Validity and Utility (Chow, 1996) begins with a description of the main themes of its eight chapters. As criticisms of the null-hypothesis significance-test procedure (NHSTP) are answered in the context of the theory-corroboration experiment, the rationale of theory corroboration and the logical foundation of experimentation are described after a description of NHSTP itself. It is argued that NHSTP can (and should) be defended when some conceptual or metatheoretical distinctions are made. 'Theory' and 'hypothesis' will be used interchangeably in subsequent discussion even though the former has a more grandiose connotation.

To begin with, as the statistical hypothesis is not the substantive hypothesis (Meehl, 1978), to corroborate a substantive hypothesis is more than testing a statistical hypothesis. Similarly, drawing a theoretical conclusion is more than deciding whether or not the result is statistically significant (Tukey, 1960). It further follows that research data and conclusions are not (and should not be) accepted or rejected on the mere basis of statistical significance. Some criticisms of NHSTP seem persuasive when these distinctions are not made. Other criticisms of NHSTP are based on criteria imported from domains outside statistics. A case will be made that the dissatisfaction with NHSTP stems from attempts to use it to fulfill functions that belong to the theory-corroboration or treatment-assessment process. The alternative numerical indices (viz., effect size, confidence interval estimate, and statistical power) proposed by critics of NHSTP (henceforth referred to as critics) cannot fulfill these nonstatistical functions.

1. An Overview of 'Statistical Significance'

'Statistical Significance' begins by recounting the commonly known criticisms of NHSTP in Chapter 1. Also described is the methodological paradox that psychologists may inadvertently find support for weaker theories when they improve their research methods (Meehl, 1967). The basic structure and rationale of NHSTP are illustrated with a completely randomized 1-factor, 2-level quasi-experiment in Chapter 2. It is shown that the null hypothesis can be true, particularly in experimental studies with manipulated variables. Also defended is the hybrid nature of NHSTP.

To distinguish between a substantive and a statistical hypothesis, the quartet of hypotheses associated with the to-be-studied phenomenon in the theory-corroboration experiment is introduced in Chapter 3. It is shown that the null hypothesis appears twice in NHSTP, once as the consequent and once as the antecedent of two conditional propositions. That statistical hypothesis testing is not theory corroboration is seen from the role statistical significance plays in the chain of deductive reasoning discussed in Chapter 4. The outcome of NHSTP is to supply the minor premise for the innermost of the series of three embedding conditional syllogisms.

Two meanings of 'effect' are identified in Chapter 5. The anomalous relationship between statistical significance and effect size is more apparent than real because, in terms of the technical meaning of 'effect,' the effect size is not indicative of the amount of evidential support for the substantive hypothesis offered by data. Nor is the effect size, by itself, informative about the practical importance of the research result. Some conceptual difficulties with power analysis are identified in Chapter 6. Being a conditional probability, statistical power cannot be the a priori probability of obtaining statistical significance. Some of the issues raised by power analysts are concerns about the stability of the data. It is argued that the stability issue is neither a numerical nor a mechanical one.

The methodological assumptions underlying Bayesian statistics are considered in Chapter 7. The applicability of the Bayesian approach is questioned because the prototype of empirical research congenial to the Bayesian argument is not typical of psychological research, particularly the theory-corroboration kind. Experimental data can be defended in a relativistic milieu. The main arguments in defense of NHSTP are summarized in Chapter 8 with reference to a set of questions suggested by criticisms of NHSTP.

2. Criticisms of NHSTP

NHSTP has been criticized since the 1960s (Morrison & Henkel, 1970). The same litany of criticisms of NHSTP is repeated periodically by various critics, as was noted recently by Thompson (1996). Some of the commonly known difficulties of relying on NHSTP are that (a) statistical significance may be due to the fortuitous choice of the sample size or the α level, (b) the null hypothesis is never true, (c) nothing can be learned from statistical significance about the inverse probability of the hypothesis (i.e., the probability that the hypothesis is true, given the data), (d) the binary nature of NHSTP is antithetical to the fact that knowledge advances in an incremental manner, (e) statistical significance is not informative about the values of parameters, (f) the Type II error is unjustifiably neglected, and (g) nothing about the practical impact of the research result can be learned from its statistical significance.

Critics find it puzzling that psychologists persist in using NHSTP. In their view, this state of affairs indicates that NHSTP users suffer from distorted statistical intuitions and conceptual confusion (Gigerenzer, 1993). However, the resiliency of NHSTP is warranted. It can be shown that the criticisms of NHSTP are debatable. The frame of reference used in the present defence of NHSTP is suggested by Meehl (1990) and Cohen (1994), but they restrict their criticisms of NHSTP to non-experimental studies. Meehl (1967) adds that his criticisms are more applicable to experiments using subject variables (e.g., sex, race, educational level, etc.) than to those using manipulated variables (e.g., stimulus duration, method of training, etc.). These caveats raise two interesting questions:

Why should NHSTP be more problematic in the case of subject-variable experiments than manipulated-variable experiments? [Q1]

What renders NHSTP more satisfactory in an experiment than a non-experiment? [Q2]

Questions [Q1] and [Q2] suggest that many of the criticisms of NHSTP are not statistical in nature. The real issue is whether or not the research result is brought about by procedural artifacts or confounding variables. That is, criticisms of NHSTP are actually concerns about inductive conclusion validity (see Campbell & Stanley, 1963; Chow, 1992; Cook & Campbell, 1979).

3. The Quartet of Hypotheses Underlying the Theory-corroboration Experiment

In view of Questions [Q1] and [Q2], it may be instructive to reconsider the criticisms of NHSTP in the context of the theory-corroboration experiment. Moreover, some hitherto neglected distinctions may be seen more readily when such a frame of reference is adopted. For such an end, consider first the quartet of hypotheses implicated in the theory-corroboration experiment with reference to Table 1. (Ignore the entries in italics for the moment, i.e., Propositions [P1.1'] through [P1.5'].)

Table 1. The logical relations among the to-be-explained phenomenon, theory, research hypothesis, experimental hypothesis and statistical hypotheses (alternative and null) in a theory-corroboration experiment

Level of Discourse | What Is Said At The Level Concerned
To-be-explained phenomenon | The linguistic competence of native speakers of English
Substantive Hypothesis | The linguistic competence of native speakers of English is an analog of the transformational grammar. [P1.1]
Complement of Theory | The linguistic competence of a native speaker of English is not an analog of the transformational grammar. [P1.1']
Research Hypothesis | If [P1.1], then it is more difficult to process negative sentences than kernel sentences. [P1.2]
Complement of Research Hypothesis | If not [P1.1], then there is no difference in difficulty processing negative and kernel sentences. [P1.2']
Experimental Hypothesis | If the consequent of [P1.2], then it is more difficult to remember extra words after a negative sentence than a kernel sentence. [P1.3]
Complement of Experimental Hypothesis | If not the consequent of [P1.2], then it is equally difficult to remember extra words after a negative and a kernel sentence. [P1.3']
Statistical Alternative Hypothesis | If the consequent of [P1.3], then H1.* [P1.4]
Statistical Null Hypothesis | If not the consequent of [P1.3], then H0. [P1.4']
Sampling Distribution of H1 | If H1, then the probability associated with a difference between kernel and negative sentences as extreme as 1.729 standard error (t, df = 19) units from an unknown mean difference is not known. [P1.5]
Sampling Distribution of H0 | If H0, then the probability associated with a difference between kernel and negative sentences as extreme as 1.729 standard error (t, df = 19) units from a mean difference of zero is 0.05 in the long run. [P1.5']

*H1 = mean of extra-sentence words recalled after negative sentences < mean of extra-sentence words recalled after kernel sentences.

H0 = mean of extra-sentence words recalled after negative sentences ≥ mean of extra-sentence words recalled after kernel sentences.

Consider the phenomenon of linguistic competence that native speakers of English can understand and generate an infinite number of grammatical utterances. A hypothesis that has been used to explain this phenomenon is Miller's (1962) rendition of Chomsky's (1957) transformational grammar (see [P1.1] in Table 1). This psychological analog of the transformational grammar is a substantive hypothesis, and it is an explanatory theory.

Many theoretical implications follow from the hypothesis that transformational grammar is psychologically real. One such implication is that non-kernel sentences (e.g., negative sentences) are more difficult to process than kernel sentences. Specifically, while the kernel sentence is generated with the phrase-structure rules, a negative sentence requires the additional step of applying a negative transformation to the kernel sentence. The relationship between the substantive hypothesis and the implication in question is represented by [P1.2] in Table 1. The consequent of the conditional proposition, [P1.2], is the research hypothesis. However, in such a form, the research hypothesis is not well-defined enough for experimentation. For example, it is necessary to specify the nature of the processing involved.

The problem of vagueness with [P1.2] is resolved by stipulating (a) a well-defined experimental task in a specific setting, and (b) a dependent variable whose identity is independent of the substantive hypothesis. A simplified version of Savin and Perchonock's (1965) task may be used to illustrate the solution. Subjects are presented with 8 words after being shown either a kernel or a negative sentence on any trial. Suppose further that the repeated-measures design is used. That is, the same subjects receive both types of sentences in the course of the experiment.

The subjects are first to recall the sentence verbatim and then to recall as many of the 8 extra words as possible. In the context of this experimental situation and of the auxiliary assumption that the short-term store has a limited capacity (Miller, 1956), an implication of the consequent of [P1.2] is that it is more difficult to remember extra words after a negative sentence than a kernel sentence. This implication of the research hypothesis is the experimental hypothesis, which appears as the consequent of [P1.3] in Table 1.

As the experimental hypothesis is not amenable to statistical analysis in its present form, it is necessary to derive its implication at the statistical level. Specifically, the implication is that the mean of extra-sentence words recalled after negative sentences is smaller than that after kernel sentences. This implication is more commonly known as the statistical alternative hypothesis (H1), and it is the consequent of [P1.4].

Consider the logical complement of H1 in Table 1. It is stated that the mean of extra-sentence words recalled after negative sentences is equal to or larger than that after kernel sentences (see the consequent of [P1.4'] in Table 1). This logical complement of H1 is the statistical null hypothesis (H0). Given that whatever is true under the 'larger than' component of H0 is subsumed under the 'equal to' component, the 'larger than' component serves no further purpose in the present discussion.

That this appeal to H0 is neither contrived nor arbitrary may be seen from the entries in italics in Table 1. The steps of derivation of [P1.3'] from [P1.1'] are the same as those implicated in deriving [P1.3] from [P1.1]. Hence, [P1.3'] is not contrived if [P1.1'] is not an arbitrary assertion. Being the logical complement of [P1.1], [P1.1'] is not a whimsical statement. In other words, H0 is not as arbitrary as it has been characterized (see, e.g., Fisher, 1959; Rozeboom, 1960; Thompson, 1996).

The null hypothesis has two utilities. First, it is used to specify the sampling distribution of differences required for the test of significance (see [P1.5']). Second, a decision about H1 may be made through making a decision about H0 because these two statistical hypotheses are mutually exclusive and exhaustive (see the 'H0, Data and Chance Influences' discussion in Section 12 for an explication).

In sum, underlying the theory-corroboration experiment is a quartet of hypotheses, namely, the substantive, research, experimental, and statistical alternative hypotheses. It can be seen that neither H0 nor H1 is the substantive, research or experimental hypothesis. Hence, it becomes necessary to distinguish between testing a substantive hypothesis at the conceptual level with empirical data (i.e., theory corroboration) and testing a statistical hypothesis (viz., statistical hypothesis testing). At the same time, it is noted in [P1.5] in Table 1 that H1 cannot be used to specify the to-be-used sampling distribution of differences that underlies the t test because the magnitude of the difference between the means of the kernel and negative sentences is not specified in H1. The complement of H1 (i.e., H0) is used instead (hence, [P1.5'] in Table 1). This invites a closer examination of NHSTP, particularly in view of the generally accepted verdict that H0 is never true.
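The long-run probability of 0.05 stated in [P1.5'] can be checked numerically. What follows is a minimal sketch in Python and is not part of the book; the function names (t_pdf, lower_tail) and the use of Simpson's rule are illustrative assumptions. It integrates the density of the t distribution with 19 degrees of freedom to obtain the probability, given H0, of a t value at or below -1.729.

```python
import math

def t_pdf(x, df):
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def lower_tail(t, df, steps=2000):
    """P(T <= t) for t <= 0: half the total area minus the area on [t, 0].

    The area on [t, 0] is approximated with Simpson's rule (steps must be even).
    """
    h = (0.0 - t) / steps
    total = t_pdf(t, df) + t_pdf(0.0, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(t + i * h, df)
    return 0.5 - total * h / 3

# Probability, under H0, of a t at least as extreme as the criterion (one tail).
p = lower_tail(-1.729, 19)
print(round(p, 3))
```

No comparable calculation is available for [P1.5], because H1 does not specify the mean difference on which its sampling distribution would be centred.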

4. The Null-hypothesis Significance-test Procedure (NHSTP)

A consideration of how theory corroboration differs from statistical hypothesis testing may begin with a brief recounting of the rationale and procedure of NHSTP. Suppose that Savin and Perchonock's (1965) task is used, and the statistical alternative hypothesis is that fewer words are recalled after recalling negative sentences than kernel sentences. H1 and H0 are commonly (but misleadingly) written as follows under such circumstances:

(a) H1: μ_negative < μ_kernel

(b) H0: μ_negative ≥ μ_kernel

Suppose further that the repeated-measures design is used, and there are 20 subjects. This experiment will be referred to as the 'kernel-negative experiment' in subsequent discussion. The usual α level is set at 0.05. Strictly speaking, the test is whether or not the associated probability, p, of the calculated t is smaller than 0.05. By 'associated probability' is meant 'the probability of [the calculated t] plus the probabilities of all more extreme possible values' under H0 (Siegel, 1956, p. 11). In actual practice, the t (dependent sample in this example) is calculated, and compared to the critical value of t (i.e., -1.729, df = 19, α = .05) for this particular one-tailed test.

This critical value of -1.729 is given by the appropriate t distribution, which is the standardization of the sampling distribution of differences (Siegel, 1956). The binary decision is to choose between 'calculated t ≤ -1.729' and 'calculated t > -1.729.' The outcome of this binary decision determines the choice between the two modus ponens arguments depicted in the two top panels in Table 2. If the calculated t is -1.729 or smaller, the decision is that the result is significant (i.e., the 'not H0' conclusion in the top left panel of Table 2). If the calculated t is larger than the critical value, it is decided that the result is not significant (i.e., the 'H0' conclusion in the top right panel of Table 2).
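The binary decision just described may be sketched in Python as follows. The recall scores are invented for illustration and are not data from any study; the function names (paired_t, decide) are likewise assumptions. The sketch computes the dependent-sample t for 20 hypothetical subjects and compares it with the criterion of -1.729.

```python
import math
import statistics

# Hypothetical numbers of extra words recalled (out of 8) by 20 subjects.
# These values are invented for illustration only.
kernel = [5, 6, 5, 4, 6, 5, 7, 5, 4, 6, 5, 6, 5, 4, 6, 5, 6, 5, 5, 6]
negative = [4, 5, 5, 4, 5, 4, 6, 5, 3, 5, 5, 5, 4, 4, 5, 5, 5, 4, 5, 5]

def paired_t(control, experimental):
    """Dependent-sample t: mean difference divided by its standard error."""
    diffs = [e - c for c, e in zip(control, experimental)]
    n = len(diffs)
    standard_error = statistics.stdev(diffs) / math.sqrt(n)
    return statistics.mean(diffs) / standard_error, n - 1

def decide(t, criterion=-1.729):
    """One-tailed binary decision (criterion for df = 19, alpha = .05)."""
    return "not H0" if t <= criterion else "H0"

t, df = paired_t(kernel, negative)
decision = decide(t)
print(f"t({df}) = {t:.2f}; decision: {decision}")
```

With these invented scores the calculated t falls below the criterion, so the decision is 'not H0'. The decision says nothing about why fewer words are recalled; it only excludes chance as the explanation.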

Table 2: Two conditional syllogisms (upper panel) and the disjunctive syllogism (lower panel) implicated in the null-hypothesis significance testing procedure (NHSTP)

Upper Panel

 | Criterion Exceeded | Criterion Not Exceeded
Major Premise | If calculated t ≤ (criterion = -1.729), then not H0 | If calculated t > (criterion = -1.729), then H0
Minor Premise | t ≤ (criterion = -1.729) [e.g., calculated t = -2.05] | t > (criterion = -1.729) [e.g., calculated t = -1.56]
Conclusion | Not H0 | H0

Lower Panel

 | Statistical Significance Obtained
Major Premise | H1 or H0
Minor Premise | Not H0
Conclusion | Therefore, H1.

It is assumed that H1 and H0 are mutually exclusive and exhaustive (see the 'H0, Data and Chance Influences' discussion in Section 12). Hence, denying H0 leads to accepting H1 by virtue of the disjunctive syllogism depicted in the lower panel of Table 2. The experimental conclusion drawn from a statistically significant result is that fewer words are recalled after recalling negative sentences than kernel sentences.

Of interest is the fact that the experimental conclusion is about the relationship between two variables (viz., sentence type and number of extra words recalled). However, theoretical conclusions go beyond a mere functional relationship between the independent and dependent variables. The theoretical interest is what the nature of the linguistic competence is. This more sophisticated meaning of research data at the theoretical level is not informed by the NHSTP exercise depicted in Table 2. This consideration has not featured in the debate about the validity or utility of NHSTP because discussants have in mind a different type of experiment (a point to be discussed in Section 21, the 'Differences Between the Utilitarian and Theory-corroboration Experiments'). To see how the theoretical meaning is extracted from experimental data, it is necessary to consider what constitutes the theory-corroboration process.

5. The Rationale of the Theory-corroboration Experiment

To corroborate the substantive hypothesis experimentally is to show that the experimental data are consistent with the tenability of the substantive hypothesis. That is, there is 'warranted assertibility' (Manicas & Secord, 1983). This idea suggests that a crucial consideration in theory corroboration is the logical relationship between the substantive hypothesis and the evidential data. Such a consideration requires more than a statistical decision. Also implicated is the judicious application of deductive and inductive logic in different stages of the exercise.

6. The Role of Deductive Logic in the Theory-corroboration Experiment

Table 1 shows that H1 is three implicative steps from the substantive hypothesis. At the same time, there is a chain of deductive reasoning leading from experimental data to the substantive hypothesis via H1, the experimental hypothesis and the research hypothesis. This series of deductive reasoning may be seen more readily if the logical relations among the quartet of hypotheses shown in Table 1 are expressed in the form of a series of three embedding conditional syllogisms, as in Table 3.

Table 3. The series of three embedding syllogisms (in normal font, italics, and boldface, respectively) underlying the theory-corroboration procedure when the null hypothesis is rejected

Major Premise 3 | If [P1.1]¹ in Table 1, then [P3.1].² | [MAJ-3.3]⁷
Major Premise 2 | If [P3.1]², then [P3.2].³ | [MAJ-3.2]⁶
Major Premise 1 | If [P3.2], then H1.⁴ | [MAJ-3.1]⁵
Minor Premise 1 | H1 is true. | [MIN-3.1]
Conclusion 1 | Therefore, [P3.2] is true in the interim (by virtue of experimental controls). | [CON-3.1]
Minor Premise 2 | [P3.2] is true in the interim. | [MIN-3.2]
Conclusion 2 | Therefore, [P3.1] is true in the interim (by virtue of experimental controls). | [CON-3.2]
Minor Premise 3 | [P3.1] is true in the interim. | [MIN-3.3]
Conclusion 3 | Therefore, [P1.1] in Table 1 is true in the interim (by virtue of experimental controls). | [CON-3.3]

¹ [P1.1] in Table 1 = The linguistic competence of a native speaker of English is an analog of the transformational grammar.

² [P3.1] = It is more difficult to process negative sentences than kernel sentences (i.e., the consequent of [P1.2] in Table 1).

³ [P3.2] = It is more difficult to remember extra words after a negative sentence than a kernel sentence (i.e., the consequent of [P1.3] in Table 1).

⁴ H1 = mean of extra-sentence words recalled after negative sentences < mean of extra-sentence words recalled after kernel sentences.

⁵ [MAJ-3.1] is [P1.4] in Table 1.

⁶ [MAJ-3.2] is [P1.3] in Table 1.

⁷ [MAJ-3.3] is [P1.2] in Table 1.

The syllogisms in Table 3 are called 'conditional syllogisms' because their major premises are conditional propositions (viz., [MAJ-3.1], [MAJ-3.2] and [MAJ-3.3]). The first (or the innermost) syllogism is made up of [MAJ-3.1], [MIN-3.1] and [CON-3.1]. The second syllogism consists of [MAJ-3.2], [MIN-3.2], and [CON-3.2]. [MAJ-3.3], [MIN-3.3] and [CON-3.3] collectively make up the last syllogism.

The minor premise of the first syllogism (i.e., [MIN-3.1]) is the outcome of NHSTP. The example depicted is one in which the data permit the rejection of H0. To have established statistical significance is to accept that H1 is true. To assert that H1 is true in the first syllogism is to affirm the consequent of the conditional proposition, [MAJ-3.1]. The tentative conclusion is drawn that the antecedent of [MAJ-3.1] is true. This conclusion is used as the minor premise of the second syllogism to affirm the consequent of [MAJ-3.2]. This leads to the tentative conclusion that the antecedent of [MAJ-3.2] is true. Lastly, the conclusion of the second syllogism serves as the minor premise of the third syllogism. The antecedent of [MAJ-3.3] is concluded true tentatively when its consequent is affirmed by the antecedent of [MAJ-3.2].

7. The Modus Tollens and Affirming the Consequent Asymmetry

Note that all three conclusions in Table 3 (i.e., [CON-3.1], [CON-3.2] and [CON-3.3]) are qualified with the caveat, 'in the interim (by virtue of experimental controls).' The 'in the interim' qualification is necessary because there are alternative substantive hypotheses at the conceptual level (see Section 32, the 'Alternative Substantive Hypothesis versus Statistical Alternative Hypothesis,' for an elaboration). The 'by virtue of experimental controls' qualification is necessary because deductive logic does not permit accepting the antecedent of a conditional proposition when its consequent is affirmed (Copi, 1982). Hence, the propriety of accepting the antecedents of [MAJ-3.1], [MAJ-3.2] and [MAJ-3.3] in Table 3 has to be warranted by experimental controls, as discussed in Section 8, 'Induction, Experimental Design and Controls.'

Suppose that the outcome of NHSTP does not permit rejecting H0. The chain of reasoning is shown in Table 4, in which the propositions in Table 3 are given a different set of numbers for identification purposes. For example, [MAJ-3.1] in Table 3 becomes [MAJ-4.1] in Table 4.

Table 4. The series of 3 embedding syllogisms (in roman font, italics, and boldface, respectively) underlying the theory-corroboration procedure when the null hypothesis is not rejected

Major Premise 3 | If [P1.1]¹ in Table 1, then [P4.1].² | [MAJ-4.3]⁷
Major Premise 2 | If [P4.1], then [P4.2].³ | [MAJ-4.2]⁶
Major Premise 1 | If [P4.2], then H1.⁴ | [MAJ-4.1]⁵
Minor Premise 1 | H1 is not true. | [MIN-4.1]
Conclusion 1 | Therefore, [P4.2] is not true. | [CON-4.1]
Minor Premise 2 | [P4.2] is not true. | [MIN-4.2]
Conclusion 2 | Therefore, [P4.1] is not true. | [CON-4.2]
Minor Premise 3 | [P4.1] is not true. | [MIN-4.3]
Conclusion 3 | Therefore, [P1.1] in Table 1 is not true. | [CON-4.3]

¹ [P1.1] in Table 1 = The linguistic competence of a native speaker of English is an analog of the transformational grammar.

² [P4.1] = It is more difficult to process negative sentences than kernel sentences (i.e., the consequent of [P1.2] in Table 1).

³ [P4.2] = It is more difficult to remember extra words after a negative sentence than a kernel sentence (i.e., the consequent of [P1.3] in Table 1).

⁴ H1 = mean of extra-sentence words recalled after negative sentences < mean of extra-sentence words recalled after kernel sentences.

⁵ [MAJ-4.1] is [P1.4] in Table 1.

⁶ [MAJ-4.2] is [P1.3] in Table 1.

⁷ [MAJ-4.3] is [P1.2] in Table 1.

The minor premise of the first conditional syllogism in Table 4, [MIN-4.1], is 'Not-H1.' Hence, the antecedent of [MAJ-4.1] is rejected by modus tollens. The minor premise of the second syllogism, [MIN-4.2], is, in such an event, the denial of the consequent of [MAJ-4.2]. The modus tollens rule leads to the rejection of the antecedent of [MAJ-4.2]. Hence, [MIN-4.3] is the negation of the antecedent of [MAJ-4.2]. Consequently, [MIN-4.3] is the denial of the consequent of [MAJ-4.3]. The third application of the modus tollens rule leads to the rejection of the antecedent of [MAJ-4.3], namely, [P1.1].

Unlike the case of affirming the consequent, modus tollens (i.e., denying the consequent of a conditional proposition) permits the unambiguous rejection of the antecedent of the conditional proposition. The difference between the arguments in Tables 3 and 4 is the asymmetry between modus tollens refutation and affirming the consequent confirmation of theories identified by Meehl (1967, 1978). It is noted here that the asymmetry is not brought about by using NHSTP. Instead, it is the consequence of the deductive reasoning implicated in corroborating theories. Hence, it is necessary to consider why affirming the consequent of [MAJ-3.1] (i.e., rejecting H0) does not guarantee the truth of its antecedent.
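The asymmetry can be verified by enumerating truth values. The following Python sketch is an illustration added here, not something taken from the book; the helper names (implies, is_valid) are assumptions. An argument form is treated as valid when no assignment of truth values makes every premise true and the conclusion false.

```python
from itertools import product

def implies(p, q):
    """Material conditional: 'if p, then q'."""
    return (not p) or q

def is_valid(premises, conclusion):
    """True if no truth assignment makes all premises true and the conclusion false."""
    for p, q in product([True, False], repeat=2):
        if all(premise(p, q) for premise in premises) and not conclusion(p, q):
            return False
    return True

# Modus tollens: if p then q; not q; therefore not p (the form used in Table 4).
modus_tollens = is_valid(
    [lambda p, q: implies(p, q), lambda p, q: not q],
    lambda p, q: not p,
)

# Affirming the consequent: if p then q; q; therefore p (the form used in Table 3).
affirming_consequent = is_valid(
    [lambda p, q: implies(p, q), lambda p, q: q],
    lambda p, q: p,
)

# Disjunctive syllogism: p or q; not q; therefore p (lower panel of Table 2).
disjunctive_syllogism = is_valid(
    [lambda p, q: p or q, lambda p, q: not q],
    lambda p, q: p,
)

print(modus_tollens, affirming_consequent, disjunctive_syllogism)
```

Affirming the consequent fails the check because the consequent can be true while the antecedent is false. That is the case experimental controls are meant to rule out.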

8. Induction, Experimental Design and Controls

Boring (1954, 1969) and Campbell (1969; Campbell & Stanley, 1963) pointed out that to consider experimental controls was to consider Mill's (1973) methods of scientific inquiry (with the exception of his method of agreement; see Cohen & Nagel, 1934). That is to say, underlying a valid experimental design is one of Mill's (1973) inductive methods (viz., method of difference, joint method of agreement and difference, method of residue, and method of concomitant variations). This may be illustrated with Table 5, in which is depicted the repeated-measures 1-factor, 2-level design used in the kernel-negative experiment described earlier.

Table 5. The inductive basis of the repeated-measures 1-factor, 2-level design (Method of Difference)

Control Variables: C1 to C6. Extraneous Variables: E1 to En.

Condition | Independent Variable (Sentence-type) | C1 | C2 | C3 | C4 | C5 | C6 | E1 | E2 | ... | En | Dependent Variable
Control | Kernel Sentence | NI | T | I | R | S | C | ER | IT | ... | M | Number of extra words recalled
Experimental | Negative Sentence | NI | T | I | R | S | C | ER | IT | ... | M | Number of extra words recalled

C1 = Normal intonation (NI)
C2 = Task presentation via recorded tape (T)
C3 = Interval between end of sentence and beginning of words (I)
C4 = Rate of word presentation; 3/4 second per word (R)
C5 = Structure of sentence; 'Animal' subject, present perfect transitive verb (S)
C6 = Fixed categories of words used in 'extra' words (C)
E1 = Extra-curricular reading (ER)
E2 = Individual interests (IT)
En = Kernel and negative sentences randomly mixed (M)

The design of the kernel-negative experiment is described in Table 5 in a way that reflects the inductive principle of Mill's (1973) method of difference (see Chow, 1992). Suppose that fewer words are recalled after negative sentences than after kernel sentences, and that the difference is statistically significant. The control variables (C1, C2, C3, C4, C5 and C6) can be excluded as explanations of the significant difference because each of them (e.g., C1) is represented by the same value (viz., NI) at both levels of the independent variable. This is one of the 'constancy of condition' meanings of the term 'control' (Boring, 1954, 1969).

The extraneous variables (E1, E2, ... En) may also be excluded because each of them (e.g., E1) is assumed to be represented at the same level (viz., ER). This assumption is justified by the fact that the same subject is tested in both the experimental and control conditions. Consequently, the difference between the 'Kernel' and 'Negative' conditions is rendered unambiguous by the fact that the experimental and control conditions are identical in all aspects but one. The only difference is brought about by the difference between the two levels of the independent variable.
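The logic of the method of difference in Table 5 may be sketched in Python. This is an illustration only: the dictionary keys, the function name differing_variables and the confounded case (where E1 takes the value ER' in one condition) are assumptions introduced here. The sketch lists the variables on which two conditions differ; the design supports an unambiguous conclusion only when the independent variable is the sole entry in that list.

```python
def differing_variables(control, experimental):
    """Return the names of the variables whose values differ between two conditions."""
    return sorted(name for name in control if control[name] != experimental[name])

# The two conditions of Table 5: identical on every control and extraneous variable.
kernel_condition = {
    "sentence_type": "kernel",
    "C1": "NI", "C2": "T", "C3": "I", "C4": "R", "C5": "S", "C6": "C",
    "E1": "ER", "E2": "IT", "En": "M",
}
negative_condition = dict(kernel_condition, sentence_type="negative")

# Hypothetical confounded case: an extraneous variable also differs.
confounded_condition = dict(negative_condition, E1="ER'")

clean = differing_variables(kernel_condition, negative_condition)
confounded = differing_variables(kernel_condition, confounded_condition)
print(clean)       # ['sentence_type']
print(confounded)  # ['E1', 'sentence_type']
```

In the first comparison only the sentence type differs, so a significant difference in recall may be attributed to it. In the second, the difference could equally be due to E1, whatever the outcome of the significance test.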

9. Conflating NHSTP with Theory Corroboration

NHSTP is misunderstood because no distinction is made between the substantive and statistical hypotheses. Specifically, Meehl (1967) notes that there is a tendency to conflate the substantive hypothesis with the statistical hypothesis. This practice seems to be condoned when it is said, "the critical distinction between a statistical hypothesis and a substantive theory often breaks down. To perform a significance test a substantive theory is not needed at all" (Oakes, 1986, p. 42, emphasis in italics added).

What is said in the italicized sentence is true, but not because the distinction between the substantive and statistical hypotheses is unimportant or not real. It is true simply because testing a hypothesis at the statistical level (see Table 2) and corroborating a substantive hypothesis with empirical data at the conceptual level (viz., Table 3) are radically different exercises. This issue will be dealt with further in the 'Differences Between the Utilitarian and Theory-corroboration Experiments' discussion in Section 21.

10. Answers to Questions [Q1] and [Q2]

It may be concluded from the foregoing argument that, to the extent that all recognized control variables and procedures are included in the experiment, the statistically significant result may be attributed to the independent variable (Campbell, 1969). The experiment is said to have inductive conclusion validity under such circumstances (Chow, 1987a, 1992). For this reason, the propriety of accepting the antecedent of a conditional proposition by affirming its consequent in Table 3 is justified with the 'in the interim' proviso.

The answer to Question [Q1] may be seen readily from Table 6. Suppose that the kernel-negative experiment is conducted to assess the differential linguistic competence of science students from two disciplines. Neither the repeated-measures nor the completely randomized design can be used. Hence, different selected groups of subjects have to be assigned to the two levels of the independent variable, Faculty of Study. While it is possible to maintain the constancy of condition in the case of some control variables, such is not the case with the extraneous variables. An extraneous variable (e.g., E1) may be represented at different levels in the experimental and control conditions (viz., ER and ER', respectively) as a result of some fundamental differences between students of the two disciplines.

Table 6. Violation of the formal requirement of Method of Difference when a subject variable is used


     Subject Variable       Control Variables        Extraneous Variables   Dependent Variable
     (Faculty of Study)     C1  C2  C3  C4  C5  C6   E1  E2  ...  En

C    Biological Sciences    NI  T   I   R   S   C    ER  IT' ...  M''       Number of extra words recalled
E    Physical Sciences      NI  T   I   R   S   C    ER' IT  ...  M'        Number of extra words recalled

C1 = Normal intonation (NI)
C2 = Task presentation via recorded tape (T)
C3 = Interval between end of sentence and beginning of words (I)
C4 = Rate of word presentation; 3/4 second per word (R)
C5 = Structure of sentence; 'Animal' subject, present perfect transitive verb (S)
C6 = Fixed categories of words used in 'extra' words (C)
E1 = Extra-curricular reading (ER or ER')
E2 = Individual interests (IT or IT')
En = Kernel and negative sentences randomly mixed (M' or M'')

In short, the design of an empirical study is a description of how the data collection conditions are arranged. The empirical study is an experiment if the arrangement of its data collection conditions satisfies the formal requirement of one of Mill's (1973) inductive principles. The formal requirement makes it possible to exclude as explanations those factors that have been incorporated in the design as control variables or procedures. Various aspects of the formal requirement give rise to the three technical meanings of 'control.' They are (a) a valid comparison baseline, (b) constancy of conditions, and (c) provisions for excluding procedural artifacts (Boring, 1954, 1969; Chow, 1987a, 1992). Data interpretation becomes unambiguous to the extent that all recognized alternative interpretations are excluded by the judicious application of experimental controls (Campbell, 1969).

An empirical study is a quasi-experiment when its design satisfies only some parts of the formal requirement. A non-experimental study (e.g., the correlational study) is one in which there is no formal provision for satisfying the formal requirement. Hence, there is no provision for excluding alternative interpretations of the result in non-experimental studies. Given the fact that experimental controls serve to exclude explanations, it can be seen that data from quasi-experimental and non-experimental studies are more ambiguous than experimental data. This is the answer to Question [Q2]. The comparison between Tables 5 and 6 provides the answer to Question [Q1]. These answers to Questions [Q1] and [Q2] lead to the realization that some criticisms of NHSTP are motivated by ambiguities in data interpretation. At the same time, a few criticisms arise because the nature of H0 is misunderstood or misrepresented.

11. The Nature of H0

What is clear from the discussion of Tables 2 and 3 is that whether or not the experimental data support the substantive hypothesis is not determined by NHSTP. Supplying the minor premise for the first syllogism in Table 3 or 4 is the only contribution NHSTP has to theory corroboration. The theoretical meaning of the experimental data is conferred by their logical relation with the experimental, research, and substantive hypotheses. Although statistical significance does not confer any theoretical meaning to data, it does have an important function. Specifically, it provides a rational basis for excluding chance influences as an explanation of data. This important (although limited) role may be seen from a closer examination of the statistical null hypothesis, H0.

12. H0, Data and Chance Influences

One way to paraphrase the antecedent of [P1.3'] in Table 1 is to say that the subjects are indifferent to whether the to-be-remembered sentence is a kernel or a negative sentence. Consequently, under such circumstances, any observed difference between the means of the 'Negative' and the 'Kernel' conditions is the result of chance influences (or errors). That is, actual measurements made during data collection may be affected by unintended non-systematic influences (i.e., errors) of various kinds. Consequently, [P1.3'] in Table 1 may be represented as the conditional proposition, [P7.1], in Table 7. By the same token, [P1.4] in Table 1 may be represented by [P7.2] in Table 7.

Table 7. The statistical null hypothesis (H0) and the statistical alternative hypothesis (H1) as components of conditional propositions

Where in Table 1   Conditional Proposition

[P1.4']            If chance, then H0.                                      [P7.1]
[P1.4]             If not chance, then H1.                                  [P7.2]
                   If H0, then the test statistic is distributed as a
                   sampling distribution of the difference whose mean
                   difference is zero.                                      [P7.3]

The representation adopted for H0 and H1 in Table 7 serves three functions. First, it highlights the meaning of the null hypothesis. It is a hypothesis about the influence of non-systematic chance factors on data in the form of distributing the unintended influences randomly between the two conditions. Moreover, the errors are normally distributed with a mean of zero in each condition. Consequently, a statistically significant result will be correctly interpreted to mean only that an explanation of the data in terms of chance influences can be excluded with the level of strictness stipulated by the significance level (viz., α).

Second, Table 7 makes explicit the mutually exclusive and exhaustive relationship between H0 and H1. That is, the contrast between H0 and H1 is informed by neither the substantive hypothesis nor the to-be-studied phenomenon. Instead, the contrast is informed by the data-collection procedure. It is a contrast between chance and not chance. That NHSTP is actually mute at the level of the substantive hypothesis may be seen from the fact that, in the event the result is statistically significant, the non-chance factor responsible for the data is not informed by statistical significance.

The third function of the tabular representation of Table 7 is to make explicit the fact that H0 is not used as a categorical proposition. It appears twice; once as the consequent of the conditional proposition [P7.1], and once as the antecedent of the conditional proposition [P7.3]. This state of affairs means that, even if 'H0 is never true' were true, its contribution to the statistical decision process would not be affected because the truth of either [P7.1] or [P7.3] is not determined by the truth value of H0 alone, but by the truth values of both the antecedent and consequent (Copi, 1982). At the same time, it is important to emphasize that H0 can (and should) be true, the common belief to the contrary notwithstanding.

13. 'H0 is never true' Revisited

Consider the antecedent of the conditional proposition, [P1.3'] in Table 1. It says that there is no difference in difficulty in processing negative and kernel sentences. In other words, H0 is a hypothesis about the relationship between two theoretical populations, 'Kernel' and 'Negative' (viz., the hypothesized population of all subjects presented with kernel sentences and that of all subjects presented with negative sentences). In view of the fact that two populations are implicated in H0 (not just one), it is not clear what H0 is about when only one population is acknowledged, as in the statement, 'A null hypothesis is any precise statement about a state of affairs in a population, usually the value of a parameter, frequently zero' (Cohen, 1990, p. 1307, emphasis in italics added).

The assertion, "things get downright ridiculous when H0 is to the effect that the effect size (ES) is 0--that the population mean difference is 0" (Cohen, 1994, p. 1000, emphasis in italics added), is questionable for a different reason. Two theoretical populations are properly recognized in this statement if 'population mean difference' refers to the mean of the sampling distribution of differences. It needs two population distributions to give rise to a sampling distribution of differences. However, it can be shown that it is not ridiculous to have a mean difference of zero for the sampling distribution of differences.

Recall the two theoretical populations, 'Kernel' and 'Negative,' in the kernel-negative experiment. They are procedurally defined populations. Specifically, they are defined in terms of the two levels of the independent variable, Sentence-type. The data-collection situation in experimental psychology can be (and should always be) arranged to ensure that the two procedurally defined populations are identical if the subjects are indeed indifferent to the difference between the two levels of the independent variable. This is effected in different situations by using the repeated-measures design, the matched-pair design, or the completely randomized design.

As an example, consider the repeated-measures design. The two test conditions (viz., presenting kernel sentences and presenting negative sentences) are imposed on the same group of subjects. This group of subjects becomes two hypothetical samples when described in terms of the two respective levels of the independent variable. The two hypothetical samples are identical before being exposed to the experimental manipulation. They remain identical if what is said in the experimental hypothesis is false. Why should the 'Kernel' and 'Negative' populations not have the same mean if the complement of the experimental hypothesis is true? Why is it ridiculous to expect the difference between the 'Kernel' and 'Negative' populations to be zero at the statistical level if the subjects are indifferent to the experimental manipulation? In other words, critics have not taken into account the fact that the null hypothesis is about neither the to-be-studied phenomenon nor some actual substantive populations. The null hypothesis is about the relationship between two or more procedurally defined hypothetical populations.
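The repeated-measures argument can be illustrated with a small simulation. What follows is only a sketch under stated assumptions, not part of the original argument: the number of subjects (20), the normally distributed abilities and errors, their standard deviations, and the random seed are all hypothetical choices. The value 1.729 is the tabled one-tailed critical t at the 0.05 level for df = 19. Every simulated subject is indifferent to sentence type, so any difference between conditions is due to chance alone.

```python
import math
import random
import statistics


def simulate_null_rejection_rate(n_subjects=20, n_experiments=4000,
                                 critical_t=1.729, seed=1):
    """Simulate repeated-measures experiments in which H0 is true by construction."""
    rng = random.Random(seed)
    rejections = 0
    mean_differences = []
    for _ in range(n_experiments):
        diffs = []
        for _ in range(n_subjects):
            # The same subject serves in both conditions (hypothetical ability score).
            ability = rng.gauss(10.0, 2.0)
            # Only non-systematic error separates the two scores.
            kernel = ability + rng.gauss(0.0, 1.0)
            negative = ability + rng.gauss(0.0, 1.0)
            diffs.append(kernel - negative)
        mean_diff = statistics.fmean(diffs)
        std_error = statistics.stdev(diffs) / math.sqrt(n_subjects)
        t = mean_diff / std_error
        if t > critical_t:  # one-tailed test at the 0.05 level, df = 19
            rejections += 1
        mean_differences.append(mean_diff)
    return rejections / n_experiments, statistics.fmean(mean_differences)


rejection_rate, grand_mean_difference = simulate_null_rejection_rate()
print(f"proportion of significant results: {rejection_rate:.3f}")
print(f"mean of the sampling distribution of differences: {grand_mean_difference:.3f}")
```

Under these assumptions the mean of the simulated sampling distribution of differences should lie close to zero, and the proportion of significant results should lie close to the significance level, which is what one would expect if H0 can be true when the data-collection procedure is set up properly.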

It is important to emphasize that the truth of H0 depends on assigning subjects randomly to the experimental and control conditions or using the same subjects in both conditions. This reiteration is necessary in view of a recent attempt to question the assertion, 'H0 is never true,' with the following debatable scenario:

We give a placebo to a control group and [the to-be-tested] drug to the experimental group. We then mix these participants into one group .. (Hagen, 1997, p. 16)

The data collection procedure depicted is unsatisfactory because it does not guarantee that the formal requirement of Mill's (1973) method of difference is met. This example may also be used to make the case that the validity of NHSTP must be assessed in the context of research methods.

In short, H0 can be true. More important, it ought to be true if the data-collection procedure is set up and conducted properly (hence, the importance of Cohen's, 1994, and Meehl's, 1990, caveat identified in Question [Q2]). The assertion, 'H0 is never true,' seems self-evident only when H0 is used as a categorical proposition descriptive of an ill-defined state of affairs. On the contrary, it is actually a statement about how the data are collected, a point also noted by Bakan (1966) and Phillips (1973).

More important, H0 is never used as a categorical proposition. At one level of discourse (viz., [P1.4'] in Table 1), H0 is a description of the data when certain assumptions or conditions are satisfied in the data-collection situation (a point emphasized by Falk & Greenbaum, 1995). At a different level of discourse (i.e., [P1.5'] in Table 1 or [P7.1] in Table 7), H0 is a criterion for rejecting chance influences as an explanation of data. What renders H0 indispensable is that it stipulates the to-be-used sampling distribution of the test statistic required for making the decision about chance influences (see [P7.3] in Table 7 or [P1.5'] in Table 1).

14. The Ambiguity-Anomaly Criticisms of NHSTP

A statistically significant result is considered ambiguous by critics. They also find the relationship between statistical significance and the effect size anomalous. The ambiguity and anomaly stem from the fact that statistical significance may be the fortuitous consequence of having chosen a particular sample size. Consider Studies A and B in Table 8.

Table 8. The putative ambiguity and anomaly of significance tests illustrated with four fictitious studies

Study   μE   μC   Effect size*   Statistical test (e.g., t) significant?   df

A        6    5   0.1            Yes                                       22
B       25   24   0.1            No                                         8
C       17    8   0.9            No                                         8
D        8    2   0.5            Yes                                       22

* J. Cohen's (1987) d

Although the effect size is the same in both Studies A and B, the result is significant in Study A, but not B. At the same time, the sample size is larger in Study A than in Study B. This is the basis of the sentiment shared among critics that statistical significance is assured if a large enough sample is used (see Thompson, 1996, for a recent expression of this view). By the same token, a result may be non-significant because too small a sample is used. This difficulty may be called the sample size-dependence problem.
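The dependence that critics point to can be made concrete with a short sketch. It rests on assumptions that are not part of Table 8, whose values are fictitious: two independent groups of equal size n, a common standard deviation, and an arbitrarily chosen effect size of 0.5. Under those assumptions the test statistic is t = d * sqrt(n/2) with df = 2n - 2, so a fixed effect size yields a larger t as n grows.

```python
import math


def t_from_effect_size(d, n_per_group):
    """t for two independent, equal-sized groups sharing one standard deviation.

    Assumes the effect size d is the mean difference divided by that common
    standard deviation; the standard error of the difference is then
    sd * sqrt(2 / n), which gives t = d * sqrt(n / 2).
    """
    return d * math.sqrt(n_per_group / 2)


hypothetical_d = 0.5  # illustrative value only, not taken from Table 8
for n in (5, 12, 50, 200):
    t = t_from_effect_size(hypothetical_d, n)
    print(f"n per group = {n:3d}, df = {2 * n - 2:3d}, t = {t:.3f}")
```

The sketch shows only the arithmetic behind the sample size-dependence problem. Whether that arithmetic amounts to a defect of NHSTP is the conceptual question taken up in the sections that follow.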

Study A is significant and Study C is not significant. Yet, the effect size is larger in Study C than in Study A. This is considered an anomaly, and it may be called the incommensurate significance-size problem. This problem suggests to critics that statistical significance is misleading at best, harmful at worst. The harm NHSTP does to research is that it precludes researchers from utilizing more profitably the quantitative information in the data. Specifically, if researchers are satisfied with the NHSTP result, they may neglect to determine the confidence interval estimate of the parameter.

Studies A and D in Table 8 jointly show that the incommensurate significance-size problem may assume the form of the magnitude-insensitivity problem. Their results are significant. However, the effect in Study D is larger than that in Study A. This useful information is not put to good use. The same point may be illustrated with Studies B and C. Although their results are not significant, the effect is larger in Study C than B. Again, the magnitude of the effect should be used (e.g., in meta-analysis; Glass, McGaw, & Smith, 1981; Schmidt, 1996).

A closer examination of the following issues shows that these criticisms themselves are debatable. First, the ambiguity is a conceptual or methodological problem, not a quantitative issue. Second, the effect size and NHSTP express the difference between the means of the experimental and control groups at different levels of abstraction. Third, parameter estimation is not theory corroboration. Fourth, non-statistical concerns cannot be addressed with statistical indices. Fifth, the validity of meta-analysis, as a theory-corroboration tool, can be questioned.

15. The Sample Size-Significance Dependence Problem Revisited

A persistent theme found in criticisms of NHSTP is that the fortuitous choice of the sample size (e.g., an unjustifiably large sample) may be responsible for a statistically significant result. However, Questions [Q1] and [Q2] suggest that the issue may have nothing to do with the sample size at all. The real concern may be questions about the internal validity of the research (Campbell & Stanley, 1963; Cook & Campbell, 1979). Be that as it may, that statistical significance may be questioned suggests that there are good reasons why affirming the consequent of [MAJ-3.1], [MAJ-3.2] or [MAJ-3.3] in Table 3 does not guarantee the truth of its antecedent. This may be seen more readily from the following non-experimental study.

Suppose that the effects of institutional constraints on a rehabilitation programme are assessed with a correlational study. It is found that the efficacy of the rehabilitation programme varies inversely with the number of institutional constraints. What does it mean to dismiss the study for the simple reason that the sample size is unusually large (e.g., n = 1,000)?

Note that to question the statistically significant result in this example is to question the conclusion that institutional constraints are really related to the failure of the rehabilitation programme. That is, this is a question about data interpretation (a conceptual concern), not about the numerical value of the test statistic or the sample size. Hence, it is necessary to consider the 'fortuitous sample size' argument more closely, not in quantitative terms, but in qualitative terms. That is, the issue is why it is more likely to introduce confounding variables when more participants are included in the correlational study.

To increase the sample size is to recruit more participants in the correlational study. Chances are that the participants would have to be recruited from more diverse settings. Consequently, not only does the chance of having a confounding variable increase, it also becomes more difficult to identify the confounding variable. The result (be it statistically significant or non-significant) becomes more ambiguous regarding the relationship between institutional constraints and the efficacy of the rehabilitation treatment of interest. More important, it would not be valid to apply the chain of reasoning depicted in Table 3 under such circumstances. As may be recalled from Table 5, the situation is very different in the case of the experiment because of experimental controls.

Why is the sample size-significance dependence problem not seen by critics as a concern about the internal validity of the research? The real source of the ambiguity is obscured by the suggestion that statistical significance may be manipulated by cynical researchers. Specifically, it is intimated by some critics that cynical researchers use excessively large samples if their interests are vested in a statistically significant result, but small samples if their vested interests are served by a non-significant result. However, that a tool may be misused speaks ill only of its users. It does not mean that the tool itself is unsatisfactory, particularly when nothing inherent in the tool invites its being misused.

It should be possible to dismiss the cynicism issue as irrelevant were there not the impression that psychologists accept (or do not accept) a research conclusion on the sole basis of statistical significance (or a non-significant result). The impression is misleading. For example, cognitive psychologists do not accept or reject a finding on the mere basis of statistical significance or non-significance (see, e.g., Coltheart's, 1980 or Haber's, 1983, discussion of the iconic store). Cognitive psychologists examine assiduously whether or not (a) a proper experimental design has been used in the experiment, (b) subjects have been given sufficient training, (c) all recognizable control variables or procedures are properly instituted, and (d) the correct statistical procedure is used.

In short, experimental psychologists are very meticulous about the internal validity of experiments (viz., both the inductive conclusion validity and statistical conclusion validity). They are aware that a statistically significant result may be ambiguous at the conceptual level as a result of various features found in the data-collection procedure or situation. In actual fact, experimental psychologists are so conscientious about the inductive conclusion validity issues that their attempts to eliminate conceptual or methodological ambiguities have recently been dismissed as 'methodolatry' (Danziger, 1990) or 'scientific rhetoric' (Gergen, 1991).

The realization that the ambiguity issue has nothing to do with NHSTP obviously has important implications for how to reduce ambiguity. For example, the ambiguity cannot be reduced by testing more subjects or analyzing parts of the data (as envisaged in Hunter & Schmidt's, 1990, psychometric meta-analysis). Nor can another numerical index be used to disambiguate the statistically significant result (be it the effect size or statistical power). It is instructive to recall the following observation:

The sum total of the reasons which will weigh with the investigator in accepting or rejecting the [substantive] hypothesis can very rarely be expressed in numerical terms. All that is possible for him is to balance the results of a mathematical summary, formed upon certain assumptions, against other less precise impressions based upon a priori or a posteriori considerations. (Neyman & Pearson, 1928, p. 176; emphasis in boldface and explication in square brackets added) [Quote 1]

Two obvious examples of Neyman and Pearson's (1928) numerical terms are statistical power and the effect size. An example of the a priori considerations is the choice between the repeated-measures and completely randomized designs. The consideration as to whether or not there is any confounding variable after the completion of the experiment is an example of the a posteriori considerations in question.

16. Two Levels of Abstraction - Statistical Significance and Effect Size

An assumption must be made explicit before one can assess whether or not Studies A and C in Table 8 suggest that statistical significance is anomalously related to the effect size. Specifically, it is necessary to assume that statements about statistical significance and the effect size are at the same level of abstraction. A look at how t and the effect size are respectively defined in Equations [Eq. 1] and [Eq. 2] suggests otherwise.

[a] t = {(Mean 1 - Mean 2) - (u1 - u2)}/standard error of differences [Eq. 1]

[b] d = (Mean 1 - Mean 2)/standard deviation of Group 1 [Eq. 2]

The (u1 - u2) component of the numerator of Equation [Eq. 1] is zero if the implication of chance influences is that u1 = u2 (Kirk, 1984). Consequently, the numerator is the same in both equations, namely, the difference between the two sample means. On the one hand, the denominator in [Eq. 1] is the standard error of differences. It is a property of a theoretical distribution, namely, the sampling distribution of differences. This distribution is at a level more abstract than the population of raw scores. The denominator in [Eq. 2], on the other hand, is the standard deviation of one of the two conditions in [Eq. 2]. This is a property of the population of raw scores. It follows that the test statistic used in NHSTP and the effect size are indices belonging to two different levels of abstraction. It seems neither valid nor appropriate to say that the relationship between statistical significance and the effect size is anomalous under such circumstances. This issue of mixing two levels of abstraction will surface again in the discussion of power analysis.
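The contrast between the two denominators can be seen by computing both indices from one set of scores. The sketch below is illustrative only. The two small samples are hypothetical, the groups are treated as independent, and the standard error of differences is computed from a pooled variance, which is one common choice that the text does not itself specify.

```python
import math
import statistics


def t_and_d(group1, group2):
    """Return (t, d) as defined in [Eq. 1] and [Eq. 2], assuming u1 - u2 = 0."""
    mean_difference = statistics.fmean(group1) - statistics.fmean(group2)
    n1, n2 = len(group1), len(group2)
    var1, var2 = statistics.variance(group1), statistics.variance(group2)
    # Assumption: independent groups, pooled variance for the standard error.
    pooled_variance = ((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2)
    standard_error_of_differences = math.sqrt(pooled_variance * (1 / n1 + 1 / n2))
    # [Eq. 1]: the denominator belongs to the sampling distribution of differences.
    t = mean_difference / standard_error_of_differences
    # [Eq. 2]: the denominator belongs to the raw scores of Group 1.
    d = mean_difference / statistics.stdev(group1)
    return t, d


group1 = [12, 15, 11, 14, 13]  # hypothetical scores
group2 = [10, 12, 9, 13, 11]   # hypothetical scores
t, d = t_and_d(group1, group2)
print(f"t = {t:.3f}, d = {d:.3f}")
```

Both indices share the numerator (the difference between the sample means). They differ only in whether that difference is scaled by a property of the sampling distribution of differences or by a property of the raw scores.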

17. Effect Size, the Binary NHSTP Decision and Evidential Support

Two points are emphasized in the anomaly critiques of NHSTP. First, the NHSTP result is a binary decision (i.e., significant versus non-significant). Second, the effect size is a continuous variable. However, the propriety of juxtaposing statistical significance and the effect size may also be questioned for the following reasons. First, these criticisms are made with the assumption that H1 is the substantive hypothesis. However, critics have not taken into account the facts that H1 is the complement of H0, and that H0 is a hypothesis about chance influences on data. In other words, H1 is neither the substantive nor the experimental hypothesis. It is but a statement to the effect that chance influences may be ruled out as an explanation of data. Consequently, to say that the result is statistically significant is to say something about the data and their collection. Statistical significance does not say anything about the substantive hypothesis.

The second reservation about critics' juxtaposing statistical significance and the effect size is a meta-theoretical one. To suggest supplementing statistical significance with the effect size in the theory-corroboration experiment is to say that the effect size has something to contribute to the evidential support for the substantive hypothesis. In view of the argument that the warranted assertibility offered by experimental data is conferred by the implicative relations among the quartet of hypotheses (see Table 3) and the inductive principle underlying the experimental design (see Table 5), not by statistics, the putative importance of the effect size can be discounted. The effect size has no role in either the deductive or the inductive reasoning depicted in Tables 3 and 4. It follows that a larger effect size does not mean a greater support for the substantive hypothesis (see also Chow, 1988). At the same time, the binary NHSTP suffices to provide the minor premise for the first conditional syllogism depicted in Table 3.

18. Effect Size and Practical Importance

Something seems amiss to critics when nothing can be learned about the practical impact of the statistically significant research result. It is suggested that this shortcoming is the result of relying on NHSTP. Moreover, it can be rectified by reporting the effect size, particularly when the binomial effect-size display (BESD) is used (Rosenthal & Rubin, 1979, 1982). This may be called the 'effect informs impact' claim. Of interest are (a) the fact that the argument in support of the claim is incomplete, and (b) the reason why the claim intrudes into the assessment of NHSTP. This discussion will make understandable the unwarranted practice of conflating statistical hypothesis testing with theory corroboration.

19. The 'Effect Informs Impact' Claim Revisited

There is a conceptual gap in the 'effect informs impact' claim. Consider the correlation coefficient, r, between medication (aspirin versus placebo) and myocardial infarction, MI (absence or presence) in Rosnow and Rosenthal's (1989) illustration. The r is used as an index of the effect size. What BESD does effectively is to convert the Pearson r = .034 into the 'change in success rate' in the form of a percentage, where by 'success' is meant the absence of MI in the illustration. The 'success rates' for the aspirin and placebo conditions are given, respectively, by Equations [Eq. 3] and [Eq. 4] as follows:

[a] The success rate for the Aspirin Condition: .5 + r/2 [Eq. 3]

[b] The success rate for the Placebo Condition: .5 - r/2 [Eq. 4]

The change in success rate is simply the difference between [a] and [b]. It turned out to be 3.4%. The conclusion is drawn that the implications of an effect of this magnitude are 'far from unimpressive' (Rosnow & Rosenthal, 1989, p. 1279), despite the fact that an r = 0.034 is statistically non-significant.
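The BESD arithmetic in [Eq. 3] and [Eq. 4] can be traced in a few lines. This is a minimal sketch; the function name is introduced here for illustration only.

```python
def besd_success_rates(r):
    """Return the BESD 'success rates' for the two conditions, given r."""
    aspirin_rate = 0.5 + r / 2   # [Eq. 3]
    placebo_rate = 0.5 - r / 2   # [Eq. 4]
    return aspirin_rate, placebo_rate


aspirin_rate, placebo_rate = besd_success_rates(0.034)
change = aspirin_rate - placebo_rate
print(f"aspirin: {aspirin_rate:.3f}, placebo: {placebo_rate:.3f}, change: {change:.1%}")
```

By construction, the change in success rate is (.5 + r/2) - (.5 - r/2) = r. The BESD therefore re-expresses r as a percentage and adds no information beyond r itself.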

The BESD is justified on the grounds that it is 'intuitively appealing ... [and] easily understood by researchers, students, and lay persons' (Rosenthal, 1983, p. 11). The difficulty is that the validity of this justification itself is by no means self-evident. It is simply not clear why the said rate of 3.4% is impressive. Would the same rate of change be impressive if the research is about the attitude change of some obscure film critics? Would it be more impressive if the film critics are prominent ones? It seems that, in the 'Aspirin-MI' example, a change in success rate of 3.4% owes its impressiveness to the nature of the to-be-monitored phenomenon (viz., incidents of MI), not to the magnitude of the change itself.

There is also the following question. To whom is the effect size impressive? A 3.4% change in the attitudes of film critics may not impress those who are interested in artistic issues. However, it may have a greater impact on film producers when they consider the monetary implications. In other words, impressiveness is in the eye of the beholder, not the size of the effect per se.

In short, by itself, the effect size says nothing about the practical impact of the result. What is required are criteria that relate the effect size to the judgment about impressiveness or practical impact. These criteria are outside the domain of statistics. Moreover, these criteria are domain-specific. Consequently, the claim that BESD is the general purpose index of practical impact is questionable. At the same time, the propriety of criticizing NHSTP in terms of practical validity may also be questioned because statistics and practical impact belong to different domains.

20. The Intrusion of Non-statistical Issues

The kernel-negative experiment used to introduce the rationale and procedure of NHSTP is like neither the examples used to introduce NHSTP in statistics textbooks nor those used in criticisms of NHSTP. The commonly used examples are studies used to ascertain the effectiveness of a course of action or treatment (e.g., using a new method to teach statistics). Typically, the new method is applied to one class of students, whereas the traditional method is used in another class of students. The mean performance of the two classes is tested with NHSTP. The only concern is whether or not the new method of teaching produces a better result. This is an issue about treatment assessment. The question as to why the new method produces a better result is often not an issue. Experiments of this type are tokens of the agricultural model experiments (Hogben, 1957; Meehl, 1978; Mook, 1983). Given their pragmatic objective, these experiments may also be characterized as utilitarian experiments. To see why non-statistical issues intrude into the discussion of the role of NHSTP in empirical research, it is necessary to consider the nature of the utilitarian experiment.

21. The Differences Between the Utilitarian and Theory-corroboration Experiments

It may be recalled from Table 1 that experimental data in the theory-corroboration experiment are at increasing deductive distances from the experimental, research and substantive hypotheses. As may be seen from Table 9, the same is not true of the utilitarian experiment for the following reason. Given the specificity of the objective, the choice of the independent and dependent variables in the utilitarian experiment is restricted by the research objective itself. This, in turn, determines the experimental and research hypotheses. Consequently, the statistical and substantive hypotheses are indistinguishable.

Table 9. The logical relations among the to-be-investigated phenomenon, pragmatic, research, and experimental hypotheses of the utilitarian experiment

Level: What is said at the level concerned

To-be-investigated phenomenon: A dissatisfaction with students' current understanding of statistics.

Substantive (pragmatic) hypothesis [P9.1]: Method E is more effective than Method C.

Research hypothesis [P9.2]: If [P9.1], then Method E produces better understanding than Method C.

Experimental hypothesis [P9.3]: If the consequent of [P9.2], then students taught with Method E have higher scores than those taught with Method C.

'Statistical alternative hypothesis' [P9.4]: If the consequent of [P9.3], then H1.*

Sampling distribution of H1 [P9.5]: If H1, then the probability associated with a difference between Methods E and C as extreme as 1.729 standard error units from an unknown mean difference is not known (assuming df = 19).

Sampling distribution of H0 [P9.5']: If H0,† then the probability associated with a difference between Methods E and C as extreme as 1.729 standard error units from a mean difference of zero is 0.05 in the long run (assuming df = 19).

* H1 = mean of Method E > mean of Method C.
† H0 = mean of Method E ≤ mean of Method C.
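
The figures quoted in [P9.5'] can be checked numerically. What follows is a minimal Python sketch and not part of the original exposition; the function names t_pdf and t_upper_tail are hypothetical, and Simpson's rule is merely a convenient assumption for the integration. It computes the upper-tail probability of a t value of 1.729 with df = 19 when H0 is true.

```python
import math

def t_pdf(t: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + t * t / df) ** (-(df + 1) / 2)

def t_upper_tail(t: float, df: int, steps: int = 2000) -> float:
    """P(T >= t) for t >= 0, using Simpson's rule on [0, t] and symmetry about 0.

    steps must be even for Simpson's rule.
    """
    h = t / steps
    total = t_pdf(0.0, df) + t_pdf(t, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    area_0_to_t = total * h / 3
    return 0.5 - area_0_to_t

# Values taken from Table 9: 1.729 standard error units, df = 19.
p = t_upper_tail(1.729, 19)
print(round(p, 3))
```

The result is approximately 0.05, in agreement with [P9.5']. No corresponding figure can be computed for [P9.5] because H1, as stated, does not specify a mean difference.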

Additional differences between the utilitarian and the theory-corroboration experiments are shown in Table 10. These differences may be used to understand, as well as to answer, some of the criticisms of NHSTP. To begin with, it has been noted that the impetus of the utilitarian experiment is primarily, if not exclusively, to find the solution to a practical problem (e.g., students' poor understanding of statistics; see Row 1 in Table 10). That is, the role of a theory is minimal, if there is one at all, in this kind of experiment (hence, the 'atheoretical' characterization in Row 4).

Table 10. Some differences between the agricultural (utilitarian) model and theory-corroboration experiments

Row 1. Impetus
Agricultural model (utilitarian): To solve a practical problem; reflexive of data collection
Theory corroboration: To explain a phenomenon; independent of data collection

Row 2. Subject Matter
Agricultural model (utilitarian): The practical problem involving observable events
Theory corroboration: Unobservable hypothetical entity and its theoretical properties

Row 3. Consequence of Research
Agricultural model (utilitarian): Take a particular course of action; closure of investigation
Theory corroboration: Accept tentatively, revise or reject the theory; no closure to the investigation

Row 4. Role of Theory
Agricultural model (utilitarian): Atheoretical
Theory corroboration: To-be-tested theory explicitly stated; used to guide experimental design

Row 5. Substantive Question
Agricultural model (utilitarian): 'Is the treatment effective?' 'How effective is the treatment?'
Theory corroboration: 'Why does the phenomenon occur?'

Row 6. Experimental Hypothesis
Agricultural model (utilitarian): The practical question itself
Theory corroboration: Qualitatively different from the to-be-assessed substantive hypothesis

Row 7. Experimental Manipulation
Agricultural model (utilitarian): The to-be-assessed efficient cause itself
Theory corroboration: Different from the to-be-explained phenomenon

Row 8. Dependent Measure
Agricultural model (utilitarian): The practical problem itself
Theory corroboration: Different from the to-be-explained phenomenon

Row 9. Statistical Significance
Agricultural model (utilitarian): To indicate that the explanation of data in terms of chance variations can be ruled out at the α level.
Theory corroboration: To indicate that the explanation of data in terms of chance variations can be ruled out at the α level.

Row 10. Effect
Agricultural model (utilitarian): Substantive efficacy (i.e., the consequence of an efficient cause)
Theory corroboration: The difference between the means of two conditions (i.e., the consequence of a formal or a material cause)

Row 11. Ecological Validity
Agricultural model (utilitarian): Necessary
Theory corroboration: Irrelevant, may even be detrimental

Suggestive of this difference is the fact that, whereas unobservable hypothetical entities or processes (e.g., the language processor) are the concerns of the theoretical endeavor in the theory-corroboration experiment, the subject matters of utilitarian experiments are observable activities or events (e.g., students' test scores; see Row 2). The result of the utilitarian experiment is used to guide a particular course of action (e.g., whether or not to adopt the new method of teaching; see Row 3). Experimental data in the theory-corroboration experiment, on the other hand, are used to assess whether or not there is evidential support for an explanatory substantive hypothesis (see Row 3). No pragmatic course of action follows. Nor is any practical problem solved as a result of the theory-corroboration experiment.

The experimental manipulation in the utilitarian experiment is the to-be-assessed efficient cause itself (e.g., the new method of teaching versus the traditional teaching method; see Row 7). However, the independent variable used in the theory-corroboration experiment is not an efficient cause. For example, the presentation of a kernel or a negative sentence does not shape or constrain subjects' behaviour in the way a teaching method may shape students' learning. In presenting kernel and negative sentences, the experimenter provides the hypothetical linguistic processor different contexts or environments in which to exhibit its theoretical properties. In other words, the independent variable in the theory-corroboration experiment is either a formal or a material cause, not an efficient cause.

22. 'Effect' - Vernacular and Technical Meanings

The contrast between the independent variable as the efficient cause in the utilitarian experiment versus its being the formal (or material) cause in the theory-corroboration experiment has important implications for how 'effect' or 'effective' is understood in the context of NHSTP. 'Effect' is used in its vernacular sense in the ambiguity-anomaly and the insensitivity to effect size criticisms of NHSTP. This is also the sense assumed in (as well as congenial to) the utilitarian experiment (see Rows 5 and 10 in Table 10). This is understandable in view of the fact that the experimental manipulation itself is substantively efficacious (e.g., methods of teaching). However, this does not mean that it is justified to do so when the independent variable is not an efficient cause (e.g., sentence type). What is important is that it is also not justified even when the experimental manipulation consists of two efficient causes, but for a different reason.

To adopt the vernacular meaning of 'effect' is to use a statistically significant result to do something more than rejecting chance influences as an explanation. It is to assert that the research manipulation is the explanation (see also 'The Specificity of H1 and Related Issues' section below). However, this assumption is justified only to the extent that the inductive conclusion validity is assured. In fact, as has been noted earlier in the 'Sample Size-Significance Dependence Problem Revisited' section, questions about a statistically significant result arise because there are doubts about the inductive conclusion validity. More important, these questions are not statistical ones. Consequently, it is doubtful that specifying the effect size or determining the confidence interval estimate would allay the non-statistical concerns that underlie the reservations about the statistically significant result.

Recall that H0 is a statement about the consequence of chance influences on data collection. 'Effect' at this level of discourse refers to the difference between the means of two data collection conditions. The NHSTP concern is whether or not the difference is large enough for the rejection of the explanation in terms of chance influences. This technical meaning of 'effect' is different from its vernacular meaning. It does not implicate any assumption of efficacy. More important, by itself, NHSTP does not identify the reason for the sufficiently large difference that leads to the 'statistically significant' decision. Nor should there be any reason to expect an answer coming from NHSTP when the issues implicated are nonstatistical ones.

In sum, critics' concern about the effect size may be represented by the questions tabulated in the left-hand column of Table 11 (see Rosnow & Rosenthal, 1989). These questions are asked because 'effective' is interpreted in its vernacular sense. However, Question [PV-2] does not directly lead to [PV-3] or [PV-4]. It is necessary to provide an independent set of criteria outside the domain of statistics to justify asking Question [PV-3] or [PV-4] in conjunction with Question [PV-2] (see Section 19, "The 'Effect Informs Impact' Claim Revisited"). Such a set of criteria is not available.

Table 11. Different sets of research questions pertinent to practical validity (PV) and conceptual rigor (CR) for the utilitarian and theory-corroboration experiments, respectively

Left-hand column: Practical Validity Concerns (Utilitarian Research)
The independent variable is the efficient cause.

[PV-1] Is Treatment T effective?
[PV-2] How effective is Treatment T?
[PV-3] How impressive is Treatment T?
[PV-4] Is Treatment T important?

Right-hand column: Conceptual Rigor Concerns (Theory-corroboration Research)
The independent variable is the material or formal cause.

[CR-1] Is Treatment T effective?
[CR-2] Is the independent variable a valid choice?
[CR-3] Do the data warrant the acceptance of Theory K which underlies the choice of the dependent variable?
[CR-4] Is the implementation of the independent variable valid?
[CR-5] Does the study have hypothesis validity?

Suppose that the technical meaning of 'effect' is adopted in discussing NHSTP. Although Question [CR-1] is literally the same as [PV-1], it leads to an entirely different set of questions relating to the difference between two data-collection conditions brought about by the experimental manipulation. It may be seen readily that, with the exception of Question [CR-5], these are questions about the data-collection conditions, particularly the inductive principle that underlies the experimental design.

23. Power Analysis

The power of a statistical test has recently become an important consideration in the assessment of empirical studies in psychology. Cohen's (1987) power analytic approach to empirical research has the following themes. First, if Phenomenon P exists, its effects must be detectable. Second, the evidence for the truth of a substantive hypothesis about Phenomenon P is the detectability of the effect envisaged in the hypothesis. Third, the substantive hypothesis is represented by H1 in NHSTP. Fourth, to detect the effect is to obtain statistical significance (i.e., to accept H1 by rejecting H0). Hence, statistical significance is indicative of the truth of H1 or the fact that Phenomenon P exists. These four inter-connecting themes may collectively be identified as the existence-detectability-significance thesis. For this reason, it is important for power analysts to know the a priori probability of obtaining statistical significance. That a priori probability is the power of the statistical test (Cohen, 1987; see also Mosteller & Bush, 1954).

The Type II error is assumed by critics to have real-life consequences. Hence, NHSTP users are faulted for ignoring it as a result of their exclusive obsession with the Type I error. With the advent of power analysis, the Type II error can now be controlled by specifying the level of statistical power desired for the investigation. This is possible because the power of a statistical test is (1 - β), where β is the probability of committing the Type II error. The value of β can be controlled by setting the level of the power.
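
How the power analytic account ties β to the chosen level of power may be illustrated with a minimal Python sketch. It illustrates the account being described and is not an endorsement of it. The one-tailed two-sample z test with a known common standard deviation, the function name z_test_power, and the values d = 0.5 and n = 50 per group are all assumptions made for the example; none of them comes from the text.

```python
from statistics import NormalDist

def z_test_power(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Power of a one-tailed two-sample z test for a standardized difference d.

    Assumes equal group sizes and a known common standard deviation
    (both are simplifying assumptions of this sketch).
    """
    std_normal = NormalDist()
    # Decision criterion fixed by alpha on the distribution based on H0.
    z_crit = std_normal.inv_cdf(1 - alpha)
    # Mean of the test statistic under the numerically specific H1.
    shift = d * (n_per_group / 2) ** 0.5
    return 1 - std_normal.cdf(z_crit - shift)

power = z_test_power(d=0.5, n_per_group=50)   # hypothetical inputs
beta = 1 - power                              # probability of the Type II error
print(round(power, 2), round(beta, 2))
```

It may be noted that a numerically specific value of d has to be supplied before any value of power, and hence of β, can be computed.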

That power analysis is currently well received is understandable in view of the facts that critics are convinced that NHSTP is problematic and that power analysis is presented as a remedy for the difficulties of ambiguity and anomaly attributed to NHSTP. However, if the criticisms of NHSTP themselves are debatable, it should become easier to consider power analysis in a more judicious way. There are good reasons to question the existence-detectability-significance thesis of power analysis.

Consider its first theme, namely, that if H1 is true, there is a detectable effect. This theme is contrary to the fact that the tenability of some hypotheses depends on not rejecting H0 (i.e., not detecting any effect, in the parlance of power analysis). An example is Schneider & Shiffrin's (1977) study of automatic detection. The third theme of the thesis that H1 is the substantive hypothesis is debatable in view of the quartet of hypotheses identified in Table 1 and the discussion in Section 12, 'H0, Data and Chance Influences.' Consequently, all power analytic assertions based on identifying H1 with the substantive hypothesis are questionable.

The detectability of the effect is equated with statistical significance in the second theme of the existence-detectability-significance thesis. This makes explicit an implicit assumption in power analysis, namely, that NHSTP is not different from the procedure of the theory of signal detection (TSD). An examination of this NHSTP-TSD affinity assumption reveals additional conceptual difficulties in power analysis.

24. The NHSTP-TSD Affinity in Power Analysis

Indicative of the NHSTP-TSD affinity envisaged in power analysis are assertions like "Since effects are appraised against a background of random variation" (Cohen, 1987, p. 13), and "[the said appraisal consists of] detecting a difference between the means of populations A and B ... " (Cohen, 1987, p. 6, emphasis in italics added). At the level of rationale, the appeal is made to Neyman & Pearson's (1928) emphasis on the posterior probability. It is believed that researchers first determine what a sample statistic is (e.g., the sample mean). They then ask (or wish to ask) what the probability is that the sample has been selected from Population P with parameter μ (see Cohen, 1994). An appeal to the a posteriori probability in this "from sample statistic to population parameter" manner is also found in a TSD analysis.

It is recognized in TSD that an observer's response bias is a function of the prior odds (viz., the ratio of the probability of the noise event to that of the signal event) and the payoff matrix (i.e., the costs for committing errors and the gains due to making correct detection). Something very similar is suggested in power analysis. Specifically, it is suggested that the placement of the decision axis used to make the statistical decision should reflect a balance struck between statistical power and α (Cohen, 1987, p. 5). This is achieved by taking into account the ratio of the probability of the Type II error to the probability of the Type I error. Researchers are further urged to pay attention to "... the relationship between n and power for [their] situation, taking into account the increase in cost to achieve a given increase in power ..." (Cohen, 1965, p. 98; Cohen's emphasis in italics).
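
The dependence of the TSD response bias on the prior odds and the payoff matrix may be made concrete with a minimal Python sketch. The text states only that the bias is a function of these two quantities; the particular expression used below is the standard textbook one for the payoff-maximizing likelihood-ratio criterion and is an assumption of this sketch, as are the function name optimal_criterion and all numerical values.

```python
def optimal_criterion(p_signal: float,
                      value_hit: float, cost_miss: float,
                      value_correct_rejection: float,
                      cost_false_alarm: float) -> float:
    """Likelihood-ratio criterion that maximizes expected payoff.

    Costs are entered as positive magnitudes. The expression is the
    textbook TSD one, assumed here for illustration only.
    """
    prior_odds = (1 - p_signal) / p_signal  # p(noise) / p(signal)
    payoff_ratio = (value_correct_rejection + cost_false_alarm) / (value_hit + cost_miss)
    return prior_odds * payoff_ratio

# Hypothetical scenarios: the criterion moves with prior odds and payoffs.
neutral = optimal_criterion(0.5, 1, 1, 1, 1)
rare_signal = optimal_criterion(0.2, 1, 1, 1, 1)
costly_false_alarm = optimal_criterion(0.5, 1, 1, 1, 3)
print(neutral, rare_signal, costly_false_alarm)
```

A stricter criterion results when the signal is rarer or when false alarms are costlier. This is the sense in which the placement of the decision axis reflects considerations that are not properties of the signal event itself.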

25. Issues Raised by the NHSTP-TSD Affinity

A correspondence between two sets of descriptive terms becomes obvious if the affinity between NHSTP and TSD is recognized. Of particular interest is that between statistical power and hit rate. It renders questionable the following assertion,

The power of a statistical test is the probability that it will yield statistically significant results. (Cohen, 1987, p. 1, emphasis in italics added) [Quote 2]

26. Statistical Power - A Conditional Probability

The Type I error is made when the researcher rejects a true H0; this is analogous to committing a false alarm in TSD. Power analysts use [H1 True] as a sub-column heading in the upper left panel of Table 12. The Type II error is committed when the researcher fails to reject H0 when H1 is true. The logical complement of the Type II error (viz., rejecting H0 when H1 is true) in NHSTP is equivalent to a hit in TSD (see the upper right panel). Note that a hit in TSD refers to a "Yes" response contingent on the presence of a signal event. That is, a hit is a characterization of the observer's behavior, given that the signal is present. It says nothing about the signal event per se. It follows that the hit rate in TSD is a conditional probability, namely, the probability of an observer's saying "Yes" when a signal event indeed occurs. In other words, the hit rate says nothing about the exact probability of the presence of a signal event.

Table 12. The correspondence between some concepts (upper left) and their probabilities (lower left) in NHSTP and concepts (upper right) and their probabilities (lower right), given the NHSTP-TSD affinity in power analysis.

Upper Panel

NHSTP Concepts (upper left). Columns: Decision; State of Affairs (H0 True; H0 False [H1 True])
"Not Reject": H0 True = Correct acceptance; H0 False [H1 True] = Type II error
"Reject": H0 True = Type I error; H0 False [H1 True] = Correct rejection

TSD Concepts (upper right). Columns: TSD Response; State of Affairs (Noise; Signal)
"No": Noise = Correct rejection; Signal = Miss
"Yes": Noise = False alarm; Signal = Hit

Lower Panel

NHSTP Concepts (lower left). Columns: Decision; State of Affairs (H0 True; [H1 True])
"Not Reject": H0 True = p(Correct acceptance); [H1 True] = p(Type II error) = β
"Reject": H0 True = p(Type I error) = α; [H1 True] = Power = (1 - β)

TSD Concepts (lower right). Columns: TSD Response; State of Affairs (Noise; Signal)
"No": Noise = Correct rejection rate; Signal = Miss rate
"Yes": Noise = False alarm rate; Signal = Hit rate

At the same time, as may be seen from the two lower panels of Table 12, the TSD analog of statistical power is the hit rate. Hence, the statistical power is a conditional probability (see also Chow, 1991c). That is to say, knowing the power of a test (a conditional probability) is not knowing the probability of obtaining statistical significance (an exact probability). More important, given the NHSTP-TSD affinity, the statistical power index says something about the researcher, not H1, in much the same way the hit rate says something about the observer, not the signal event. In short, statistical power does not (and cannot) enlighten us as to the probability of obtaining statistical significance.
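
The gap between a conditional probability and an unconditional one may be illustrated with a minimal Python sketch based on the law of total probability. The quantity p_h1 (the probability that H1 is true) is a purely hypothetical input introduced for the illustration; neither power analysis nor the present discussion supplies such a value, and the function name p_significance and the figures 0.80 and 0.05 are likewise assumptions.

```python
def p_significance(p_h1: float, power: float, alpha: float) -> float:
    """Unconditional probability of a significant result.

    p(significant) = p(significant | H1) * p(H1) + p(significant | H0) * p(H0)
    """
    return power * p_h1 + alpha * (1 - p_h1)

# The same power (0.80) is compatible with very different unconditional
# probabilities, depending on a quantity that power does not provide.
for p_h1 in (0.1, 0.5, 0.9):
    print(p_h1, round(p_significance(p_h1, power=0.80, alpha=0.05), 3))
```

With the power held at 0.80, the unconditional probability ranges from 0.125 to 0.725 across the three hypothetical values of p_h1. Knowing the power alone therefore fixes none of them.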

27. Statistical Power - A Misleading Sense of Efficacy

An efficacious capability is attributed to the statistical procedure in [Quote 2]. It suggests that statistical significance is reached by virtue of the numerical index, statistical power (see the emphasis in italics in [Quote 2]). This assertion is misleading because, at the level of statistics, statistical power simply refers to the cumulative probability over a range of parameter values (viz., all values that are as extreme as the critical values of the test statistic). No efficacy of any sort is implicated at this level of discourse. A nonstatistical theoretical justification is required if an efficacious capability is attributed to statistical power. As no such justification is offered, it is only proper not to attach any extra-statistical meaning to the term, statistical power.

There is no a priori reason why the decision to reject H0 in the event that it is false should not simply be called the Type II correct decision. Power analysis might not have been so readily accepted had a non-evocative term like not-β been used instead of power. Perhaps an excess and unwarranted meaning is attributed to a conditional probability as a result of its being labeled with the evocative term, power, a connotative meaning of which is being efficacious. The same is also true of statistical significance.

28. Graphical Representation of Statistical Power, Effect and NHSTP

It is taken for granted in the discussion so far that the concept, statistical power, is valid. The validity of power analysis becomes more questionable if there are reservations about the validity of statistical power itself. That [Quote 2] is inconsistent with statistical power being a conditional probability is one such reservation. There are additional reservations.

29. Two Levels of Abstraction -- Statistical Significance and Statistical Power

Consider the assertion, 'A salutary effect of power analysis is that it draws one forcibly to consider the magnitude of effects' (Cohen, 1990, p. 1309). This assertion is made because of the functional relationship between statistical power and effect size (given n and α) envisaged in power analysis. This functional relationship is readily seen from Panels A and B of Figure 1. Before proceeding any further, it must be noted that Cohen (1965, 1987, 1992a, 1992b) does not use any graphical representation when he discusses statistical power, effect size or the functional relationship between the two. Nonetheless, Figure 1 is used for ease of exposition. Its use is justified by the fact that it is consistent with how d and statistical power are defined in power analysis.

FIGURES UNAVAILABLE IN THIS VERSION

Figure 1. The graphical representation of two effect sizes (Panels A & B), and the corresponding differences between two means in raw-score units (Panels C & D), as well as in standard error units (Panel E)

The x-axis in both panels represents population scores (as stipulated by how d is defined in [Eq. 2] above). The left and right distributions in either panel represent the control and experimental distributions, respectively. The effect size is represented by the distance between the two distributions, and statistical power is represented by the area shaded with slanting lines. To power analysts, Panels A and B represent two situations in which the desired effect is larger in Panel B than in Panel A, and Panel B represents a more powerful test than Panel A. Of interest is whether or not research manipulations that are expected to be differentially efficacious would have different impact on NHSTP.

30. H0 and Research Manipulation Efficacy

The pair of population distributions in the 'small effect' situation (viz., Panel A in Figure 1) gives rise to the lone sampling distribution of the difference depicted in Panel C of Figure 1. Similarly, the pair of population distributions in the 'large effect' situation brings about another lone sampling distribution of the difference (i.e., the one depicted in Panel D of Figure 1). The two sampling distributions of the difference in Panels C and D have the same standard error of the difference in the present example. However, the two sampling distributions cover different parts of the difference between two means continuum in raw-score units (viz., from -2.5 to 4.5 in Panel C versus from -0.5 to 6.5 in Panel D).

Consider the numerator used in calculating the test statistic, t. It is often written as (Mean 1 - Mean 2). However, it is really a short-hand form for [(Mean 1 - Mean 2) - (μ1 - μ2)], where (μ1 - μ2) = 0. As has been noted before, the (μ1 - μ2) component is left out when it is numerically equal to 0 (see Kirk, 1984). The distribution in the top panel of Figure 2 represents a sampling distribution of the difference for a situation in which (μ1 - μ2) = 0. That is, the mean difference of the sampling distribution of the difference between two means is zero.

FIGURES UNAVAILABLE IN THIS VERSION

Figure 2. The sampling distribution of the difference in raw-score units when the mean difference is 0 (top panel), 1 (middle panel) and 3 (bottom panel)

Power analysts suggest that the desired difference, (Mean 1 - Mean 2), may be 3.0 (or any definite value; e.g., 1), rather than 0. The numerator now becomes [(Mean 1 - Mean 2) - (μ1 - μ2)], where (μ1 - μ2) = 3.0 or (μ1 - μ2) = 1.0, in such an event. That is, the mean difference of the sampling distribution of the difference implicated in NHSTP is 3.0 (or 1.0), and it is graphically represented in the bottom (or middle) panel of Figure 2. The three sampling distributions in the three panels of Figure 2 have the same standard error of differences, but different values for the mean difference (viz., 0, 1 and 3.0). They represent the sampling distribution under H0 in three different situations. Specifically, the bottom panel represents a research manipulation expected to be more efficacious than the one depicted in the middle or top panel.
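
The role of the hypothesized mean difference in the numerator of t may be made explicit with a minimal Python sketch. The independent-samples t with a pooled variance estimate is an assumption of the sketch (the text does not specify the form of the test), and the function name t_statistic and the two samples are hypothetical.

```python
import math
from statistics import mean, variance

def t_statistic(sample_1, sample_2, hypothesized_diff=0.0):
    """Independent-samples t with the hypothesized mean difference kept explicit.

    Numerator: (Mean 1 - Mean 2) - hypothesized_diff.
    A pooled variance estimate is assumed for the standard error.
    """
    n1, n2 = len(sample_1), len(sample_2)
    pooled_var = ((n1 - 1) * variance(sample_1)
                  + (n2 - 1) * variance(sample_2)) / (n1 + n2 - 2)
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    numerator = (mean(sample_1) - mean(sample_2)) - hypothesized_diff
    return numerator / se_diff

# Hypothetical data: observed mean difference = 3, standard error = 1.
group_1 = [12, 14, 11, 15, 13]
group_2 = [10, 11, 9, 12, 8]

t_zero = t_statistic(group_1, group_2)          # hypothesized difference of 0
t_three = t_statistic(group_1, group_2, 3.0)    # hypothesized difference of 3.0
print(round(t_zero, 2), round(t_three, 2))
```

The standard error is the same in both calls. Only the location against which the observed difference is measured changes with the hypothesized mean difference.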

Represented on the x-axis of the graphical representation in any panel of Figure 2 is the range of possible values of the difference between two means. In other words, the three panels in Figure 2 collectively show that the difference in the expected efficacy of the research manipulation is represented by the spatial displacement of the sampling distribution of the differences between two means along the continuum of all possible values of the difference between two means. This state of affairs is different from the impression conveyed by Panels A and B in Figure 1.

In carrying out NHSTP, only one sampling distribution is used (viz., the one contingent on H0 being true). Moreover, the researcher uses a standardized form of the sampling distribution depicted in either Panel C or D of Figure 1 (viz., the z or t distribution; see Siegel, 1956). That is, regardless of the mean difference in raw-score units, the standardized representation of the to-be-used sampling distribution of the difference remains the same (viz., Panel E in Figure 1). More important, the location of the decision axis vis-à-vis the mean of the sampling distribution of the difference remains unchanged for the same α level. It follows that the outcome of NHSTP is not affected by the desired effect or expected efficacy of the research manipulation.
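
That standardization removes the raw-score location of the sampling distribution may be checked with a minimal Python sketch. The raw-score ranges are those quoted for Panels C and D of Figure 1; taking the midpoint of each range as its mean difference and setting the common standard error to 1.0 are assumptions of the sketch, as are all the names used.

```python
SE_DIFF = 1.0  # hypothetical common standard error of the difference

# Raw-score ranges quoted in the text for Panels C and D of Figure 1.
PANEL_C = (-2.5, 4.5)
PANEL_D = (-0.5, 6.5)

def standardized_range(raw_range, se_diff=SE_DIFF):
    """Express the end points in standard error units from the range's midpoint."""
    low, high = raw_range
    mean_diff = (low + high) / 2   # assumed centre of the sampling distribution
    return ((low - mean_diff) / se_diff, (high - mean_diff) / se_diff)

print(standardized_range(PANEL_C))
print(standardized_range(PANEL_D))
```

Both calls yield the same standardized range, even though the two raw-score ranges are centred on different values (1.0 and 3.0). A decision axis placed at a fixed number of standard error units from the mean therefore occupies the same position in either case.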

Figure 1 shows that two distributions of population scores converge on one standardized distribution via a lone sampling distribution of the test statistic. Panel A or B in Figure 1 shows that it takes two population distributions to depict statistical power, whereas Panel E shows that only one sampling distribution is used to depict NHSTP. Moreover, two different levels of discourse are implicated in Panels A (or B) and E. This demonstrates that it is impossible to represent statistical power graphically without misrepresenting NHSTP. It casts doubts on the validity of the concept, statistical power.

Some important points may now be summarized. First, no distribution based on H1 is implicated in NHSTP (see Panel E of Figure 1). Second, the mean difference in raw-score units of the sampling distribution of difference reflects the theoretical difference between two population means. When expressed in terms of the raw-score unit, this difference is graphically represented by the spatial displacement of the sampling distribution on the difference between two means continuum (see the three panels in Figure 2).

Third, it is not possible to represent graphically the conditional probability, statistical power, if the rationale of NHSTP is properly represented with a single sampling distribution of the difference between two means. Fourth, the desired effect of the research manipulation (in the technical sense of the word) has no impact on NHSTP because the to-be-employed sampling distribution is standardized (e.g., in the form of the appropriate t distribution) before being used to make the 'chance versus non-chance' decision.

31. The Specificity of H1 and Related Issues

For non-power analysts, 'Type II error' in the upper left panel of Table 12 refers to the error committed when a false H0 is not rejected (i.e., ignore the [H1 True] column heading). No mention is made of H1 in this definition. It may be recalled from the lower panel of Table 2 that H0 and H1 are mutually exclusive and exhaustive. This is emphasized in Table 7 by depicting that H0 is the implication of chance influences, and that H1 is the implication of some ill-defined non-chance influences. It follows that, while H0 and H1 are mutually exclusive and exhaustive alternatives, 'H0 False' is not synonymous with 'H1.'

Defining 'Type II error' in terms of 'H0 False' instead of [H1 True] in the upper left panel of Table 12 helps to maintain the distinction between inductive conclusion validity and statistical conclusion validity. Specifically, while NHSTP is used to decide between chance influences and non-chance influences (see Tables 2 and 7), inductive reasoning is employed to identify the non-chance factor involved (see Table 5). Also important is that H1 is numerically non-specific (see [P1.5] in Table 1).

In order to define power, 'Type II error' is defined in power analysis as the error committed in the event that H1 is true. That is, it is necessary to use the [H1 True] heading in the upper left panel of Table 12. Moreover, H1 is given a specific non-zero numerical value in power analysis. This effectively changes the conceptual meaning of H1 from being an implication of non-chance influences to being the consequence of a specific efficient cause. This is reminiscent of the consequence of using 'effect' in its vernacular sense discussed in Section 22, "'Effect' - Vernacular and Technical Meanings." Consequently, H0 and H1 are no longer mutually exclusive and exhaustive in the power analytic account of NHSTP. More important, in making the meaning of H1 numerically specific, power analysts may have eschewed the distinction between the two types of internal validity. NHSTP is given the additional role that should be played by inductive logic.

The power analytic practice of making H1 numerically specific is consistent with the Multiple-H1 Assumption view that there are, in fact, multiple numerical alternatives to H0 (Neyman & Pearson, 1928; Rozeboom, 1960). However, this assumption should have no bearing on NHSTP, as may be recalled from the 'H0 and Research Manipulation Efficacy' discussion in Section 30. Why is there the emphasis on multiple numerically specific H1's? The answer may be the fact that the term 'alternative hypothesis' is also used in another sense, albeit at a different level of discourse.

Given any to-be-explained phenomenon, there are alternative explanatory theories at the conceptual level (Popper, 1968a/1959, 1968b/1962). This state of affairs may be characterized as the Reality of Multiple Explanations view in subsequent discussion. In actual fact, different psychologists often explain the same phenomenon with various substantive hypotheses. Moreover, diverse hypothetical structures or functions are postulated in these competing theories.

For example, some psychologists prefer Fillmore's (1968) case grammar or Yngve's (1960) 'Depth' model to Chomsky's (1957) transformational grammar. These three substantive hypotheses lead to different research and experimental hypotheses (à la the schema depicted in Table 1). As these experimental hypotheses may implicate different independent and dependent variables in diverse experimental situations, they lead to qualitatively different H1's. The distinction between the Multiple-H1 Assumption and the Reality of Multiple Explanations views depicted in Table 13 can be used to defend NHSTP against the Multiple-H1 Assumption critique of NHSTP.

Table 13. The distinction between statistical alternative hypothesis and alternative explanatory hypothesis

Multiple-H1 Assumption

[a] H1: (μ_negative - μ_kernel) < 0; H0: (μ_negative - μ_kernel) = 0
[b] H1': (μ_negative - μ_kernel) = -3; H0': (μ_negative - μ_kernel) = 0
[c] H1'': (μ_negative - μ_kernel) = 5; H0'': (μ_negative - μ_kernel) = 0

Reality of Multiple Explanations

[i] H1: (μ_negative - μ_kernel) < 0; H0: (μ_negative - μ_kernel) = 0
[ii] H1': (μ_ca - μ_a) > 0; H0': (μ_ca - μ_a) = 0
[iii] H1'': (μ_n-p - μ_n-n) 0; H0'': (μ_n-p - μ_n-n) = 0

ca = counter-agent; a = agent

n-p = negative sentence with positive meaning; n-n = negative sentence with negative meaning

32. Alternative Substantive Hypothesis versus Statistical Alternative Hypothesis

Consider first the Multiple-H1 Assumption column in Table 13. In terms of the number of extra words recalled, H1 of the kernel-negative experiment is shown in Row [a]. Two additional alternatives to Ho are shown in Rows [b] and [c] in the Multiple-H1 Assumption column. Each one of these statistical alternative hypotheses is a point-prediction. However, if the Multiple-H1 Assumption and the Reality of Multiple Explanations view were the same, it would be necessary to show something like what follows: Alternative [a] is an implication of the transformational grammar, Alternative [b] is derived from the case grammar, and Alternative [c] follows logically from Yngve's (1960) 'Depth' model. Ironically, the researcher should be very unhappy about the three theoretical alternatives if this were the case for the following reason.

Such a state of affairs occurs when the three numerical alternatives are alternative outcomes in the same experimental context (e.g., the same independent variable is manipulated, as indicated by the subscripts used in the Multiple-H1 Assumption column). This takes place when there is no qualitative difference among the theoretical structures or mechanisms envisaged in the three substantive hypotheses. Consequently, they do not differ in terms of how well they explain our linguistic competence at the conceptual level. This means that the three hypotheses give the same qualitative prescription in a well-defined task context. In other words, the three hypotheses are merely variations of the same genre under such situations. The choice among the three alternatives becomes a non-theoretical one. In what sense does the quantitative difference in question matter if it does not make any difference at the explanatory level?

Consider now the 'Reality of Multiple Explanations' column in Table 13. To begin with, additional independent variables generally have to be added to the experiment in order to test multiple explanatory hypotheses simultaneously. For example, it may become necessary to manipulate Sentence Modality (e.g., agent versus counter-agent) in order to test the case grammar. The depth of the sentence structure would have to be manipulated if the 'Depth' model is being tested. Specifically, it may be necessary to manipulate the type of negative sentences being used (e.g., a negative sentence with a positive meaning versus one with a negative meaning).

A prerequisite of a successful theory-corroboration experiment is that qualitatively different theories make different experimental prescriptions. For example, the prescription of the transformational grammar is in Row [i]. The case grammar prescribes H1' in Row [ii]. The 'Depth' model prescribes H1'' in Row [iii]. As may be seen from the subscripts of the various means, the three different statistical alternative hypotheses are the implications of their respective experimental hypotheses at the statistical level.

33. A Triad of Hypotheses: the Substantive Hypothesis, H1 and Ho

Two things should be emphasized. First, multiple conceptual alternative hypotheses give rise to their respective statistical alternative hypotheses. Second, the multiple statistical hypotheses (e.g., H1, H1', and H1'' in Table 13) are not alternatives to one single Ho. They have their own null hypotheses (viz., Ho, Ho', and Ho'', respectively) even though these null hypotheses may be numerically equal to zero. However, they are zero under different conditions. This is indicated by the fact that three different conceptual hypotheses implicate different independent variables (viz., sentence-type, case, and negative-type, respectively, as may be seen from the subscripts in Table 13). In other words, each of these multiple null hypotheses describes what chance variations are like under its own unique set of conditions.

In short, the differences among the experimental expectations prescribed by diverse alternative substantive hypotheses at the conceptual level are not a matter of numerical differences such as μ1 = 5, μ2 = 10, μ3 = 15, and the like. Consequently, the exclusion of unwarranted alternative hypotheses at the conceptual level is also not a matter of choosing among numerically different H1's (Chow, 1989). It involves testing different H1's defined by dissimilar data-collection conditions. Each of these H1's has its own H0.

34. Statistical Power and Sample Size

The best known utility of statistical power is that it can be used to disambiguate the difficulties brought about by the arbitrariness, ambiguity or anomaly attributed to NHSTP. It is argued that, if the test is of sufficient power, one can be sure that the statistically significant result is genuinely significant, and that a non-significant result is really non-significant. An important stipulation is that an appropriate sample size be determined with the help of the general purpose Sample Size Tables (Cohen, 1987). Researchers can determine the appropriate sample size with reference to (a) the desired power, (b) the desired effect size, and (c) the α level to be adopted. An alternative set of tables may be found in Kraemer and Thiemann (1987). This important function of statistical power is best summarized in [Quote 3].

... from a power analysis at, say, α = .05, with power set at, say, .95, so that β = .05, also, the sample size necessary to detect this negligible effect with .95 probability can be determined. Now if the research is carried out using that sample size, and the result is not significant, as there had been a .95 chance of detecting this negligible effect, and the effect was not detected, the conclusion is justified that no nontrivial effect exists, at the β = .05 level. (Cohen, 1990, p. 1309) [Quote 3]
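The mechanical procedure described in [Quote 3] may be sketched as follows. This is only an illustration under stated assumptions: it uses the textbook normal approximation for a two-group comparison of means instead of Cohen's (1987) tables, and the function name `n_per_group` and its parameters are hypothetical. Only the α level, the desired power and the effect size enter the calculation.

```python
from math import ceil
from statistics import NormalDist


def n_per_group(effect_size, alpha=0.05, power=0.95, one_tailed=True):
    """Approximate number of subjects per group for a two-group comparison.

    Normal-approximation formula: n = 2 * ((z_alpha + z_power) / d) ** 2.
    An illustrative stand-in, not a reproduction of Cohen's (1987) tables.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z = NormalDist()  # standard normal distribution
    tail = alpha if one_tailed else alpha / 2
    z_alpha = z.inv_cdf(1 - tail)  # criterion fixed by the alpha level
    z_power = z.inv_cdf(power)     # criterion fixed by the desired power
    return ceil(2 * ((z_alpha + z_power) / effect_size) ** 2)


# alpha = .05 and power = .95 (so beta = .05), as in [Quote 3]
for d in (0.2, 0.5, 0.8):
    print(f"d = {d}: about {n_per_group(d)} subjects per group")
```

The sketch takes no argument for the experimental design, the task, or the amount of practice given to the subjects.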

This mechanical approach to sample-size determination is inappropriate for experimental studies. To begin with, no reference is made in power analysis to the experimental design used. In general, fewer subjects are required when the repeated-measures design is used than when using the completely randomized design. There are other considerations when the matched-pair (for the 1-factor, 2-level design) or matched-group (for the multi-factor, multi-level design) is used. The stability of the data may be influenced by the success of the matching procedure.

Another unsatisfactory feature of the power analytic way of determining the sample size is its disregard for how well trained the experimental subjects are. This, in turn, is dependent on the nature of the experimental task used. Data stability may be secured by using a few well-trained subjects when an unusual task like Sperling's (1960) partial-report task is used. Sometimes the nature of the investigation demands a large sample of naive subjects (e.g., Keppel & Underwood's, 1962, study of proactive interference). Furthermore, the number of subjects required may be influenced by the number of experimental sessions (as well as the number of trials within an experimental session). These important procedural considerations have not been taken into account in power analysis.

In sum, the concern with sample size has a lot to do with data stability. This issue cannot be settled with a mechanical procedure or a general purpose tool for the simple reason that data stability is not determined by the sample size alone. It is also affected by the nature of the experimental task, the amount of practice the subjects have before data collection, and the experimental design used. These considerations are some of the a priori considerations recommended by Neyman and Pearson (1928) in [Quote 1].

35. A Criticism of NHSTP with a Bayesian Overtone

As may be recalled from the 'Null-hypothesis Significance-test Procedure (NHSTP)' discussion in Section 4, important to NHSTP is the associated probability, p. It is the conditional probability, p(Data|H0). Some critics find the reliance on p unsatisfactory for various reasons. For example, p is often misunderstood, and knowing p(Data|H0) is not knowing p(H0|Data). At the same time, the assertion is made in power analysis that:

Now, what really is at issue, what is always the real issue, is the probability that Ho is true, given the data, P(Ho|D), the inverse probability. (J. Cohen, 1994, p. 998) [Quote 4]

This concern with the inverse probability is like the Bayesian appeal to the posterior probability. However, to find NHSTP wanting for this reason goes beyond statistics. Instead, it raises questions about the nature of empirical research or the purpose of conducting empirical research. Of interest to the present discussion are methodological issues underlying the Bayesian, as well as the power analytic, theme that empirical data are collected to ascertain the posterior (or inverse) probability of the hypothesis of interest. The methodological issues are (a) the prototype of empirical research envisaged in the Bayesian approach, (b) the nature of the Bayesian hypothesis, and (c) the role of replication studies in empirical research.

36. The Bayesian Hypothesis and Sequential Sampling Procedure

It is assumed in the Bayesian approach that, before collecting data, the researcher attributes a prior probability (viz., the degree of belief) to the hypothesis. The research objective is not to ascertain the tenability of the hypothesis. Instead, data are collected to adjust the prior degree of belief with the Bayesian theorem. The new degree of belief in the hypothesis is the posterior probability. Given the Bayesian theorem, the evidential support for the hypothesis offered by the current data may not be sufficient to overcome the impact of the prior degree of belief. Hence, it is a Bayesian theme that researchers must take into account the prior probability of the hypothesis when they interpret the data.

That the Bayesian approach is of limited applicability to psychological research may be seen more readily after a discussion of the type of data-collection exercise suitable for Bayesian analysis. More important, it may be seen that the Bayesian approach cannot be used for theory-corroboration purposes because it cannot be used to test explanatory hypotheses. Moreover, the Bayesian emphasis on the prior probability is antithetical to objectivity. The following example is adapted from Phillips' (1973) illustration.

Suppose that a newspaper editor, E, would endorse the centre party in the coming election when it is preferred by 75% of the prospective voters polled. Editor E commissions polls about the impending election in order to decide whether or not the criterion for endorsing the centre party is met. Columns 1 through 5 in Table 14 represent 5 successive polls conducted. Entries in the 'Prior Probability' rows represent Editor E's prior degrees of belief in the three parties winning the election before the polling period specified by the column number. Specifically, Editor E assigns prior probabilities of .50, .60 and .40 to the left-wing, centre, and right-wing parties, respectively, before the first poll (see the 'Prior Probability' entries in Column 1 in Table 14). The entry in the cell intersecting a column and the 'Evidence' row represents the poll result about the centre party in the poll in question. In other words, the percentages of people polled who choose the centre party are 30, 38, 45, 55 and 50 in Periods 1 through 5, respectively.

Table 14. The accumulation of data and the conversion of the prior probability (Prior DOB) into its corresponding posterior probability at successive research stages

                                      Poll-data Inspection Period
                                       1      2      3      4      5
Prior Probability               HL    .50    .36    .31    .27   [.27]
                                HC    .60    .40    .44    .59   [.59]
                                HR    .40    .32    .24    .14   [.14]
Evidence                               30     38     45     55   [50]
Likelihood of Evidence          HL    .38    .38    .40    .30   [.40]
                                HC    .35    .50    .60    .75   [.60]
                                HR    .32    .35    .25    .35   [.25]
Prior Probability x Likelihood  HL    .19    .14    .12    .08   [.11]
                                HC    .21    .20    .26    .44   [.35]
                                HR    .13    .11    .06    .05   [.04]
Posterior Probability           HL    .36    .31    .27    .18    .22
                                HC    .40    .44    .59    .77    .70
                                HR    .32    .24    .14    .09    .08

HL = The left-wing party will form the next government. [H7-1]

HC = The centre party will form the next government. [H7-2]

HR = The right-wing party will form the next government. [H7-3]

Evidence = The percentage of voters polled who indicate a preference for the centre party in the present example.

Likelihood of Evidence = The probability of the evidence, given that HC (HL or HR) is true.

The 'Likelihood of Evidence' is the probability of the evidence (e.g., 30% of those polled favor the centre party), if the centre party actually wins the election (viz., .35). As may be seen from the 'Prior Probability x Likelihood' and 'Posterior Probability' rows, the posterior probability is given by the Bayesian theorem represented by Equation [Eq. 5]:

Posterior Probability = (Prior Probability x Likelihood of Evidence)/ Sum of All (Prior Probability x Likelihood of Evidence) [Eq. 5]

Editor E's mode of decision-making is called the 'sequential sampling procedure' (Phillips, 1973, p. 66) because of the following characteristics of the data-collection procedure. First, evidential information is gathered in stages. Second, the status of the evidence is examined at the end of every stage (e.g., a percentage in Table 14). Third, the evidence collected in successive stages is accumulated in the following way. The posterior probability of any stage serves as the prior probability of its immediately succeeding stage. Fourth, the data-collection procedure is self-terminating in the sense that it stops when the posterior degree of belief assumes a certain value.
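The four characteristics just listed, together with [Eq. 5], may be sketched in a few lines of Python. The sketch is illustrative only: the function names are hypothetical, the run starts from the Period 2 prior probabilities as tabled, and the stopping criterion of .75 is applied to the posterior probability of HC. Because unrounded intermediate values are carried forward, results may differ from the tabled entries in the second decimal place.

```python
def bayes_update(priors, likelihoods):
    """[Eq. 5]: posterior = prior x likelihood / sum of all such products."""
    products = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(products.values())
    return {h: products[h] / total for h in products}


def sequential_sampling(priors, likelihoods_by_period, target="HC", criterion=0.75):
    """Update stage by stage; stop once the target posterior meets the criterion."""
    history = []
    for period, likelihoods in likelihoods_by_period:
        posteriors = bayes_update(priors, likelihoods)
        history.append((period, posteriors))  # evidence is examined at every stage
        if posteriors[target] >= criterion:
            break                             # the procedure is self-terminating
        priors = posteriors                   # posterior becomes the next prior
    return history


# Entries taken from Table 14, Periods 2 through 4.
PRIORS_PERIOD_2 = {"HL": 0.36, "HC": 0.40, "HR": 0.32}
LIKELIHOODS = [
    (2, {"HL": 0.38, "HC": 0.50, "HR": 0.35}),
    (3, {"HL": 0.40, "HC": 0.60, "HR": 0.25}),
    (4, {"HL": 0.30, "HC": 0.75, "HR": 0.35}),
]

for period, posteriors in sequential_sampling(PRIORS_PERIOD_2, LIKELIHOODS):
    print(period, {h: round(p, 2) for h, p in posteriors.items()})
```

The number of polls processed is not an input to the sketch; it falls out of when the criterion happens to be met.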

It is important to emphasize that these four sequential-sampling features are not found in a typical experiment (e.g., the kernel-negative experiment described above). Moreover, Phillips's (1973) 'sequential sampling procedure' characterization does not reflect four other important features of the Bayesian approach. For ease of exposition, these four additional features will be called the 'reflexive' features of the Bayesian data-collection procedure.

First, none of the hypotheses is proposed to explain a phenomenon which invites the investigation. Instead, they are hypotheses about uncertain events in the future. This is unlike the explanatory substantive hypothesis depicted in Table 1. Second, the Bayesian analysis is not about the truth of hypotheses at all. Editor E does not collect data to accept or reject any of the hypotheses (e.g., HC). Rather, Editor E is interested in the 'probabilification' of the hypothesis (Earman, 1992, p. 79). Hence, the reasoning shown in Table 3 or 4 cannot be carried out in the Bayesian approach.

Third, the procedure is reflexive in the sense that the termination of the data-collection procedure depends fortuitously on when the periodic data-inspection is carried out. Had the fourth inspection been delayed till Inspection 5, the evidence would be 50% instead of 55%. (Hence, the 'Prior Probability' entries in Column 4 and the 'Likelihood of Evidence' entries in Column 3 are duplicated in Column 5.) In such an event, the posterior probability for HC is only .70, which is not sufficient for Editor E to stop the polling. On the other hand, the Bayesian sequential sampling is also open-ended. Specifically, the poll does not stop after the third inspection because the posterior probability (viz., .59 in Table 14) is smaller than the one desired by Editor E (viz., .75). The Bayesian modus operandi is best summarized in [Quote 4]:

... the scientist can design an experiment to enable him to collect data bearing on certain hypotheses which are in question, and as he gathers evidence he can stop from time to time to see if his current posterior opinions, determined by applying Bayes' theorem, are sufficiently extreme to justify stopping the experiment (Phillips, 1973, p. 66) [Quote 4]

A concomitant feature of the reflexivity and open-endedness found in the Bayesian methodology is that the size of the data set is ill-defined. It is determined fortuitously by the data-collection procedure.

Fourth, the decision to stop data collection is made on the basis of a criterion not related to what is said in the hypothesis. Note that HC is the hypothesis that the centre party will form the next government. At the same time, Editor E's decision criterion is not the truth of HC, but how certain Editor E is of HC. That is, whether or not the centre party actually forms the next government has no bearing on the reason why the poll is conducted. It is also for this reason that Bayesians do not (and cannot) assess their data with reference to a well-defined criterion in a way independent of the prior probability. Consequently, Bayesians do not talk about objectivity because there is no objective entity or event against which Editor E's decision may be assessed.

In contrast to the third and fourth reflexive features of the Bayesian sequential sampling procedure, experimental psychologists do not treat their data in such a fortuitous way. Instead, the size of the data set in experimental psychology is determined before data are collected. That is, what is said in [Quote 4] is the exact opposite of what experimental psychologists would (or should) do. Specifically, experimenters adhere to their experimental plan, in which are stated, among other things, (a) the number of subjects, (b) the number of sessions a subject has to undergo, and (c) the number of trials per session. There is nothing fortuitous about the size of the data set. Moreover, the experimenter has to assess the consistency between the phenomenon and the substantive hypothesis, as well as that between the experimental prescription and the data. In other words, objectivity is important as well as possible in experimental psychology.

In short, the data-collection procedure congenial to the Bayesian analysis is not appropriate for most psychological research. At the same time, the Bayesian hypothesis is not a prospective explanation of a phenomenon that exists before, as well as independently of, the data collection procedure. The reflexive dependence of the Bayesian hypothesis on the data-collection procedure is responsible for the Bayesian disregard of objectivity. Consequently, the applicability of the Bayesian method to psychological research in general, and to theory-corroboration experiments in particular, is questionable. The tenability of explanatory hypotheses cannot be ascertained by appealing to the researchers' subjective degrees of belief if the to-be-explained phenomenon exists prior to, as well as independently of, the data-collection exercise.

37. Methodological Criticisms in Disguise

Mention has been made that the role of the associated probability, p, in NHSTP is questioned by critics. Specifically, although p is not an index of the replicability of the result, the researcher tends to stop further investigation after a significant result (Bakan, 1966). Critics find this wanting. They argue that researchers should conduct replication studies. This is important to Bayesians because their objective of empirical research is to revise the prior probability in light of new data. It is also necessary to conduct replication studies if one subscribes to the meta-analytic approach (Glass et al., 1981; Hunter & Schmidt, 1990; Rosenthal, 1984). Moreover, it has also been said that NHSTP results may be disambiguated with replication studies (Thompson, 1996).

These arguments for conducting replication studies are an indication that the real concern implicit in some criticisms of NHSTP has nothing to do with it as a statistical procedure. The criticisms are methodological critiques in disguise (inadvertently, perhaps). For this reason, another argument in support of NHSTP may be offered by showing that replication is not sufficient for theory corroboration. Worse still, successful replications may actually be misleading. Instead, the tenability of an explanatory substantive hypothesis is ascertained with a series of converging operations (Garner, Hake, & Eriksen, 1956). NHSTP is used in every study in the series.

38. How Important is Replicability?

Several reasons may explain why replicability has captured critics' favorable attention. First, there are the ambiguity-anomaly criticisms of NHSTP, the collective point of which is that statistical significance may be reached fortuitously. There is the additional assumption that a fortuitous result is unlikely to be replicated. However, these criticisms can be answered, as may be recalled from 'The Sample Size-Significance Dependence Problem Revisited' discussion in Section 15. Be that as it may, some critics see another source of ambiguity.

They emphasize the arbitrariness of the choice of the α level. Specifically, setting α = .05 is a convention. The question is raised as to why α is not set at the .10, .07, .01 or .005 level. More important, a result significant at the .05 level may not be significant at the .01 level (e.g., when the calculated t is 2.4 for a 1-tailed test with df = 18). By the same token, a result not significant at the .05 level may be significant at the .1 level (e.g., when the calculated t is 1.5 for a 1-tailed test with df = 18). In short, statistical significance may simply be the fortuitous choice of the α level.
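The two examples in the preceding paragraph can be checked against the 1-tailed critical values of t for df = 18 found in a standard t table (1.330, 1.734 and 2.552 at the .10, .05 and .01 levels, respectively). The short Python sketch below is illustrative; the names `CRITICAL_T_DF18` and `is_significant` are hypothetical.

```python
# 1-tailed critical values of t for df = 18, copied from a standard t table.
CRITICAL_T_DF18 = {0.10: 1.330, 0.05: 1.734, 0.01: 2.552}


def is_significant(t_value, alpha, critical=CRITICAL_T_DF18):
    """Return True when the calculated t exceeds the 1-tailed critical value."""
    return t_value > critical[alpha]


for t_value in (2.4, 1.5):
    decisions = {alpha: is_significant(t_value, alpha)
                 for alpha in sorted(CRITICAL_T_DF18, reverse=True)}
    print(f"t = {t_value}: {decisions}")
```

The same calculated t leads to different decisions only because a different criterion is consulted; the data themselves are unchanged.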

This 'fortuitous choice of α' criticism of NHSTP seems like a demand for an absolute proof for the substantive hypothesis. This demand cannot be met on logical grounds because it is impossible to prove any theory (i.e., substantiate any theoretical claim) with absolute certainty. It is not a limitation brought about by using NHSTP. It is the result of having to affirm the consequent of the major premises of the conditional syllogistic arguments implicated (see Table 3). It so happens that deductive logic does not allow drawing a definite conclusion about the antecedent of the major premise by affirming its consequent.

There is a further reason why the 'fortuitous choice of α' criticism cannot logically be answered. The reason is the reality of multiple explanations for any phenomenon. To ask for absolute certainty under such circumstances is to require the elimination of all possible alternative explanations. As this is impossible, the best one can do is to draw a tentative conclusion with the help of the inductive principle underlying the experimental design (see Table 5). The important point is that the criticism that the experimental result does not provide conclusive evidential support should not be directed at NHSTP at all, because the difficulty is not brought into the research procedure by the use of NHSTP.

At the same time, it is possible to show that the 'fortuitous choice of α' criticism itself is misguided. The criticism should be considered with reference to the fact that the α level is determined before data collection. That is, the decision is made before data collection that the level of strictness of rejecting H0 stipulated by α is sufficient for the research in question. Consequently, subsequent decisions about the experimental task, experimental design, number of subjects, amount of training given to the subjects, number of test sessions, number of trials per session, and the like are all made with the understanding that the chosen α level is of sufficient strictness. Had a more stringent α level been deemed necessary, the other features of the experiment would have been different. Moreover, the 'fortuitous choice of α' criticism is a vacuous criticism because critics can always stipulate a stricter criterion (viz., .001, .0005, etc.) after the completion of data collection.

Suppose that the statistical significance of the result is deemed not fortuitous and that the α level is accepted as adequate. Critics may still point out that replication studies are necessary because there is still the .05 probability of committing the Type I error. What does it mean to have rejected H0 incorrectly? It means attributing the observed effect (in the technical sense of the word) to the experimental manipulation when, in fact, another variable may be used to explain the result. This is another way of saying that the experimental manipulation may have been confounded with an unknown variable. Seen in this light, the 'fortuitous choice of α' critique is a design issue, not a difficulty inherent in NHSTP. Important for the present discussion is that this difficulty cannot be eliminated by conducting replication studies because a prerequisite of a replication study is to reinstate the data collection conditions of the original study. To the extent that this reinstatement is successful, the original confounding may still occur. In other words, absolute certainty about the substantive hypothesis cannot be established by successful replications. This state of affairs calls into question the necessity of conducting replication studies.

To recapitulate, critics' insistence on conducting replication studies seems to be motivated by (a) the wish to establish the tenability of the substantive hypothesis with absolute certainty and (b) the desire to disambiguate the NHSTP outcome. In defence of NHSTP, it is suggested that these concerns are methodological, not statistical. For example, some critics suggest that researchers should report the associated probabilities, p, of individual studies so that meta-analysis can be carried out.

It can be concluded that, while replicability may be a necessary condition for theory corroboration, it is not a sufficient condition. A more positive defence of NHSTP consists of showing (a) that the 'fortuitous choice of α' criticism cannot be used to justify the meta-analytic approach, (b) how cognitive psychologists ascertain a substantive hypothesis by eliminating alternative substantive hypotheses and unknown confounding variables with converging operations, and (c) that NHSTP is used in all such attempts. This positive defence may be presented in the context of studying the iconic store experimentally.

39. The 'Perceive More Than Can Be Recalled' Phenomenon

Suppose that you take a quick glance at the rear-view mirror while driving. You can report only a few things despite the feeling that you have seen more. This not uncommon experience is the phenomenon to be explained by the iconic store (Neisser, 1967; Sperling, 1960). The iconic store is said to have a relatively large storage capacity and a very short retention interval. Forgetting from the iconic store occurs because of information decay. Lastly, only unprocessed sensory information is available in the iconic store.

Suppose that there are 12 studies of the iconic store (see the 'Study' column in Table 15). While the result is significant in eight studies, it is non-significant in Studies 4, 5, 9, and 12 (see the 'p of Test Statistic' column). Among the eight studies with statistically significant results, the p values of Studies 6 and 8 are very close to .05 while that of Study 3 is exactly .05. Although the result is non-significant in Study 5, the p value does not differ from .05 by much. This state of affairs may reinforce the 'fortuitous choice of α' criticism.

Table 15. The incommensurability difficulty of meta-analysis illustrated with fictitious 'raw data'

Study  p of Test Statistic  Effect Size [1]  Independent Variable                       Property or Function of the Iconic Store Studied
1      .021*                0.7              ISI [2]                                    Rate of decay
2      .001*                0.3              Type of task                               Relatively large storage capacity
3      .050*                0.5              Number of concurrent tasks                 Independence from the short-term store
4      .110                 0.11             What to recall                             Independence of location and identity information
5      .068                 0.17             Stimulus material                          Non-associative information
6      .049*                0.4              Type of task                               Visible persistence
7      .02*                 1.5              Type of material                           Unprocessed information
8      .046*                1                Time of probe presentation                 Select before processing
9      .070                 0.06             ISI                                        Visible persistence
10     .04*                 0.18             Stimulus duration                          Information registration rate
11     .038*                0.2              Stimulus duration within a fixed SOA [3]   Information registration rate
12     .066                 0.29             Type of material                           No identity information

Combined Z
Combined Effect Size

* denotes significance at the 0.05 level

[1] J. Cohen's (1987) d

[2] 'ISI' refers to the inter-stimulus interval, the interval between the offset of the stimulus and the onset of the partial-report tone (see Footnote 2 in Chapter 5).

[3] 'SOA' refers to stimulus-onset asynchrony, the interval between the onset of the stimulus and the onset of the mask.

40. Meta-analysis and Its Difficulties

Some critics suggest that the p values of the test statistic may be used to obtain a combined Z or the effect-size estimates may be used to obtain the combined effect size. A statement about the overall statistical significance is then made on the basis of the combined Z, called 'combined significance level' (Harris & Rosenthal, 1985). That is, the p, or effect-size, values from individual studies are treated as raw data (hence 'raw data' in the table's title) and subjected to statistical analysis at a higher level of abstraction. This more abstract analysis is called meta-analysis or the 'analysis of analysis' (Glass, 1976, 1978; Glass & Kliegl, 1983; Glass et al., 1981; Harris & Rosenthal, 1985; Schmidt, 1992).
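How a combined Z may be obtained from the p values can be sketched as follows. The text does not say which combination rule is intended; the sketch assumes Stouffer's unweighted method (the sum of the z equivalents of the p values divided by the square root of the number of studies) and assumes that the p values are 1-tailed. The names `P_VALUES` and `combined_z` are hypothetical.

```python
from math import sqrt
from statistics import NormalDist

# The fictitious p values of the 12 studies in Table 15, in study order.
P_VALUES = [0.021, 0.001, 0.050, 0.110, 0.068, 0.049,
            0.02, 0.046, 0.070, 0.04, 0.038, 0.066]


def combined_z(p_values):
    """Stouffer's unweighted method: sum of z equivalents over sqrt(k)."""
    z = NormalDist()  # standard normal distribution
    z_scores = [z.inv_cdf(1 - p) for p in p_values]  # assumes 1-tailed p values
    return sum(z_scores) / sqrt(len(z_scores))


z_combined = combined_z(P_VALUES)
p_combined = 1 - NormalDist().cdf(z_combined)
print(f"combined Z = {z_combined:.2f}, combined p = {p_combined:.2g}")
```

The calculation uses the 'p of Test Statistic' column alone; the 'Independent Variable' and 'Property or Function' columns of Table 15 play no part in it.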

Meta-analysis has recently been promoted as a theory-corroboration tool, in addition to being an antidote for rectifying the harm done by using NHSTP (Cooper, 1979; Cooper & Rosenthal, 1990; Schmidt, 1996). Specifically, it is an important meta-analytic assumption that knowledge grows with the accumulation of research results. The binary nature of the NHSTP outcome is deemed incompatible with the incremental growth of knowledge. This difficulty is amplified by the 'fortuitous choice of α' criticism.

Some meta-theoretical issues have to be settled first before meta-analysis can be accepted as a valid theory-corroboration tool. They are the problems of (a) selection, (b) a lack of independence, (c) the unjustifiable disregard for research quality, and (d) a lack of commensurability among the to-be-aggregated studies (see Chow, 1987b, 1987c; Cook & Leviton, 1980; Eysenck, 1978; Gallo, 1978; Leviton & Cook, 1981; Mintz, 1983; Rachman & Wilson, 1980; Sohn, 1980; and Wilson & Rachman, 1983). Of interest here is the problem of a lack of commensurability among the studies included in the meta-analysis.

To use the combined Z or effect size of the 12 studies in Table 15 to ascertain the tenability of the iconic store is to assume that it is legitimate to combine the data from the 12 experiments and to use them in toto. This assumption must be questioned in view of the fact that different independent variables are used (see the 'Independent Variable' column in Table 15). At the same time, different dependent variables are implicated. For example, the dependent variables may be the number of items available in one study, but the correct reaction times, number of errors made or the types of error made in another study. How is it meaningful to take the average of the effect measured in terms of correct reaction times and that measured in terms of the frequencies of different kinds of errors? What would the average mean at the conceptual level? In other words, to combine the information from qualitatively different experiments is like the illegal practice of mixing apples and oranges (Cook & Leviton, 1980; Mintz, 1983; Presby, 1978).

Glass et al. (1981) answer the 'apples and oranges' reservation by saying that apples and oranges are fruits. They argue that things that are incommensurable at one level of discourse become commensurable if they are subsumed in a higher-order category. However, this is not a good answer. Specifically, the outcome of Study A may be due to the acidity of oranges, which is not a property common to all fruits. At the same time, the texture of apples may be the reason for the outcome of Study B; and the texture is also not a property common to all fruits. That is to say, meta-analysts have not provided a good theoretical justification for ignoring the qualitative differences among the diverse sets of research data.

41. Converging Operations

It is a meta-analytic assumption that our understanding of an issue improves when more data are accumulated and used in toto. Moreover, merely knowing the significance or non-significance of individual studies is not suitable for such a data-accumulation exercise. However, the 'apples and oranges' difficulty shows that the meta-analytic argument is debatable because 'accumulate' is used in a quantitative and mechanical sense. The disregard for the qualitative differences among various studies prevents meta-analysts from seeing that knowledge evolves in a far from straightforward fashion. There are a lot of trials and errors at the conceptual level. NHSTP plays an important role in every one of these steps. This alternative view (to the meta-analytic one) may also be used to illustrate (a) the rationale of conducting converging operations, (b) how the difficulty of perpetuating confounding variables in replication studies may be minimized, and (c) the difficulty with the Bayesian insistence on interpreting current data with reference to prior probability.

42. The Rationale of Converging Operations

The implicative relations among the quartet of hypotheses implicated in the theory-corroboration experiment depicted in Table 1 become more complicated as the investigation of the substantive hypothesis progresses. Successive rows of Table 16 represent successive stages of the theory-corroboration endeavor. The series of studies need not be (and often is not) undertaken by the same experimenter.

The only requirement for the tenability of the iconic store before any data collection is that what is attributed to the iconic store (i.e., what is said in H) be consistent with P, the 'perceive more than can be recalled' phenomenon (see the 'Before Experimentation' row in Table 16). Phenomenon P is the prior data in the sense that it exists before the substantive hypothesis. There are no evidential data for the iconic store before the first experiment because Phenomenon P itself cannot be used as the evidence (hence, there is no entry in the cell defined by the intersection of the 'Data' column and the 'Before Experimentation' row). To use the original phenomenon for such a purpose is to commit the circularity error.

Table 16. The phenomenon-hypothesis-implication-data (P-H-I-D) consistency at different research stages of the substantive hypothesis, H

Experiment             | Phenomenon/Prior Data | Hypothesis | Implication | Data | P-H-I-D Consistency
Before Experimentation | P                     | H          |             |      | Yes
1                      | P                     | H          | I1          | D1   | Yes
2                      | P+D1                  | H          | I2          | D2   | Yes
3                      | P+D1+D2               | H          | I3          | D3   | Yes
...                    |                       | H          | ...         | ...  | Yes
...                    |                       | H          | ...         | ...  | Yes
n-1                    | P+D1+...+Dn-2         | H          | In-1        | Dn-1 | Yes
n                      | P+D1+...+Dn-1         | H          | In          | Dn   | Yes
...                    |                       | H          | ...         | ...  | Yes
t                      | P+D1+...+Dt-1         | H          | It          | Dt   | Yes

In the absence of any evidential data, the implication of the first study (viz., I1) has to be consistent with P, in addition to being a theoretical derivation from H. The data from this study (i.e., D1) are compared to what is said in I1. Necessary for the tenability of H is that D1 be consistent with I1. Hence, the emphasis is on the phenomenon-hypothesis-implication-data (P-H-I-D) consistency in Table 16. This is the basis of objectivity, something ignored in the Bayesian and meta-analytic approaches.

The data-implication consistency (i.e., the consistency between Di and Ii; where i represents the row number in Table 16) is ascertained with NHSTP, as described in Tables 2 and 3 (or 4, as the case may be). Only a binary decision is required to initiate the chain of deductive reasoning depicted in Table 3 or 4. Hence, the binary nature of NHSTP decision is adequate for this purpose. Why do critics consider the binary decision incompatible with the incremental growth of knowledge? One reason is that, as has been suggested earlier on, critics identify NHSTP with the theory-corroboration process itself. The other reason is that there are two ways to look at the information accumulated, as well as how the accumulated information is utilized.

Meta-analysts accumulate data via the test statistics of individual studies (see the 'p of Test Statistic' or 'Effect Size' column in Table 15). Bayesians accumulate raw data via the role played by the prior probability in Bayes' theorem (see Table 14). In both cases, all data accumulated to date are used as the evidence. In contrast, in the theory-corroboration approach only data collected in the current study are used to ascertain the tenability of the substantive hypothesis (e.g., as may be seen from Table 16, only Dn is used to assess the tenability of In in Study n; see also Tables 2, 3 and 4). More important, research data obtained in previous studies have no evidential role in the current study. Instead, conclusions drawn from earlier data are used in cognitive psychology as theoretical constraints on the 'If H, then Implication' derivation. For example, Implication In has to be consistent with P+D1+...Dn-1 (see Row [n - 1] of Table 16).

Each of the implications in Table 16 is a criterion of rejection for the substantive hypothesis, H, in the following sense. H has to be rejected if the data of any study (e.g., Dn) are inconsistent with what is stipulated in that study (viz., In in the case of Study n). In more concrete terms, studies of the iconic store are attempts to substantiate the theoretical properties that have been attributed to the iconic store (viz., those tabulated in the 'Property or Function of the Iconic Store Studied' column of Table 15). For example, a fast rate of decay has been attributed to the information residing in the iconic store. An implication of this theoretical property is that performance on Sperling's (1960) partial-report task should decline rapidly within 250 to 500 msec when the inter-stimulus interval (ISI) is manipulated. The postulation of the iconic store would be untenable if this theoretical prescription were not demonstrated.
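The bookkeeping implied by Table 16 may be sketched as follows. This is only an illustration under stated assumptions: the names Stage, build_stages and first_rejection are hypothetical, and the binary flag stands in for the NHSTP-based decision about whether Di agrees with Ii. The sketch shows two points made above: earlier data enter only as constraints on deriving the next implication, and a single inconsistent study suffices to reject H.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Stage:
    experiment: str
    prior: List[str]      # P plus earlier data: constraints on deriving Ii, not evidence
    implication: str      # Ii, derived from H under those constraints
    data_agree: bool      # binary outcome of comparing Di with Ii (stand-in for NHSTP)


def build_stages(outcomes):
    """Build successive rows of a Table 16-like record from binary study outcomes."""
    stages = []
    prior = ["P"]
    for i, agree in enumerate(outcomes, start=1):
        stages.append(Stage(str(i), list(prior), f"I{i}", agree))
        prior.append(f"D{i}")  # D_i constrains later derivations only
    return stages


def first_rejection(stages) -> Optional[str]:
    """Return the first experiment whose data contradict its implication, else None."""
    for stage in stages:
        if not stage.data_agree:
            return stage.experiment
    return None


# Hypothetical outcomes: Studies 1 and 2 agree with their implications, Study 3 does not.
stages = build_stages([True, True, False])
rejected_at = first_rejection(stages)
```

Note that the decision in each row depends only on that row's flag; the prior list grows from row to row but is never consulted as evidence.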

It follows that the 'Independent Variable' and 'Property or Function of the Iconic Store Studied' columns in Table 15 jointly represent attempts to falsify the theory of the iconic store from various angles (viz., by testing the various theoretical properties of the iconic store with different experimental tasks in diverse settings by different experimenters). Suppose all these falsification attempts fail. That means that, as the research progresses, more and more of its theoretical properties are substantiated in qualitatively different situations. Hence, these various studies may be said to converge on the tenability of the iconic store. Researchers' confidence in the iconic store as a theoretical mechanism increases as more and more of these falsification attempts fail. In short, the series of experiments collectively form the converging operations used in validating the iconic store.

The studies depicted in Table 15 are not replication studies because they differ greatly among themselves. Suppose that critics question the statistically significant result of Study 4. As has been suggested earlier, this reservation is mostly a question about data interpretation. One source of ambiguity is the presence of an unknown variable which has varied systematically with the research manipulation. That is, the data may reflect something other than the property of the iconic store under investigation. This ambiguity cannot be eliminated by replicating the experiment because every time the original study is repeated, the confounding variable would also occur.

The situation is very different in the case of converging operations. The theoretical property investigated in Study 4 has a different implication in another setting (hence, Study 8). Although Studies 4 and 8 are about the same theoretical property of the iconic store, radically different tasks and experimental manipulations are involved. Hence, in general terms, it is less likely that the same confounding variable is found in all of the studies when radically different data-collection conditions are used in the series of converging operations. This is the reason why conducting converging operations is more satisfactory than conducting replication studies.

43. Objectivity and The Bayesian Prior Probability

Bayesians bemoan the fact that non-Bayesian researchers do not take into account the prior probability of the hypothesis when research conclusions are drawn. This Bayesian point may be illustrated with Column 1 of Table 14. The evidence is that 30% of the voters polled indicate a preference for the centre party (i.e., HC). The likelihoods are .38, .35 and .32 for HL, HC, and HR, respectively. That is, one may argue that the evidence is more favorable to HL than to HC if one ignores the impact of the prior probabilities of the three hypotheses. Bayesians would point to the fact that the posterior probabilities are .36, .40, and .32, respectively for HL, HC and HR. In Phillips' (1973) words,

Whether or not the [poll result] tells the [editor] something about the [election], but the information conveyed by the [poll] is far less than that shown in the prior probabilities. In this case the prior probabilities swamp out the information in the [poll], so that the posterior probabilities are determined more by the priors than by the likelihoods. The extra information given by the [poll] does not change the prior probabilities enough to warrant [changing the editor's editorial decision]. (Phillips, 1973, p.74, explications in parentheses added) [Quote 5]
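The arithmetic behind 'the priors swamp the likelihoods' can be illustrated with a minimal sketch of Bayes' rule. The likelihoods are the ones quoted above (.38, .35 and .32). The prior probabilities are hypothetical values chosen only for illustration; they are not the priors in Table 14, and the resulting posteriors are therefore not those reported in the text.

```python
def posterior(priors, likelihoods):
    """Bayes' rule: posterior is proportional to prior times likelihood."""
    joint = {h: priors[h] * likelihoods[h] for h in priors}
    total = sum(joint.values())
    return {h: joint[h] / total for h in joint}


# Likelihoods quoted in the text for the three hypotheses.
likelihoods = {"HL": 0.38, "HC": 0.35, "HR": 0.32}

# HYPOTHETICAL priors (assumed for illustration; not taken from Table 14).
hypothetical_priors = {"HL": 0.2, "HC": 0.6, "HR": 0.2}

post = posterior(hypothetical_priors, likelihoods)

favored_by_likelihood = max(likelihoods, key=likelihoods.get)
favored_by_posterior = max(post, key=post.get)
```

With these assumed priors, the likelihoods alone favor HL whereas the posterior favors HC, which is the pattern Phillips describes. The sketch also makes plain that the ordering of the posteriors is driven by numbers supplied before the data are collected.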

The Bayesian treatment of the prior probability of the hypothesis is the opposite of what experimental psychologists should accept. First, what does a higher posterior probability mean? It means simply that Editor E has the highest confidence in HC. However, this is not the same as saying that HC is necessarily or inevitably true. Second, Bayesians find 'probabilification' satisfactory because they treat their reflexive sequential sampling task as the prototype of all empirical research. Consequently, they do not find it necessary to assess the data with reference to the consistency between (a) the to-be-explained phenomenon and the substantive hypothesis and (b) the experimental prescription and the data.

There is a third reason why what is said in [Quote 5] is debatable. Do Bayesians mean to suggest that research should be designed and data interpreted in ways determined by how the researcher feels about the hypothesis? Is this not an invitation to inject biases into the research process? This is the opposite of what should be the case in view of the rationale described in Tables 1, 2, 3 and 5.

Experimenters following the research rationale represented in Tables 1 and 16 always give the to-be-corroborated hypothesis the benefit of the doubt. That is, the derivation of the experimental hypothesis (and hence, the design and execution of the experiment) is based on the assumption that the substantive hypothesis is true. This assumption is made even when the experimenter does not like the hypothesis. Moreover, the statistical, experimental and theoretical conclusions are drawn with reference to the rules described in Tables 2 and 3 (or 4). How the researcher feels about the hypothesis or data has no role in the statistical, inductive and deductive reasoning. It is necessary to consider the Bayesian methodology more judiciously.

44. Summary and Conclusions

Not much is said in this defence about the criticism that various aspects of NHSTP are often misunderstood (e.g., the meaning of p, identifying α as an index of replicability, etc.) for the simple reason that these misunderstandings are not difficulties inherent in NHSTP. Instead, the emphases in this defence are on some meta-theoretical assertions about NHSTP. The most notable one is the commonly accepted view that the null hypothesis is never true. This view is problematic because the null hypothesis is not used in NHSTP as a categorical proposition descriptive of the world.

The null hypothesis appears in two different conditional propositions. First, it is the implication of the hypothesis that chance influences are responsible for the experimental result. What is said in the null hypothesis is (and should be) true if the control and experimental conditions are set up properly. Second, the null hypothesis is used to stipulate the lone sampling distribution to be used in making the statistical decision about chance influences. This utility of the null hypothesis shows that NHSTP is often misrepresented at the graphical level. More important, it shows that statistical significance and effect size (or statistical power) belong to different levels of abstraction (viz., the level of sampling distribution versus the level of raw scores).
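The second role of the null hypothesis can be illustrated with a minimal sketch. It assumes a one-tailed z test with alpha = .05 purely for illustration; the function names are hypothetical and the text does not prescribe this particular test. The point of the sketch is that the criterion is fixed by the lone sampling distribution stipulated by H0, and that no second distribution (and hence no effect size or power value) enters the binary decision.

```python
from statistics import NormalDist


def critical_z(alpha=0.05):
    """Criterion value taken from the single H0 sampling distribution (standard normal)."""
    return NormalDist(mu=0.0, sigma=1.0).inv_cdf(1.0 - alpha)


def exclude_chance(z_observed, alpha=0.05):
    """Binary statistical decision: True if chance is rejected as an explanation."""
    return z_observed >= critical_z(alpha)


# Hypothetical observed test statistics.
decision_a = exclude_chance(2.0)   # beyond the criterion
decision_b = exclude_chance(1.0)   # short of the criterion
```

A True outcome says only that chance influences can be excluded at the chosen alpha level. It does not identify the nonchance factor, which is the business of the experimental design.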

The present defence of NHSTP is conducted in the context of the theory-corroboration experiment. It is argued that NHSTP is not theory corroboration. However, NHSTP provides the objective means to exclude chance influences as an explanation of research data. This statistical decision provides the minor premise for the first of three embedding syllogisms. The asymmetry between modus tollens and affirming the consequent arguments can be tentatively resolved by appealing to an inductive rule underlying the experimental design. These considerations lead to the conclusion that many criticisms of NHSTP are actually questions about the inductive conclusion validity of the research.

The putative importance of the effect size is called into question by showing that the size of the effect is not an index of the evidential support for the substantive hypothesis offered by the data. Nor can the effect size, by itself, be the index of the practical importance of the research result in the case of the utilitarian experiment. It is made clear that statistics and practical validity belong to different domains.

Some difficulties with power analysis are illustrated with the affinity between NHSTP and TSD envisaged in power analysis. A notable example is that, being a conditional probability, statistical power cannot be the probability of obtaining statistical significance. As there is a Bayesian overtone in power analysis, power analysis can be questioned to the extent that the Bayesian assumptions about research methodology are debatable. At best, the Bayesian approach has a very limited applicability in psychological research because it is applicable only to the sequential sampling procedure. It cannot be used to investigate explanatory hypotheses.

In short, what motivates some criticisms of NHSTP may be understood by the fact that although statistical significance provides a rational basis for rejecting chance influences as an explanation of data, it is not informative as to what the nonchance factor is. Moreover, statistical significance says nothing about the real-life importance of the data. It is argued in this defence that the real issue concerns why NHSTP is expected to furnish such information, given that NHSTP is a statistical procedure. What also has to be said is that the alternative numerical indices suggested by critics of NHSTP (viz., effect size and confidence interval) are also incapable of pinpointing the nonchance factor responsible for the research result. The point is simply that statistics and practical importance belong to two different domains. Why should the tools in one domain be applicable to settle questions belonging to another domain?

Be that as it may, there is a positive way to look at the criticisms of NHSTP. The critiques are attempts to challenge NHSTP users to rationalize the research procedure in general, and the role of NHSTP in such a procedure in particular. The present defence of NHSTP is one such attempt. This account of NHSTP is not expected to be the final one. Nonetheless, it will fulfill its purpose if it serves as the basis for further exploration of the issues raised in the course of the present argument. It is hoped that a coherent view of NHSTP pertinent to empirical research will emerge from the ensuing discussion.

References

Bakan, D. (1966). The test of significance in psychological research. Psychological Bulletin, 66, 423-37.

Boring, E. G. (1954). The nature and history of experimental control. American Journal of Psychology, 67, 573-89.

Boring, E. G. (1969). Perspective: Artifact and control. In R. Rosenthal, & R. L. Rosnow (Eds.), Artifacts in behavioral research (pp. 1-11). New York: Academic Press.

Campbell, D. T. (1969). Prospective: Artifact and control. In R. Rosenthal, & R. L. Rosnow (Eds.), Artifacts in behavioral research (pp. 351-382). New York: Academic Press.

Campbell, D. T., & Stanley, J. C. (1963). Experimental and quasi-experimental designs for research. Chicago: Rand McNally.

Chomsky, N. (1957). Syntactic structures. The Hague: Mouton.

Chow, S. L. (1987a). Experimental Psychology: Rationale, procedures and issues. Calgary: Detselig.

Chow, S. L. (1987b). Some reflections on Harris and Rosenthal's thirty-one meta-analyses. Journal of Psychology, 121, 95-100.

Chow, S. L. (1987c). Meta-analysis of pragmatic and theoretical research: A critique. Journal of Psychology, 121, 259-71.

Chow, S. L. (1988). Significance test or effect size? Psychological Bulletin, 103, 105-10.

Chow, S. L. (1989). Significance tests and deduction: Reply to Folger (1989). Psychological Bulletin, 106, 161-5.

Chow, S. L. (1991a). Conceptual rigor versus practical impact. Theory & Psychology, 1, 337-60.

Chow, S. L. (1991b). Rigor and logic: A response to Comments on "Conceptual Rigor." Theory & Psychology, 1, 389-400.

Chow, S. L. (1991c). Some reservations about statistical power, American Psychologist, 46, 1088-9.

Chow, S. L. (1992). Research methods in psychology: A primer. Calgary: Detselig.

Cohen, J. (1965). Some statistical issues in psychological research. In B. B. Wolman (Ed.), Handbook of clinical psychology (pp. 95-121). New York: McGraw-Hill.

Cohen, J. (1987). Statistical power analysis for the behavioral sciences (Revised edition). New York: Academic Press.

Cohen, J. (1990). Things I have learned (so far). American Psychologist, 45, 1304-12.

Cohen, J. (1992a). Statistical power analysis. Current Directions in Psychological Science, 1, 98-105.

Cohen, J. (1992b). A power primer. Psychological Bulletin, 112, 155-9.

Cohen, J. (1994). The earth is round (p < .05). American Psychologist, 49, 997-1003.

Cohen, M. R., & Nagel, E. (1934). An introduction to logic and scientific method. London: Routledge & Kegan Paul.

Coltheart, M. (1980). Iconic memory and visible persistence. Perception & Psychophysics, 27, 183-228.

Cook, T. D., & Campbell, D. T. (1979). Quasi-experimentation: Design and analysis issues for field settings. Chicago: Rand McNally.

Cook, T. D., & Leviton, L. C. (1980). Reviewing the literature: A comparison of traditional methods with meta-analysis. Journal of Personality, 48, 449-72.

Cooper, H. M. (1979). Statistically combining independent studies: A meta-analysis of sex differences in conformity research. Journal of Personality and Social Psychology, 37, 131-46.

Cooper, H. M., & Rosenthal, R. (1980). Statistical versus traditional procedures for summarizing research findings. Psychological Bulletin, 87, 442-9.

Copi, I. (1982). Symbolic logic (6th edition). New York: Macmillan.

Danziger, K. (1990). Constructing the subject: Historical origins of psychological research. Cambridge: Cambridge University Press.

Darlington, R. B., & Carlson, P. M. (1987). Behavioral statistics: Logic and methods. New York: Collier Macmillan Publishers.

Earman, J. (1992). Bayes or bust? A critical examination of Bayesian confirmation theory. Cambridge, Mass.: The MIT Press.

Eysenck, H. J. (1978). An exercise in mega-silliness. American Psychologist, 33, 517.

Falk, R., & Greenbaum, C. W. (1995). Significance tests die hard: The amazing persistence of a probabilistic misconception. Theory & Psychology, 5, 75-98.

Fillmore, C. J. (1968). The case for case. In E. Bach, & R. T. Harms (Eds.), Universals in linguistic theory (pp. 1-90). New York: Holt, Rinehart, and Winston.

Fisher, R. A. (1959). Statistical methods and scientific inference (2nd edition). New York: Hafner Publishing Co.

Gallo, P. S., Jr. (1978). Meta-analysis--A mixed meta-phor? American Psychologist, 33, 515-7.

Garner, W. R., Hake, H. W., & Eriksen, C. W. (1956). Operationalism and the concept of perception. Psychological Review, 63, 149-59.

Gergen, K. J. (1991). Emerging challenges for theory and psychology. Theory & Psychology, 1, 13-35.

Gigerenzer, G. (1993). The superego, the ego, and the id in statistical reasoning. In G. Keren, & C. Lewis (Eds.), A handbook for data analysis in the behavioral sciences: Methodological issues (pp. 311-39). Hillsdale, New Jersey: Lawrence Erlbaum Associates.

Glass, G. V. (1976). Primary, secondary and meta-analysis of research. Educational Researcher, 5, 3-8.

Glass, G. V. (1978). Integrating findings: The meta-analysis of research. Review of Research in Education, 5, 351-79.

Glass, G. V., & Kliegl, R. M. (1983). An apology for research integration in the study of psychotherapy. Journal of Consulting and Clinical Psychology, 51, 28-41.

Glass, G. V., McGaw, B., & Smith, M. L. (1981). Meta-analysis in social research. Beverly Hills, CA: Sage.

Haber, R. N. (1983). The impending demise of the icon: A critique of the concept of iconic storage in visual information processing. Behavioral and Brain Sciences, 6, 1-11.

Hagen, R. L. (1997). In praise of the null hypothesis statistical test. American Psychologist, 52, 15-24.

Harris, M. J., & Rosenthal, R. (1985). Mediation of interpersonal expectancy effects: 31 meta-analyses. Psychological Bulletin, 97, 363-86.

Hogben, L. (1957). Statistical theory: The relationship of probability, credibility and error. New York: W. W. Norton.

Hunter, J. E., & Schmidt, F. L. (1990). Methods of meta-analysis: Correcting error and bias in research findings. Newbury Park, California: SAGE Publications.

Keppel, G., & Underwood, B. J. (1962). Proactive inhibition in short-term retention of single items. Journal of Verbal Learning and Verbal Behavior, 1, 153-161.

Kirk, R. E. (1984). Basic statistics (2nd edition). Pacific Grove, CA: Brooks/Cole.

Kraemer, H. C., & Thiemann, S. (1987). How many subjects? Statistical power analysis in research. Newbury Park, CA: Sage.

Leviton, L. C., & Cook, T. D. (1981). What differentiates meta-analysis from other forms of review. Journal of Personality, 49, 231-6.

Manicas, P. T., & Secord, P. F. (1983). Implications for psychology of the new philosophy of science. American Psychologist, 38, 399-413.

Meehl, P. E. (1967). Theory-testing in psychology and physics: A methodological paradox. Philosophy of Science, 34, 103-15.

Meehl, P. E. (1978). Theoretical risks and tabular asterisks: Sir Karl, Sir Ronald, and the slow progress of soft psychology. Journal of Consulting and Clinical Psychology, 46, 429-71.

Meehl, P. E. (1990). Appraising and amending theories: The strategy of Lakatosian defense and two principles that warrant it. Psychological Inquiry, 1, 108-41.

Mill, J. S. (1973). A system of logic: Ratiocinative and inductive. Toronto: University of Toronto Press.

Miller, G. A. (1956). The magical number seven, plus or minus two: Some limits on our capacity for processing information. Psychological Review, 63, 81-97.

Miller, G. A. (1962). Some psychological studies of grammar. American Psychologist, 17, 748-62.

Mintz, J. (1983). Integrating research evidence: A commentary on meta-analysis. Journal of Consulting and Clinical Psychology, 51, 71-5.

Mook, D. G. (1983). In defense of external invalidity. American Psychologist, 38, 379-87.

Morrison, D. E., & Henkel, R. E. (Eds.). (1970). The significance test controversy: A reader. Chicago: Aldine.

Mosteller, F., & Bush, R. R. (1954). Selected quantitative techniques. In G. Lindzey (Ed.), Handbook of social psychology: Volume 1 - Theory and method (pp. 289-334). Reading, Mass.: Addison-Wesley.

Neisser, U. (1967). Cognitive psychology. New York: Appleton-Century-Crofts.

Neyman, J., & Pearson, E. S. (1928). On the use and interpretation of certain test criteria for purposes of statistical inferences (Part I). Biometrika, 20A, 175-240.

Oakes, M. (1986). Statistical inference: A commentary for the social and behavioral sciences. Chichester: John Wiley & Sons.

Phillips, L. D. (1973). Bayesian statistics for social scientists. London: Nelson.

Popper, K. R. (1968a). The logic of scientific discovery (originally published in 1959). New York: Harper & Row.

Popper, K. R. (1968b). Conjectures and refutations (originally published in 1962). New York: Harper & Row.

Presby, S. (1978). Overly broad categories obscure important differences between therapies. American Psychologist, 33, 514-515.

Rachman, S., & Wilson, G. T. (1980). The effects of psychological therapy. Oxford: Pergamon.

Rosenthal, R. (1983). Assessing the statistical and social importance of the effects of psychotherapy. Journal of Consulting and Clinical Psychology, 51, 4-13.

Rosenthal, R. (1984). Meta-analytic procedures for social research. Beverly Hills, CA: Sage.

Rosenthal, R., & Rubin, D. B. (1979). A note on percent variance explained as a measure of the importance of effects. Journal of Applied Social Psychology, 9, 395-6.

Rosenthal, R., & Rubin, D. B. (1982). A simple, general purpose display of magnitude of experimental effect. Journal of Educational Psychology, 74, 166-9.

Rosnow, R. L., & Rosenthal, R. (1989). Statistical procedures and the justification of knowledge in psychological science. American Psychologist, 44, 1276-84.

Rozeboom, W. W. (1960). The fallacy of the null-hypothesis significance-test. Psychological Bulletin, 57, 416-28.

Savin, H. B., & Perchonock, E. (1965). Grammatical structure and the immediate recall of English sentences. Journal of Verbal Learning and Verbal Behavior, 4, 348-53.

Schmidt, F. L. (1992). What do data really mean? Research findings, meta-analysis, and cumulative knowledge in psychology. American Psychologist, 47, 1173-81.

Schmidt, F. L. (1996). Statistical significance testing and cumulative knowledge in psychology: Implications for the training of researchers. Psychological Methods, 1, 115-29.

Schneider, W., & Shiffrin, R. M. (1977). Controlled and automatic human information processing: I. Detection, search, and attention. Psychological Review, 84, 1-66.

Siegel, S. (1956). Non-parametric statistics for the behavioral sciences. New York: McGraw-Hill.

Sohn, D. (1980). Critique of Cooper's meta-analytic assessment of the findings of sex differences in conformity behavior. Journal of Personality and Social Psychology, 39, 1215-21.

Sperling, G. (1960). The information available in brief visual presentations. Psychological Monographs, 74 (11, Whole No. 498).

Thompson, B. (1996). AERA editorial policies regarding statistical significance testing: Three suggested reforms. Educational Researcher, 25, 26-30.

Tukey, J. W. (1960). Conclusions vs. decisions. Technometrics, 2, 1-11.

Wilson, G. T., & Rachman, S. J. (1983). Meta-analysis and the evaluation of psychotherapy outcome: Limitations and liabilities. Journal of Consulting and Clinical Psychology, 51, 54-64.

Yngve, V. (1960). A model and an hypothesis for language structure. Proceedings of the American Philosophical Society, 104, 444-66.


Statistical power is questioned in this defence of NHSTP for two reasons. First, 'Type II error' is given a different definition in power analysis. Second, it takes two population distributions of raw scores to represent graphically the effect size or statistical power, whereas only one sampling distribution of the test statistic is used in NHSTP. The more telling difficulty is that the determination of the sample size is given too mechanical a treatment in power analysis. The methodological assumptions of the Bayesian approach to empirical research are examined because some criticisms of NHSTP have a heavy Bayesian overtone. The Bayesian approach is questioned because it cannot be used to test explanatory hypotheses.

Recall that cognitive psychologists design and conduct their experiments in ways that satisfy the formal requirement of an inductive principle (see Table 5 and the discussion in the 'Induction, Experimental Design and Controls' section).