This is the abstract of a book that will be accorded multiple book review in Behavioral and Brain Sciences (Copyright 1997: Cambridge University Press) and is shortly to be circulated for Multiple Peer Review. This preprint of the Precis is for inspection only, to help prospective reviewers decide whether or not they wish to review the book. Please do not prepare a review unless you have received the invitation, instructions and deadline information. It would be helpful if you let us know whether you already have the book or would require a copy.
For information on becoming a reviewer or commentator on this or other BBS target articles, write to: bbs@soton.ac.uk
For information about subscribing or purchasing offprints of the published version, with commentaries and author's response, write to: journals_subscriptions@cup.org (North America) or journals_marketing@cup.cam.ac.uk (All other countries).


Précis of "STATISTICAL SIGNIFICANCE: RATIONALE, VALIDITY AND UTILITY" London: Sage 1996

Siu L. Chow

Department of Psychology
University of Regina
Regina
Saskatchewan
CANADA S4S 0A2
chowsl@leroy.cc.uregina.ca

Keywords

Bayes' rule; conditional probability; confidence interval; deduction; effect size; experimental design; hypothesis testing; induction; likelihood ratio; power analysis; statistical inference

Abstract

The null-hypothesis significance-test procedure (NHSTP) is defended in the context of the theory-corroboration experiment, as well as the following contrasts: (a) substantive hypotheses versus statistical hypotheses, (b) theory corroboration versus statistical hypothesis testing, (c) theoretical inference versus statistical decision, (d) experiments versus nonexperimental studies, and (e) theory corroboration versus treatment assessment. The null hypothesis can be true because it is the hypothesis that errors are randomly distributed in data. Moreover, the null hypothesis is never used as a categorical proposition. Statistical significance means only that chance influences can be excluded as an explanation of data; it does not identify the nonchance factor responsible. The experimental conclusion is drawn with the inductive principle underlying the experimental design. A chain of deductive arguments gives rise to the theoretical conclusion via the experimental conclusion. The anomalous relationship between statistical significance and the effect size often used to criticize NHSTP is more apparent than real. The absolute size of the effect is not an index of evidential support for the substantive hypothesis. Nor is the effect size, by itself, informative as to the practical importance of the research result. Being a conditional probability, statistical power cannot be the a priori probability of statistical significance. The validity of statistical power is debatable because statistical significance is determined with a single sampling distribution of the test statistic based on H0, whereas it takes two distributions to represent statistical power or effect size. Sample size should not be determined in the mechanical manner envisaged in power analysis. It is inappropriate to criticize NHSTP for nonstatistical reasons. 
At the same time, neither effect size nor confidence interval estimate nor posterior probability can be used to exclude chance as an explanation of data. Nor can any of them fulfill the nonstatistical functions expected of them by critics.


Précis of 'Statistical Significance: Rationale, Validity and Utility'

This précis of Statistical Significance: Rationale, Validity and Utility (Chow, 1996) begins with a description of the main themes of its eight chapters. As criticisms of the null-hypothesis significance-test procedure (NHSTP) are answered in the context of the theory-corroboration experiment, the rationale of theory corroboration and the logical foundation of experimentation are described after a description of NHSTP itself. It is argued that NHSTP can (and should) be defended when some conceptual or metatheoretical distinctions are made. 'Theory' and 'hypothesis' will be used interchangeably in subsequent discussion even though the former has a more grandiose connotation.

To begin with, as the statistical hypothesis is not the substantive hypothesis (Meehl, 1978), to corroborate a substantive hypothesis is more than testing a statistical hypothesis. Similarly, drawing a theoretical conclusion is more than deciding whether or not the result is statistically significant (Tukey, 1960). It further follows that research data and conclusions are not (and should not be) accepted or rejected on the mere basis of statistical significance. Some criticisms of NHSTP seem persuasive when these distinctions are not made. Other criticisms of NHSTP are based on criteria imported from domains outside statistics. A case will be made that the dissatisfaction with NHSTP stems from attempts to use it to fulfill functions that belong to the theory-corroboration or treatment-assessment process. The alternative numerical indices (viz., effect size, confidence interval estimate, and statistical power) proposed by critics of NHSTP (henceforth referred to as critics) cannot fulfill these nonstatistical functions.

1. An Overview of 'Statistical Significance'

'Statistical Significance' begins by recounting the commonly known criticisms of NHSTP in Chapter 1. Also described is the methodological paradox that psychologists may inadvertently find support for weaker theories when they improve their research methods (Meehl, 1967). The basic structure and rationale of NHSTP is illustrated with a completely randomized 1-factor, 2-level quasi-experiment in Chapter 2. It is shown that the null hypothesis can be true, particularly in experimental studies with manipulated variables. Also defended is the hybrid nature of NHSTP.

To distinguish between a substantive and a statistical hypothesis, the quartet of hypotheses associated with the to-be-studied phenomenon in the theory-corroboration experiment is introduced in Chapter 3. It is shown that the null hypothesis appears twice in NHSTP, once as the consequent and once as the antecedent of two conditional propositions. That statistical hypothesis testing is not theory corroboration is seen from the role statistical significance plays in the chain of deductive reasoning discussed in Chapter 4. The outcome of NHSTP is to supply the minor premise for the innermost of the series of three embedding conditional syllogisms.

Two meanings of 'effect' are identified in Chapter 5. The anomalous relationship between statistical significance and effect size is more apparent than real because, in terms of the technical meaning of 'effect,' the effect size is not indicative of the amount of evidential support for the substantive hypothesis offered by data. Nor is the effect size, by itself, informative about the practical importance of the research result. Some conceptual difficulties with power analysis are identified in Chapter 6. Being a conditional probability, statistical power cannot be the a priori probability of obtaining statistical significance. Some of the issues raised by power analysts are concerns about the stability of the data. It is argued that the stability issue is neither a numerical nor a mechanical one.

The methodological assumptions underlying Bayesian statistics are considered in Chapter 7. The applicability of the Bayesian approach is questioned because the prototype of empirical research congenial to the Bayesian argument is not typical of psychological research, particularly the theory-corroboration kind. Experimental data can be defended in a relativistic milieu. The main arguments in defense of NHSTP are summarized in Chapter 8 with reference to a set of questions suggested by criticisms of NHSTP.

2. Criticisms of NHSTP

NHSTP has been criticized since the 1960s (Morrison & Henkel, 1970). The same litany of criticisms of NHSTP is repeated periodically by various critics, as noted recently by Thompson (1996). Some of the commonly known difficulties of relying on NHSTP are that (a) statistical significance may be due to the fortuitous choice of the sample size or the α level, (b) the null hypothesis is never true, (c) nothing can be learned from statistical significance about the inverse probability of the hypothesis (i.e., the probability that the hypothesis is true, given the data), (d) the binary nature of NHSTP is antithetical to the fact that knowledge advances in an incremental manner, (e) statistical significance is not informative about the values of parameters, (f) the Type II error is unjustifiably neglected, and (g) nothing about the practical impact of the research result can be learned from its statistical significance.

Critics find it puzzling that psychologists persist in using NHSTP. In their view, this state of affairs indicates that NHSTP users suffer from distorted statistical intuitions and conceptual confusion (Gigerenzer, 1993). However, the resiliency of NHSTP is warranted. It can be shown that the criticisms of NHSTP are debatable. The frame of reference used in the present defence of NHSTP is suggested by Meehl (1990) and Cohen (1994), but they restrict their criticisms of NHSTP to non-experimental studies. Meehl (1967) adds that his criticisms are more applicable to experiments using subject variables (e.g., sex, race, educational level, etc.) than to those using manipulated variables (e.g., stimulus duration, method of training, etc.). These caveats raise two interesting questions:

Why should NHSTP be more problematic in the case of subject-variable experiments than manipulated-variable experiments? [Q1]

What renders NHSTP more satisfactory in an experiment than a non-experiment? [Q2]

Questions [Q1] and [Q2] suggest that many of the criticisms of NHSTP are not statistical in nature. The real issue is whether or not the research result is brought about by procedural artifacts or confounding variables. That is, criticisms of NHSTP are actually concerns about inductive conclusion validity (see Campbell & Stanley, 1963; Chow, 1992; Cook & Campbell, 1979).

3. The Quartet of Hypotheses Underlying the Theory-corroboration Experiment

In view of Questions [Q1] and [Q2], it may be instructive to reconsider the criticisms of NHSTP in the context of the theory-corroboration experiment. Moreover, some hitherto neglected distinctions may be seen more readily when such a frame of reference is adopted. For such an end, consider first the quartet of hypotheses implicated in the theory-corroboration experiment with reference to Table 1. (Ignore the entries in italics for the moment, i.e., Propositions [P1.1'] through [P1.5'].)

Table 1. The logical relations among the to-be-explained phenomenon, theory, research hypothesis, experimental hypothesis and statistical hypotheses (alternative and null) in a theory-corroboration experiment

Level of Discourse | What Is Said at the Level Concerned
To-be-explained phenomenon | The linguistic competence of native speakers of English
Substantive Hypothesis | The linguistic competence of native speakers of English is an analog of the transformational grammar. [P1.1]
Complement of Theory | The linguistic competence of a native speaker of English is not an analog of the transformational grammar. [P1.1']
Research Hypothesis | If [P1.1], then it is more difficult to process negative sentences than kernel sentences. [P1.2]
Complement of Research Hypothesis | If not [P1.1], then there is no difference in difficulty processing negative and kernel sentences. [P1.2']
Experimental Hypothesis | If the consequent of [P1.2], then it is more difficult to remember extra words after a negative sentence than a kernel sentence. [P1.3]
Complement of Experimental Hypothesis | If not the consequent of [P1.2], then it is equally difficult to remember extra words after a negative and a kernel sentence. [P1.3']
Statistical Alternative Hypothesis | If the consequent of [P1.3], then H1.* [P1.4]
Statistical Null Hypothesis | If not the consequent of [P1.3], then H0.* [P1.4']
Sampling Distribution of H1 | If H1, then the probability associated with a difference between kernel and negative sentences as extreme as 1.729 standard error (t with df = 19) units from an unknown mean difference is not known. [P1.5]
Sampling Distribution of H0 | If H0, then the probability associated with a difference between kernel and negative sentences as extreme as 1.729 standard error (t with df = 19) units from a mean difference of zero is 0.05 in the long run. [P1.5']

*H1 = mean of extra-sentence words recalled after negative sentences < mean of extra-sentence words recalled after kernel sentences.

H0 = mean of extra-sentence words recalled after negative sentences ≥ mean of extra-sentence words recalled after kernel sentences.

Consider the phenomenon of linguistic competence that native speakers of English can understand and generate an infinite number of grammatical utterances. A hypothesis that has been used to explain this phenomenon is Miller's (1962) rendition of Chomsky's (1957) transformational grammar (see [P1.1] in Table 1). This psychological analog of the transformational grammar is a substantive hypothesis, and it is an explanatory theory.

Many theoretical implications follow from the hypothesis that transformational grammar is psychologically real. One such implication is that non-kernel sentences (e.g., negative sentences) are more difficult to process than kernel sentences. Specifically, while the kernel sentence is generated with the phrase-structure rules, a negative sentence requires the additional step of applying a negative transformation to the kernel sentence. The relationship between the substantive hypothesis and the implication in question is represented by [P1.2] in Table 1. The consequent of the conditional proposition, [P1.2], is the research hypothesis. However, in such a form, the research hypothesis is not well-defined enough for experimentation. For example, it is necessary to specify the nature of the processing involved.

The problem of vagueness with [P1.2] is resolved by stipulating (a) a well-defined experimental task in a specific setting, and (b) a dependent variable whose identity is independent of the substantive hypothesis. A simplified version of Savin and Perchonock's (1965) task may be used to illustrate the solution. Subjects are presented with 8 words after being shown either a kernel or a negative sentence on any trial. Suppose further that the repeated-measures design is used. That is, the same subjects receive both types of sentences in the course of the experiment.

The subjects are first to recall the sentence verbatim and then to recall as many of the 8 extra words as possible. In the context of this experimental situation and of the auxiliary assumption that the short-term store has a limited capacity (Miller, 1956), an implication of the consequent of [P1.2] is that it is more difficult to remember extra words after a negative sentence than a kernel sentence. This implication of the research hypothesis is the experimental hypothesis, which appears as the consequent of [P1.3] in Table 1.

As the experimental hypothesis is not amenable to statistical analysis in its present form, it is necessary to derive its implication at the statistical level. Specifically, the implication is that the mean of extra-sentence words recalled after negative sentences is smaller than that after kernel sentences. This implication is more commonly known as the statistical alternative hypothesis (H1), and it is the consequent of [P1.4].

Consider the logical complement of H1 in Table 1. It states that the mean of extra-sentence words recalled after negative sentences is equal to or larger than that after kernel sentences (see the consequent of [P1.4'] in Table 1). This logical complement of H1 is the statistical null hypothesis (H0). Given that whatever is true under the 'larger than' component of H0 is subsumed under the 'equal to' component, the 'larger than' component serves no further purpose in the present discussion.

That this appeal to H0 is neither contrived nor arbitrary may be seen from the entries in italics in Table 1. The steps of derivation of [P1.3'] from [P1.1'] are the same as those implicated in deriving [P1.3] from [P1.1]. Hence, [P1.3'] is not contrived if [P1.1'] is not an arbitrary assertion. Being the logical complement of [P1.1], [P1.1'] is not a whimsical statement. In other words, H0 is not as arbitrary as it has been characterized (see, e.g., Fisher, 1959; Rozeboom, 1960; Thompson, 1996).

The null hypothesis has two utilities. First, it is used to specify the sampling distribution of differences required for the test of significance (see [P1.5']). Second, a decision about H1 may be made through making a decision about H0 because these two statistical hypotheses are mutually exclusive and exhaustive (see the 'H0, Data and Chance Influences' discussion in Section 12 for an explication).

In sum, underlying the theory-corroboration experiment is a quartet of hypotheses, namely, the substantive, research, experimental, and statistical alternative hypotheses. It can be seen that neither H0 nor H1 is the substantive, research or experimental hypothesis. Hence, it becomes necessary to distinguish between testing a substantive hypothesis at the conceptual level with empirical data (i.e., theory corroboration) and testing a statistical hypothesis (viz., statistical hypothesis testing). At the same time, it is noted in [P1.5] in Table 1 that H1 cannot be used to specify the to-be-used sampling distribution of differences that underlies the t test because the magnitude of the difference between the means of the kernel and negative sentences is not specified in H1. The complement of H1 (i.e., H0) is used instead (hence, [P1.5'] in Table 1). This invites a closer examination of NHSTP, particularly in view of the generally accepted verdict that H0 is never true.
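The role of H0 in [P1.5'] can be illustrated with a small simulation. The following is only a sketch and not part of the book: it assumes normally distributed difference scores with a true mean difference of zero and an arbitrary standard deviation of 1, draws many hypothetical samples of 20 subjects, and counts how often the t statistic falls at or below -1.729. The function names, the seed, and the number of replications are illustrative choices. No comparable simulation can be set up from H1 alone, because H1 does not say how large the mean difference is.

```python
import math
import random

CRITICAL_T = -1.729  # one-tailed criterion from Table 1 (df = 19, alpha = .05)


def t_for_differences(diffs):
    """t statistic for a set of difference scores tested against zero."""
    n = len(diffs)
    mean_diff = sum(diffs) / n
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    return mean_diff / math.sqrt(variance / n)


def long_run_rejection_rate(replications=20000, n_subjects=20, seed=0):
    """Proportion of samples drawn under H0 whose t is at or below the criterion.

    Assumption (not from the text): difference scores are normal, mean 0, SD 1.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(replications):
        diffs = [rng.gauss(0.0, 1.0) for _ in range(n_subjects)]
        if t_for_differences(diffs) <= CRITICAL_T:
            hits += 1
    return hits / replications


if __name__ == "__main__":
    rate = long_run_rejection_rate()
    print(f"long-run proportion at or below {CRITICAL_T}: {rate:.3f}")
```

Under these assumptions the proportion should settle close to 0.05 as the number of replications grows, which is the 'in the long run' sense of [P1.5'].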

4. The Null-hypothesis Significance-test Procedure (NHSTP)

A consideration of how theory corroboration differs from statistical hypothesis testing may begin with a brief recounting of the rationale and procedure of NHSTP. Suppose that Savin and Perchonock's (1965) task is used, and the statistical alternative hypothesis is that fewer words are recalled after recalling negative sentences than kernel sentences. H1 and H0 are commonly (but misleadingly) written as follows under such circumstances:

(a) H1: μ_negative < μ_kernel

(b) H0: μ_negative ≥ μ_kernel

Suppose further that the repeated-measures design is used, and there are 20 subjects. This experiment will be referred to as the 'kernel-negative experiment' in subsequent discussion. The α level is set at the usual 0.05. Strictly speaking, the test is whether or not the associated probability, p, of the calculated t is smaller than 0.05. By 'associated probability' is meant 'the probability of [the calculated t] plus the probabilities of all more extreme possible values' under H0 (Siegel, 1956, p. 11). In actual practice, the t (for dependent samples in this example) is calculated and compared to the critical value of t (i.e., -1.729, df = 19, α = .05) for this particular one-tailed test.

This critical value of -1.729 is given by the appropriate t distribution, which is the standardization of the sampling distribution of differences (Siegel, 1956). The binary decision is to choose between 'calculated t ≤ -1.729' and 'calculated t > -1.729.' The outcome of this binary decision determines the choice between the two modus ponens arguments depicted in the two top panels in Table 2. If the calculated t is -1.729 or smaller, the decision is that the result is significant (i.e., the 'not H0' conclusion in the top left panel of Table 2). If the calculated t is larger than the critical value, it is decided that the result is not significant (i.e., the 'H0' conclusion in the top right panel of Table 2).
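The binary decision just described can be sketched in a few lines. This is an illustration only: the recall scores are hypothetical, the function names are not from the book, and the critical value is simply the -1.729 quoted in the text rather than something computed from a t distribution.

```python
import math

CRITICAL_T = -1.729  # one-tailed critical value quoted in the text (df = 19, alpha = .05)


def dependent_t(negative_scores, kernel_scores):
    """t statistic for dependent samples, based on per-subject differences."""
    diffs = [neg - ker for neg, ker in zip(negative_scores, kernel_scores)]
    n = len(diffs)
    mean_diff = sum(diffs) / n
    variance = sum((d - mean_diff) ** 2 for d in diffs) / (n - 1)
    return mean_diff / math.sqrt(variance / n)


def nhstp_decision(calculated_t, criterion=CRITICAL_T):
    """Binary decision: 'not H0' if t is at or below the criterion, else 'H0'."""
    return "not H0" if calculated_t <= criterion else "H0"


if __name__ == "__main__":
    # Hypothetical numbers of extra words recalled (out of 8) by 20 subjects.
    kernel = [5] * 20
    negative = [4, 5] * 10
    t_value = dependent_t(negative, kernel)
    print(round(t_value, 3), nhstp_decision(t_value))
    # The two example values used in Table 2:
    print(nhstp_decision(-2.05), nhstp_decision(-1.56))
```

The decision function says nothing about why the criterion was met; it only supplies the 'not H0' or 'H0' conclusion that the subsequent syllogisms take as a premise.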

Table 2: Two conditional syllogisms (upper panel) and the disjunctive syllogism (lower panel) implicated in the null-hypothesis significance testing procedure (NHSTP)

Upper Panel

 | Criterion Exceeded | Criterion Not Exceeded
Major Premise | If calculated t ≤ (criterion = -1.729), then not H0 | If calculated t > (criterion = -1.729), then H0
Minor Premise | t ≤ (criterion = -1.729) [e.g., calculated t = -2.05] | t > (criterion = -1.729) [e.g., calculated t = -1.56]
Conclusion | Not H0 | H0

Lower Panel

 | Statistical Significance Obtained
Major Premise | H1 or H0
Minor Premise | Not H0
Conclusion | Therefore, H1.

It is assumed that H1 and H0 are mutually exclusive and exhaustive (see the 'H0, Data and Chance Influences' discussion in Section 12). Hence, denying H0 leads to accepting H1 by virtue of the disjunctive syllogism depicted in the lower panel of Table 2. The experimental conclusion drawn from a statistically significant result is that fewer words are recalled after recalling negative sentences than kernel sentences.

Of interest is the fact that the experimental conclusion is about the relationship between two variables (viz., sentence type and number of extra words recalled). However, theoretical conclusions go beyond a mere functional relationship between the independent and dependent variables. The theoretical interest is what the nature of the linguistic competence is. This more sophisticated meaning of research data at the theoretical level is not informed by the NHSTP exercise depicted in Table 2. This consideration has not featured in the debate about the validity or utility of NHSTP because discussants have in mind a different type of experiment (a point to be discussed in Section 21, the 'Differences Between the Utilitarian and Theory-corroboration Experiments'). To see how the theoretical meaning is extracted from experimental data, it is necessary to consider what constitutes the theory-corroboration process.

5. The Rationale of the Theory-corroboration Experiment

To corroborate the substantive hypothesis experimentally is to show that the experimental data are consistent with the tenability of the substantive hypothesis. That is, there is 'warranted assertibility' (Manicas & Secord, 1983). This idea suggests that a crucial consideration in theory corroboration is the logical relationship between the substantive hypothesis and the evidential data. Such a consideration requires more than a statistical decision. Also implicated is the judicious application of deductive and inductive logic in different stages of the exercise.

6. The Role of Deductive Logic in the Theory-corroboration Experiment

Table 1 shows that H1 is three implicative steps from the substantive hypothesis. At the same time, there is a chain of deductive reasoning leading from experimental data to the substantive hypothesis via H1, the experimental hypothesis and the research hypothesis. This series of deductive reasoning may be seen more readily if the logical relations among the quartet of hypotheses shown in Table 1 are expressed in the form of a series of three embedding conditional syllogisms, as in Table 3.

Table 3. The series of three embedding syllogisms (in normal font, italics, and boldface, respectively) underlying the theory-corroboration procedure when the null hypothesis is rejected

Major Premise 3 | If [P1.1] in Table 1 (note 1), then [P3.1] (note 2). | [MAJ-3.3] (note 7)
Major Premise 2 | If [P3.1], then [P3.2] (note 3). | [MAJ-3.2] (note 6)
Major Premise 1 | If [P3.2], then H1 (note 4). | [MAJ-3.1] (note 5)
Minor Premise 1 | H1 is true. | [MIN-3.1]
Conclusion 1 | Therefore, [P3.2] is true in the interim (by virtue of experimental controls). | [CON-3.1]
Minor Premise 2 | [P3.2] is true in the interim. | [MIN-3.2]
Conclusion 2 | Therefore, [P3.1] is true in the interim (by virtue of experimental controls). | [CON-3.2]
Minor Premise 3 | [P3.1] is true in the interim. | [MIN-3.3]
Conclusion 3 | Therefore, [P1.1] in Table 1 is true in the interim (by virtue of experimental controls). | [CON-3.3]

Notes:

1. [P1.1] in Table 1 = The linguistic competence of a native speaker of English is an analog of the transformational grammar.

2. [P3.1] = It is more difficult to process negative sentences than kernel sentences (i.e., the consequent of [P1.2] in Table 1).

3. [P3.2] = It is more difficult to remember extra words after a negative sentence than a kernel sentence (i.e., the consequent of [P1.3] in Table 1).

4. H1 = mean of extra-sentence words recalled after negative sentences < mean of extra-sentence words recalled after kernel sentences.

5. [MAJ-3.1] is [P1.4] in Table 1.

6. [MAJ-3.2] is [P1.3] in Table 1.

7. [MAJ-3.3] is [P1.2] in Table 1.

The syllogisms in Table 3 are called 'conditional syllogisms' because their major premises are conditional propositions (viz., [MAJ-3.1], [MAJ-3.2] and [MAJ-3.3]). The first (or the innermost) syllogism is made up of [MAJ-3.1], [MIN-3.1] and [CON-3.1]. The second syllogism consists of [MAJ-3.2], [MIN-3.2], and [CON-3.2]. [MAJ-3.3], [MIN-3.3] and [CON-3.3] collectively make up the last syllogism.

The minor premise of the first syllogism (i.e., [MIN-3.1]) is the outcome of NHSTP. The example depicted is one in which the data permit the rejection of H0. To have established statistical significance is to accept that H1 is true. To assert that H1 is true in the first syllogism is to affirm the consequent of the conditional proposition, [MAJ-3.1]. The tentative conclusion is drawn that the antecedent of [MAJ-3.1] is true. This conclusion is used as the minor premise of the second syllogism to affirm the consequent of [MAJ-3.2]. This leads to the tentative conclusion that the antecedent of [MAJ-3.2] is true. Lastly, the conclusion of the second syllogism serves as the minor premise of the third syllogism. The antecedent of [MAJ-3.3] is concluded true tentatively when its consequent is affirmed by the antecedent of [MAJ-3.2].

7. The Modus Tollens and Affirming the Consequent Asymmetry

Note that all three conclusions in Table 3 (i.e., [CON-3.1], [CON-3.2] and [CON-3.3]) are qualified with the caveat, 'in the interim (by virtue of experimental controls).' The 'in the interim' qualification is necessary because there are alternative substantive hypotheses at the conceptual level (see Section 32, the 'Alternative Substantive Hypothesis versus Statistical Alternative Hypothesis,' for an elaboration). The 'by virtue of experimental controls' qualification is necessary because deductive logic does not permit accepting the antecedent of a conditional proposition when its consequent is affirmed (Copi, 1982). Hence, the propriety of accepting the antecedents of [MAJ-3.1], [MAJ-3.2] and [MAJ-3.3] in Table 3 has to be warranted by experimental controls, as discussed in Section 8, 'Induction, Experimental Design and Controls.'

Suppose that the outcome of NHSTP does not permit rejecting H0. The chain of reasoning is shown in Table 4, in which the propositions in Table 3 are given a different set of numbers for identification purposes. For example, [MAJ-3.1] in Table 3 becomes [MAJ-4.1] in Table 4.

Table 4. The series of 3 embedding syllogisms (in roman font, italics, and boldface, respectively) underlying the theory-corroboration procedure when the null hypothesis is not rejected

Major Premise 3 | If [P1.1] in Table 1 (note 1), then [P4.1] (note 2). | [MAJ-4.3] (note 7)
Major Premise 2 | If [P4.1], then [P4.2] (note 3). | [MAJ-4.2] (note 6)
Major Premise 1 | If [P4.2], then H1 (note 4). | [MAJ-4.1] (note 5)
Minor Premise 1 | H1 is not true. | [MIN-4.1]
Conclusion 1 | Therefore, [P4.2] is not true. | [CON-4.1]
Minor Premise 2 | [P4.2] is not true. | [MIN-4.2]
Conclusion 2 | Therefore, [P4.1] is not true. | [CON-4.2]
Minor Premise 3 | [P4.1] is not true. | [MIN-4.3]
Conclusion 3 | Therefore, [P1.1] in Table 1 is not true. | [CON-4.3]

Notes:

1. [P1.1] in Table 1 = The linguistic competence of a native speaker of English is an analog of the transformational grammar.

2. [P4.1] = It is more difficult to process negative sentences than kernel sentences (i.e., the consequent of [P1.2] in Table 1).

3. [P4.2] = It is more difficult to remember extra words after a negative sentence than a kernel sentence (i.e., the consequent of [P1.3] in Table 1).

4. H1 = mean of extra-sentence words recalled after negative sentences < mean of extra-sentence words recalled after kernel sentences.

5. [MAJ-4.1] is [P1.4] in Table 1.

6. [MAJ-4.2] is [P1.3] in Table 1.

7. [MAJ-4.3] is [P1.2] in Table 1.

The minor premise of the first conditional syllogism in Table 4, [MIN-4.1], is 'Not-H1.' Hence, the antecedent of [MAJ-4.1] is rejected by modus tollens. The minor premise of the second syllogism, [MIN-4.2], is, in such an event, the denial of the consequent of [MAJ-4.2]. The modus tollens rule leads to the rejection of the antecedent of [MAJ-4.2]. Hence, [MIN-4.3] is the negation of the antecedent of [MAJ-4.2]. Consequently, [MIN-4.3] is the denial of the consequent of [MAJ-4.3]. The third application of the modus tollens rule leads to the rejection of the antecedent of [MAJ-4.3], namely, [P1.1].

Unlike the case of affirming the consequent, modus tollens (i.e., denying the consequent of a conditional proposition) permits the unambiguous rejection of the antecedent of the conditional proposition. The difference between the arguments in Tables 3 and 4 is the asymmetry between modus tollens refutation and affirming the consequent confirmation of theories identified by Meehl (1967, 1978). It is noted here that the asymmetry is not brought about by using NHSTP. Instead, it is the consequence of the deductive reasoning implicated in corroborating theories. Hence, it is necessary to consider why affirming the consequent of [MAJ-3.1] (i.e., rejecting H0) does not guarantee the truth of its antecedent.
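The asymmetry can be checked mechanically with a truth table. The sketch below is an illustration, not the book's own formalism: it treats each conditional in Tables 3 and 4 as a material conditional over four propositions (theory, research hypothesis, experimental hypothesis, H1), and the helper names are hypothetical. An argument form is counted as valid when no assignment of truth values makes every premise true and the conclusion false.

```python
from itertools import product


def implies(p, q):
    """Material conditional: 'if p then q'."""
    return (not p) or q


def is_valid(premises, conclusion, n_vars):
    """True if no truth assignment makes every premise true and the conclusion false."""
    for values in product([True, False], repeat=n_vars):
        if all(premise(*values) for premise in premises) and not conclusion(*values):
            return False
    return True


# Chain shared by Tables 3 and 4: theory -> research -> experimental -> H1.
chain = [
    lambda t, r, e, h: implies(t, r),  # [MAJ-3.3] / [MAJ-4.3]
    lambda t, r, e, h: implies(r, e),  # [MAJ-3.2] / [MAJ-4.2]
    lambda t, r, e, h: implies(e, h),  # [MAJ-3.1] / [MAJ-4.1]
]

# Table 4: 'not H1' is the minor premise; the conclusion is 'not theory'.
modus_tollens_chain = is_valid(
    chain + [lambda t, r, e, h: not h], lambda t, r, e, h: not t, 4
)

# Table 3: 'H1' is the minor premise; the conclusion is 'theory'.
affirming_consequent_chain = is_valid(
    chain + [lambda t, r, e, h: h], lambda t, r, e, h: t, 4
)

if __name__ == "__main__":
    print("modus tollens chain valid:", modus_tollens_chain)  # True
    print("affirming-the-consequent chain valid:", affirming_consequent_chain)  # False
```

The second check fails because H1 can be true while the theory is false, which is why the conclusions in Table 3 need a warrant from outside deductive logic.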

8. Induction, Experimental Design and Controls

Boring (1954, 1969) and Campbell (1969; Campbell & Stanley, 1963) pointed out that to consider experimental controls was to consider Mill's (1973) methods of scientific inquiry (with the exception of his method of agreement; see Cohen & Nagel, 1934). That is to say, underlying a valid experimental design is one of Mill's (1973) inductive methods (viz., method of difference, joint method of agreement and difference, method of residue, and method of concomitant variations). This may be illustrated with Table 5, in which is depicted the repeated-measures 1-factor, 2-level design used in the kernel-negative experiment described earlier.

Table 5. The inductive basis of the repeated-measures 1-factor, 2-level design (Method of Difference)

Columns C1 to C6 are control variables; columns E1 to En are extraneous variables.

Condition | Independent Variable (Sentence-type) | C1 | C2 | C3 | C4 | C5 | C6 | E1 | E2 | ... | En | Dependent Variable
Control | Kernel Sentence | NI | T | I | R | S | C | ER | IT | ... | M | Number of extra words recalled
Experimental | Negative Sentence | NI | T | I | R | S | C | ER | IT | ... | M | Number of extra words recalled

C1 = Normal intonation (NI)
C2 = Task presentation via recorded tape (T)
C3 = Interval between end of sentence and beginning of words (I)
C4 = Rate of word presentation; 3/4 second per word (R)
C5 = Structure of sentence; 'Animal' subject, present perfect transitive verb (S)
C6 = Fixed categories of words used in 'extra' words (C)
E1 = Extra-curricular reading (ER)
E2 = Individual interests (IT)
En = Kernel and negative sentences randomly mixed (M)

The design of the kernel-negative experiment is described in Table 5 in a way that reflects the inductive principle of Mill's (1973) method of difference (see Chow, 1992). Suppose that fewer words are recalled after negative sentences than after kernel sentences, and that the difference is statistically significant. The control variables (C1, C2, C3, C4, C5 and C6) can be excluded as explanations of the significant difference because each of them (e.g., C1) is represented by the same value (viz., NI) at both levels of the independent variable. This is one of the 'constancy of condition' meanings of the term 'control' (Boring, 1954, 1969).

The extraneous variables (E1, E2, ... En) may also be excluded because each of them (e.g., E1) is assumed to be represented at the same level (viz., ER). This assumption is justified by the fact that the same subject is tested in both the experimental and control conditions. Consequently, the difference between the 'Kernel' and 'Negative' conditions is rendered unambiguous by the fact that the experimental and control conditions are identical in all aspects but one. The only difference is brought about by the difference between the two levels of the independent variable.
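The logic of the method of difference in Table 5 amounts to listing the variables on which the two conditions differ and finding exactly one. The snippet below is a minimal sketch of that bookkeeping; the values come from Table 5, while the dictionary layout, the key names, and the function name are illustrative assumptions.

```python
def differing_variables(condition_a, condition_b):
    """Names of the variables whose values differ between two conditions."""
    return [name for name in condition_a if condition_a[name] != condition_b[name]]


# Values held constant across both conditions, as listed in Table 5.
shared = {
    "C1": "NI", "C2": "T", "C3": "I", "C4": "R", "C5": "S", "C6": "C",
    "E1": "ER", "E2": "IT", "En": "M",
}
control = {"sentence_type": "kernel", **shared}
experimental = {"sentence_type": "negative", **shared}

if __name__ == "__main__":
    print(differing_variables(control, experimental))  # ['sentence_type']
```

When the list contains only the independent variable, a significant difference in the dependent variable has only one candidate explanation among the variables that were recorded.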

9. Conflating NHSTP with Theory Corroboration

NHSTP is misunderstood because no distinction is made between the substantive and statistical hypotheses. Specifically, Meehl (1967) notes that there is a tendency to conflate the substantive hypothesis with the statistical hypothesis. This practice seems to be condoned when it is said, "the critical distinction between a statistical hypothesis and a substantive theory often breaks down. To perform a significance test a substantive theory is not needed at all" (Oakes, 1986, p. 42, emphasis in italics added).

What is said in the italicized sentence is true, but not because the distinction between the substantive and statistical hypotheses is unimportant or not real. It is true simply because testing a hypothesis at the statistical level (see Table 2) and corroborating a substantive hypothesis with empirical data at the conceptual level (viz., Table 3) are radically different exercises. This issue will be dealt with further in the 'Differences Between the Utilitarian and Theory-corroboration Experiments' discussion in Section 21.

10. Answers to Questions [Q1] and [Q2]

It may be concluded from the foregoing argument that, to the extent that all recognized control variables and procedures are included in the experiment, the statistically significant result may be attributed to the independent variable (Campbell, 1969). The experiment is said to have inductive conclusion validity under such circumstances (Chow, 1987a, 1992). For this reason, the propriety of accepting the antecedent of a conditional proposition by affirming its consequent in Table 3 is justified with the 'in the interim' proviso.

The answer to Question [Q1] may be seen readily from Table 6. Suppose that the kernel-negative experiment is conducted to assess the differential linguistic competence of science students from two disciplines. Neither the repeated-measures nor the completely randomized design can be used. Hence, different selected groups of subjects have to be assigned to the two levels of the independent variable, Faculty of Study. While it is possible to maintain the constancy of condition in the case of some control variables, such is not the case with the extraneous variables. An extraneous variable (e.g., E1) may be represented at different levels in the experimental and control conditions (viz., ER and ER', respectively) as a result of some fundamental differences between students of the two disciplines.

Table 6. Violation of the formal requirement of Method of Difference when a subject variable is used


Subject Variable              Control Variables          Extraneous Variables    Dependent Variable
(Faculty of Study)            C1  C2  C3  C4  C5  C6     E1   E2   ...  En

C: Biological Sciences        NI  T   I   R   S   C      ER   IT'  ...  M''      Number of extra words recalled
E: Physical Sciences          NI  T   I   R   S   C      ER'  IT   ...  M'       Number of extra words recalled

C1 = Normal intonation (NI)
C2 = Task presentation via recorded tape (T)
C3 = Interval between end of sentence and beginning of words (I)
C4 = Rate of word presentation; 3/4 second per word (R)
C5 = Structure of sentence; 'Animal' subject, present perfect transitive verb (S)
C6 = Fixed categories of words used in 'extra' words (C)
E1 = Extra-curricular reading (ER or ER')
E2 = Individual interests (IT or IT')
En = Kernel and negative sentences randomly mixed (M' or M'')

In short, the design of an empirical study is a description of how the data collection conditions are arranged. The empirical study is an experiment if the arrangement of its data collection conditions satisfies the formal requirement of one of Mill's (1973) inductive principles. The formal requirement makes it possible to exclude as explanations those factors that have been incorporated in the design as control variables or procedures. Various aspects of the formal requirement give rise to the three technical meanings of 'control.' They are (a) a valid comparison baseline, (b) constancy of conditions, and (c) provisions for excluding procedural artifacts (Boring, 1954, 1969; Chow, 1987a, 1992). Data interpretation becomes unambiguous to the extent that all recognized alternative interpretations are excluded by the judicious application of experimental controls (Campbell, 1969).

An empirical study is a quasi-experiment when its design satisfies only some parts of the formal requirement. A non-experimental study (e.g., the correlational study) is one in which there is no formal provision for satisfying the formal requirement. Hence, there is no provision for excluding alternative interpretations of the result in non-experimental studies. Given the fact that experimental controls serve to exclude explanations, it can be seen that data from quasi-experimental and non-experimental studies are more ambiguous than experimental data. This is the answer to Question [Q2]. The comparison between Tables 5 and 6 provides the answer to Question [Q1]. These answers to Questions [Q1] and [Q2] lead to the realization that some criticisms of NHSTP are motivated by ambiguities in data interpretation. At the same time, a few criticisms arise because the nature of H0 is misunderstood or misrepresented.

11. The Nature of H0

What is clear from the discussion of Tables 2 and 3 is that whether or not the experimental data support the substantive hypothesis is not determined by NHSTP. Supplying the minor premise for the first syllogism in Table 3 or 4 is the only contribution NHSTP has to theory corroboration. The theoretical meaning of the experimental data is conferred by their logical relation with the experimental, research, and substantive hypotheses. Although statistical significance does not confer any theoretical meaning to data, it does have an important function. Specifically, it provides a rational basis for excluding chance influences as an explanation of data. This important (although limited) role may be seen from a closer examination of the statistical null hypothesis, H0.

12. H0, Data and Chance Influences

One way to paraphrase the antecedent of [P1.3'] in Table 1 is to say that the subjects are indifferent to whether the to-be-remembered sentence is a kernel or a negative sentence. Consequently, under such circumstances, any observed difference between the means of the 'Negative' and the 'Kernel' conditions is the result of chance influences (or errors). That is, actual measurements made during data collection may be affected by unintended non-systematic influences (i.e., errors) of various kinds. Consequently, [P1.3'] in Table 1 may be represented as the conditional proposition, [P7.1], in Table 7. By the same token, [P1.4] in Table 1 may be represented by [P7.2] in Table 7.

Table 7. The statistical null hypothesis (H0) and the statistical alternative hypothesis (H1) as components of conditional propositions

Where in Table 1    Conditional Proposition

[P1.4']             If chance, then H0.                                        [P7.1]
[P1.4]              If not chance, then H1.                                    [P7.2]
                    If H0, then the test statistic is distributed as a         [P7.3]
                    sampling distribution of the difference whose mean
                    difference is zero.

The representation adopted for H0 and H1 in Table 7 serves three functions. First, it highlights the meaning of the null hypothesis. It is a hypothesis about the influence of non-systematic chance factors on data in the form of distributing the unintended influences randomly between the two conditions. Moreover, the errors are normally distributed with a mean of zero in each condition. Consequently, a statistically significant result will be correctly interpreted to mean only that an explanation of the data in terms of chance influences can be excluded with the level of strictness stipulated by the significance level (viz., α).

Second, Table 7 makes explicit the mutually exclusive and exhaustive relationship between H0 and H1. That is, the contrast between H0 and H1 is informed by neither the substantive hypothesis nor the to-be-studied phenomenon. Instead, the contrast is informed by the data-collection procedure. It is a contrast between chance and not chance. That NHSTP is actually mute at the level of the substantive hypothesis may be seen from the fact that, in the event the result is statistically significant, the non-chance factor responsible for the data is not informed by statistical significance.

The third function of the tabular representation of Table 7 is to make explicit the fact that H0 is not used as a categorical proposition. It appears twice; once as the consequent of the conditional proposition [P7.1], and once as the antecedent of the conditional proposition [P7.3]. This state of affairs means that, even if 'H0 is never true' were true, its contribution to the statistical decision process would not be affected because the truth of either [P7.1] or [P7.3] is not determined by the truth value of H0 alone, but by the truth values of both the antecedent and consequent (Copi, 1982). At the same time, it is important to emphasize that H0 can (and should) be true, the common belief to the contrary notwithstanding.

13. 'H0 is never true' Revisited

Consider the antecedent of the conditional proposition, [P1.3'] in Table 1. It says that there is no difference in difficulty in processing negative and kernel sentences. In other words, H0 is a hypothesis about the relationship between two theoretical populations, 'Kernel' and 'Negative' (viz., the hypothesized population of all subjects presented with kernel sentences and that of all subjects presented with negative sentences). In view of the fact that two populations are implicated in H0 (not just one), it is not clear what H0 is about when only one population is acknowledged, as in the statement, 'A null hypothesis is any precise statement about a state of affairs in a population, usually the value of a parameter, frequently zero' (Cohen, 1990, p. 1307, emphasis in italics added).

The assertion, "things get downright ridiculous when H0 is to the effect that the effect size (ES) is 0--that the population mean difference is 0" (Cohen, 1994, p. 1000, emphasis in italics added), is questionable for a different reason. Two theoretical populations are properly recognized in this statement if 'population mean difference' refers to the mean of the sampling distribution of differences. It needs two population distributions to give rise to a sampling distribution of differences. However, it can be shown that it is not ridiculous to have a mean difference of zero for the sampling distribution of differences.

Recall the two theoretical populations, 'Kernel' and 'Negative,' in the kernel-negative experiment. They are procedurally defined populations. Specifically, they are defined in terms of the two levels of the independent variable, Sentence-type. The data-collection situation in experimental psychology can (and should always) be made to ensure that the two procedurally defined populations are identical if the subjects are indeed indifferent to the difference between the two levels of the independent variable. This is effected in different situations by using the repeated-measures design, the matched-pair design, or the completely randomized design.

As an example, consider the repeated-measures design. The two test conditions (viz., presenting kernel sentences and presenting negative sentences) are imposed on the same group of subjects. This group of subjects becomes two hypothetical samples when described in terms of the two respective levels of the independent variable. The two hypothetical samples are identical before being exposed to the experimental manipulation. They remain identical if what is said in the experimental hypothesis is false. Why should the 'Kernel' and 'Negative' populations not have the same mean if the complement of the experimental hypothesis is true? Why is it ridiculous to expect the difference between the 'Kernel' and 'Negative' populations to be zero at the statistical level if the subjects are indifferent to the experimental manipulation? In other words, critics have not taken into account the fact that the null hypothesis is about neither the to-be-studied phenomenon nor some actual substantive populations. The null hypothesis is about the relationship between two or more procedurally defined hypothetical populations.
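The repeated-measures argument can be illustrated with a small simulation. The following is only a sketch under stated assumptions: the function name, the recall scale (mean 10, SD 2), the error SD of 1, and the numbers of subjects and replications are all hypothetical choices made for illustration, not values from the kernel-negative experiment. Each simulated subject is indifferent to sentence type, so the 'Kernel' and 'Negative' scores differ only by chance errors.

```python
import random
import statistics

def simulate_null_differences(n_subjects=20, n_experiments=2000, seed=1):
    """Mean Kernel-minus-Negative difference per simulated experiment
    when subjects are indifferent to the manipulation (hypothetical values)."""
    rng = random.Random(seed)
    mean_diffs = []
    for _ in range(n_experiments):
        diffs = []
        for _ in range(n_subjects):
            ability = rng.gauss(10.0, 2.0)             # the same subject serves in both conditions
            kernel = ability + rng.gauss(0.0, 1.0)     # chance error only
            negative = ability + rng.gauss(0.0, 1.0)   # same level, independent chance error
            diffs.append(kernel - negative)
        mean_diffs.append(statistics.mean(diffs))
    return mean_diffs

mean_diffs = simulate_null_differences()
# Centre of the simulated sampling distribution of differences: close to zero.
print(round(statistics.mean(mean_diffs), 3))
# Its spread: close to sqrt(2 / 20), the standard error of the mean difference.
print(round(statistics.stdev(mean_diffs), 3))
```

Under these assumptions the simulated sampling distribution of differences is centred on zero, which is what H0 stipulates for procedurally defined populations of indifferent subjects.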

It is important to emphasize that the truth of H0 depends on assigning subjects randomly to the experimental and control conditions or using the same subjects in both conditions. This reiteration is necessary in view of a recent attempt to question the assertion, 'H0 is never true,' with the following debatable scenario:

We give a placebo to a control group and [the to-be-tested] drug to the experimental group. We then mix these participants into one group .. (Hagen, 1997, p. 16)

The data collection procedure depicted is unsatisfactory because it does not guarantee that the formal requirement of Mill's (1973) method of difference is met. This example may also be used to make the case that the validity of NHSTP must be assessed in the context of research methods.

In short, H0 can be true. More important, it ought to be true if the data-collection procedure is set up and conducted properly (hence, the importance of Cohen's, 1994, and Meehl's, 1990, caveat identified in Question [Q2]). The assertion, 'H0 is never true,' seems self-evident only when H0 is used as a categorical proposition descriptive of an ill-defined state of affairs. On the contrary, it is actually a statement about how the data are collected, a point also noted by Bakan (1966) and Phillips (1973).

More important, H0 is never used as a categorical proposition. At one level of discourse (viz., [P1.4'] in Table 1), H0 is a description of the data when certain assumptions or conditions are satisfied in the data-collection situation (a point emphasized by Falk & Greenbaum, 1995). At a different level of discourse (i.e., [P1.5'] in Table 1 or [P7.1] in Table 7), H0 is a criterion for rejecting chance influences as an explanation of data. What renders H0 indispensable is that it stipulates the to-be-used sampling distribution of the test statistic required for making the decision about chance influences (see [P7.3] in Table 7 or [P1.5'] in Table 1).

14. The Ambiguity-Anomaly Criticisms of NHSTP

A statistically significant result is considered ambiguous by critics. They also find the relationship between statistical significance and the effect size anomalous. The ambiguity and anomaly stem from the fact that statistical significance may be the fortuitous consequence of having chosen a particular sample size. Consider Studies A and B in Table 8.

Table 8. The putative ambiguity and anomaly of significance tests illustrated with four fictitious studies

Study   uE   uC   Effect size*   Statistical Test (e.g., t) significant?   df

A        6    5   0.1            Yes                                       22
B       25   24   0.1            No                                         8
C       17    8   0.9            No                                         8
D        8    2   0.5            Yes                                       22

* J. Cohen's (1987) d

Although the effect size is the same in both Studies A and B, the result is significant in Study A, but not B. At the same time, the sample size is larger in Study A than in Study B. This is the basis of the sentiment shared among critics that statistical significance is assured if a large enough sample is used (see Thompson, 1996, for a recent expression of this view). By the same token, a result may be non-significant because too small a sample is used. This difficulty may be called the sample size-dependence problem.
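The critics' sample size-dependence argument rests on a simple algebraic relation, which can be sketched as follows. The sketch assumes two independent groups of equal size and an effect size standardized by the pooled standard deviation, in which case t = d * sqrt(n / 2). The function name and the value d = 0.5 are hypothetical and are not taken from the fictitious studies in Table 8.

```python
import math

def t_from_d(d, n_per_group):
    """t for two independent, equal-sized groups, given a pooled-SD effect size d.

    Assumption (not from the text): t = d * sqrt(n / 2), which holds when the
    standard error of the difference is computed from the pooled SD.
    """
    return d * math.sqrt(n_per_group / 2)

# The same hypothetical effect size yields a larger t as the groups grow.
for n in (8, 32, 128, 512):
    print(n, round(t_from_d(0.5, n), 2))
```

The relation shows only why the numerical value of the test statistic grows with the sample size; it is silent on whether the data-collection conditions warrant the interpretation placed on a significant result.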

Study A is significant and Study C is not significant. Yet, the effect size is larger in Study C than in Study A. This is considered an anomaly, and it may be called the incommensurate significance-size problem. This problem suggests to critics that statistical significance is misleading at best, harmful at worst. The harm NHSTP does to research is that it precludes researchers from utilizing more profitably the quantitative information in the data. Specifically, if researchers are satisfied with the NHSTP result, they may neglect to determine the confidence interval estimate of the parameter.

Studies A and D in Table 8 jointly show that the incommensurate significance-size problem may assume the form of the magnitude-insensitivity problem. Their results are significant. However, the effect in Study D is larger than that in Study A. This useful information is not put to good use. The same point may be illustrated with Studies B and C. Although their results are not significant, the effect is larger in Study C than B. Again, the magnitude of the effect should be used (e.g., in meta-analysis; Glass, McGaw, & Smith, 1981; Schmidt, 1996).

A closer examination of the following issues shows that these criticisms themselves are debatable. First, the ambiguity is a conceptual or methodological problem, not a quantitative issue. Second, the effect size and NHSTP express the difference between the means of the experimental and control groups at different levels of abstraction. Third, parameter estimation is not theory corroboration. Fourth, non-statistical concerns cannot be addressed with statistical indices. Fifth, the validity of meta-analysis, as a theory-corroboration tool, can be questioned.

15. The Sample Size-Significance Dependence Problem Revisited

A persistent theme found in criticisms of NHSTP is that the fortuitous choice of the sample size (e.g., an unjustifiably large sample) may be responsible for a statistically significant result. However, Questions [Q1] and [Q2] suggest that the issue may have nothing to do with the sample size at all. The real concern may be questions about the internal validity of the research (Campbell & Stanley, 1963; Cook & Campbell, 1979). Be that as it may, that statistical significance may be questioned suggests that there are good reasons why affirming the consequent of [MAJ-3.1], [MAJ-3.2] or [MAJ-3.3] in Table 3 does not guarantee the truth of its antecedent. This may be seen more readily from the following non-experimental study.

Suppose that the effect of institutional constraints on a rehabilitation programme is assessed with a correlational study. It is found that the efficacy of the rehabilitation programme varies inversely with the number of institutional constraints. What does it mean to dismiss the study for the simple reason that the sample size is unusually large (e.g., n = 1,000)?

Note that to question the statistically significant result in this example is to question the conclusion that institutional constraints are really related to the failure of the rehabilitation programme. That is, this is a question about data interpretation (a conceptual concern), not about the numerical value of the test statistic or the sample size. Hence, it is necessary to consider the 'fortuitous sample size' argument more closely, not in quantitative terms, but in qualitative terms. That is, the issue is why it is more likely to introduce confounding variables when more participants are included in the correlational study.

To increase the sample size is to recruit more participants in the correlational study. Chances are that the participants would have to be recruited from more diverse settings. Consequently, not only does the chance of having a confounding variable increase, it also becomes more difficult to identify the confounding variable. The result (be it statistically significant or non-significant) becomes more ambiguous regarding the relationship between institutional constraints and the efficacy of the rehabilitation treatment of interest. More important, it would not be valid to apply the chain of reasoning depicted in Table 3 under such circumstances. As may be recalled from Table 5, the situation is very different in the case of the experiment because of experimental controls.

Why is the sample size-significance dependence problem not seen by critics as a concern about the internal validity of the research? The real source of the ambiguity is obscured by the suggestion that statistical significance may be manipulated by cynical researchers. Specifically, it is intimated by some critics that cynical researchers use excessively large samples if their interests are vested in a statistically significant result, but small samples if their vested interests are served by a non-significant result. However, that a tool may be misused speaks ill only of its users. It does not mean that the tool itself is unsatisfactory, particularly when nothing inherent in the tool invites its being misused.

It should be possible to dismiss the cynicism issue as irrelevant were there not the impression that psychologists accept (or do not accept) a research conclusion on the sole basis of statistical significance (or a non-significant result). The impression is misleading. For example, cognitive psychologists do not accept or reject a finding on the mere basis of statistical significance or non-significance (see, e.g., Coltheart's, 1980 or Haber's, 1983, discussion of the iconic store). Cognitive psychologists examine assiduously whether or not (a) a proper experimental design has been used in the experiment, (b) subjects have been given sufficient training, (c) all recognizable control variables or procedures are properly instituted, and (d) the correct statistical procedure is used.

In short, experimental psychologists are very meticulous about the internal validity of experiments (viz., both the inductive conclusion validity and statistical conclusion validity). They are aware that a statistically significant result may be ambiguous at the conceptual level as a result of various features found in the data-collection procedure or situation. In actual fact, experimental psychologists are so conscientious about the inductive conclusion validity issues that their attempts to eliminate conceptual or methodological ambiguities have recently been dismissed as 'methodolatry' (Danziger, 1990) or 'scientific rhetoric' (Gergen, 1991).

The realization that the ambiguity issue has nothing to do with NHSTP obviously has important implications for how to reduce ambiguity. For example, the ambiguity cannot be reduced by testing more subjects or by analyzing parts of the data (as envisaged in Hunter & Schmidt's, 1990, psychometric meta-analysis). Nor can another numerical index be used to disambiguate the statistically significant result (be it the effect size or statistical power). It is instructive to recall the following observation:

The sum total of the reasons which will weigh with the investigator in accepting or rejecting the [substantive] hypothesis can very rarely be expressed in numerical terms. All that is possible for him is to balance the results of a mathematical summary, formed upon certain assumptions, against other less precise impressions based upon a priori or a posteriori considerations. (Neyman & Pearson, 1928, p. 176; emphasis in boldface and explication in square brackets added) [Quote 1]

Two obvious examples of Neyman and Pearson's (1928) numerical terms are statistical power and the effect size. An example of the a priori considerations is the choice between the repeated-measures and completely randomized designs. The consideration as to whether or not there is any confounding variable after the completion of the experiment is an example of the a posteriori considerations in question.

16. Two Levels of Abstraction - Statistical Significance and Effect Size

An assumption must be made explicit before one can assess whether or not Studies A and C in Table 8 suggest that statistical significance is anomalously related to the effect size. Specifically, it is necessary to assume that statements about statistical significance and the effect size are at the same level of abstraction. A look at how t and the effect size are respectively defined in Equations [Eq. 1] and [Eq. 2] suggests otherwise.

[a] t = {(Mean 1 - Mean 2) - (u1 - u2)}/standard error of differences [Eq. 1]

[b] d = (Mean 1 - Mean 2)/standard deviation of Group 1 [Eq. 2]

The (u1 - u2) component of the numerator of Equation [Eq. 1] is zero if the implication of chance influences is that u1 = u2 (Kirk, 1984). Consequently, the numerator is the same in both equations, namely, the difference between the two sample means. On the one hand, the denominator in [Eq. 1] is the standard error of differences. It is a property of a theoretical distribution, namely, the sampling distribution of differences. This distribution is at a level more abstract than the population of raw scores. The denominator in [Eq. 2], on the other hand, is the standard deviation of one of the two conditions in [Eq. 2]. This is a property of the population of raw scores. It follows that the test statistic used in NHSTP and the effect size are indices belonging to two different levels of abstraction. It seems neither valid nor appropriate to say that the relationship between statistical significance and the effect size is anomalous under such circumstances. This issue of mixing two levels of abstraction will surface again in the discussion of power analysis.
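The contrast between the two denominators can be made concrete with a short computation. This is a sketch under stated assumptions: the scores are invented for illustration, the two groups are treated as independent, and the standard error of differences is computed from the pooled variance, a choice that [Eq. 1] itself does not specify. The function name is hypothetical.

```python
import math
import statistics

def t_and_d(group1, group2):
    """Return (t, d) following [Eq. 1] with (u1 - u2) = 0 and [Eq. 2]."""
    m1, m2 = statistics.mean(group1), statistics.mean(group2)
    n1, n2 = len(group1), len(group2)
    v1, v2 = statistics.variance(group1), statistics.variance(group2)
    # Assumed pooled-variance standard error of the difference between means
    # (a property of the sampling distribution of differences).
    pooled_var = ((n1 - 1) * v1 + (n2 - 1) * v2) / (n1 + n2 - 2)
    se_diff = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    t = (m1 - m2) / se_diff
    # Standard deviation of Group 1 (a property of the raw scores).
    d = (m1 - m2) / math.sqrt(v1)
    return t, d

small_1 = [6.0, 8.0, 10.0, 12.0, 14.0]   # invented scores
small_2 = [4.0, 6.0, 8.0, 10.0, 12.0]    # invented scores
t_small, d_small = t_and_d(small_1, small_2)

# The same scores observed twice: identical means, twice the sample size.
t_large, d_large = t_and_d(small_1 * 2, small_2 * 2)

print(round(t_small, 3), round(d_small, 3))
print(round(t_large, 3), round(d_large, 3))
```

With these invented scores the numerator is the same in both indices, but only the index whose denominator is the standard error of differences responds appreciably to the sample size.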

17. Effect Size, the Binary NHSTP Decision and Evidential Support

Two points are emphasized in the anomaly critiques of NHSTP. First, the NHSTP result is a binary decision (i.e., significant versus non-significant). Second, the effect size is a continuous variable. However, the propriety of juxtaposing statistical significance and the effect size may also be questioned for the following reasons. First, these criticisms are made with the assumption that H1 is the substantive hypothesis. However, critics have not taken into account the facts that H1 is the complement of H0, and that H0 is a hypothesis about chance influences on data. In other words, H1 is neither the substantive nor the experimental hypothesis. It is but a statement to the effect that chance influences may be ruled out as an explanation of data. Consequently, to say that the result is statistically significant is to say something about the data and their collection. Statistical significance does not say anything about the substantive hypothesis.

The second reservation about critics' juxtaposing statistical significance and the effect size is a meta-theoretical one. To suggest supplementing statistical significance with the effect size in the theory-corroboration experiment is to say that the effect size has something to contribute to the evidential support for the substantive hypothesis. In view of the argument that the warranted assertibility offered by experimental data is conferred by the implicative relations among the quartet of hypotheses (see Table 3) and the inductive principle underlying the experimental design (see Table 5), not by statistics, the putative importance of the effect size can be discounted. The effect size has no role in either the deductive or the inductive reasoning depicted in Tables 3 and 4. It follows that a larger effect size does not mean a greater support for the substantive hypothesis (see also Chow, 1988). At the same time, the binary NHSTP suffices to provide the minor premise for the first conditional syllogism depicted in Table 3.

18. Effect Size and Practical Importance

Something seems amiss to critics when nothing can be learned about the practical impact of the statistically significant research result. It is suggested that this shortcoming is the result of relying on NHSTP. Moreover, it can be rectified by reporting the effect size, particularly when the binomial effect-size display (BESD) is used (Rosenthal & Rubin, 1979, 1982). This may be called the 'effect informs impact' claim. Of interest are (a) the fact that the argument in support of the claim is incomplete, and (b) the reason why the claim intrudes into the assessment of NHSTP. This discussion will make understandable the unwarranted practice of conflating statistical hypothesis testing with theory corroboration.

19. The 'Effect Informs Impact' Claim Revisited

There is a conceptual gap in the 'effect informs impact' claim. Consider the correlation coefficient, r, between medication (aspirin versus placebo) and myocardial infarction, MI (absence or presence) in Rosnow and Rosenthal's (1989) illustration. The r is used as an index of the effect size. What BESD does effectively is to convert the Pearson r = .034 into the 'change in success rate' in the form of a percentage, where by 'success' is meant the absence of MI in the illustration. The 'success rates' for the aspirin and placebo conditions are given, respectively, by Equations [Eq. 3] and [Eq. 4] as follows:

[a] The success rate for the Aspirin Condition: .5 + r/2 [Eq. 3]

[b] The success rate for the Placebo Condition: .5 - r/2 [Eq. 4]

The change in success rate is simply the difference between [a] and [b]. It turned out to be 3.4%. The conclusion is drawn that the implications of an effect of this magnitude are 'far from unimpressive' (Rosnow & Rosenthal, 1989, p. 1279), despite the fact that an r = 0.034 is statistically non-significant.
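The arithmetic behind the BESD conversion in [Eq. 3] and [Eq. 4] can be sketched in a few lines. The function name is hypothetical; the value r = .034 is the one quoted from the aspirin illustration.

```python
def besd_success_rates(r):
    """Return the BESD 'success rates' for the treatment and control conditions."""
    treatment = 0.5 + r / 2   # [Eq. 3]
    control = 0.5 - r / 2     # [Eq. 4]
    return treatment, control

aspirin, placebo = besd_success_rates(0.034)
change = aspirin - placebo    # algebraically equal to r itself
print(f"{aspirin:.3f} {placebo:.3f} {change:.1%}")   # prints "0.517 0.483 3.4%"
```

The 'change in success rate' is thus nothing more than r expressed as a percentage, so the display adds no information beyond the correlation coefficient itself.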

The BESD is justified on the grounds that it is 'intuitively appealing ... [and] easily understood by researchers, students, and lay persons' (Rosenthal, 1983, p. 11). The difficulty is that the validity of this justification itself is by no means self-evident. It is simply not clear why the said rate of 3.4% is impressive. Would the same rate of change be impressive if the research is about the attitude change of some obscure film critics? Would it be more impressive if the film critics are prominent ones? It seems that, in the 'Aspirin-MI' example, a change in success rate of 3.4% owes its impressiveness to the nature of the to-be-monitored phenomenon (viz., incidents of MI), not to the magnitude of the change itself.

There is also the following question. To whom is the effect size impressive? A 3.4% change in the attitudes of film critics may not impress those who are interested in artistic issues. However, it may have a greater impact on film producers when they consider the monetary implications. In other words, impressiveness is in the eye of the beholder, not the size of the effect per se.

In short, by itself, the effect size says nothing about the practical impact of the result. What is required is some criteria that relate the effect size to the judgment about impressiveness or practical impact. These criteria are outside the domain of statistics. Moreover, these criteria are domain-specific. Consequently, the claim that BESD is the general purpose index of practical impact is questionable. At the same time, the propriety of criticizing NHSTP in terms of practical validity may also be questioned because statistics and practical impact belong to different domains.

20. The Intrusion of Non-statistical Issues

The kernel-negative experiment used to introduce the rationale and procedure of NHSTP is like neither the examples used to introduce NHSTP in statistics textbooks nor those used in criticisms of NHSTP. The commonly used examples are studies used to ascertain the effectiveness of a course of action or treatment (e.g., using a new method to teach statistics). Typically, the new method is applied to one class of students, whereas the traditional method is used in another class of students. The mean performance of the two classes is tested with NHSTP. The only concern is whether or not the new method of teaching produces a better result. This is an issue about treatment assessment. The question as to why the new method produces a better result is often not an issue. Experiments of this type are tokens of the agricultural model experiments (Hogben, 1957; Meehl, 1978; Mook, 1983). Given their pragmatic objective, these experiments may also be characterized as utilitarian experiments. To see why non-statistical issues intrude into the discussion of the role of NHSTP in empirical research, it is necessary to consider the nature of the utilitarian experiment.

21. The Differences Between the Utilitarian and Theory-corroboration Experiments

It may be recalled from Table 1 that experimental data in the theory-corroboration experiment are at increasing deductive distances from the experimental, research and substantive hypotheses. As may be seen from Table 9, the same is not true of the utilitarian experiment for the following reason. Given the specificity of the objective, the choice of the independent and dependent variables in the utilitarian experiment is restricted by the research objective itself. This, in turn, determines the experimental and research hypotheses. Consequently, the statistical and substantive hypotheses are indistinguishable.

Table 9. The logical relations among the to-be-investigated phenomenon, pragmatic, research, and experimental hypotheses of the utilitarian experiment


| | What Is Said At The Level Concerned | |
|---|---|---|
| To-be-investigated phenomenon | A dissatisfaction with students' current understanding of statistics. | |
| Substantive (pragmatic) hypothesis | Method E is more effective than Method C. | [P9.1] |
| Research hypothesis | If [P9.1], then Method E produces better understanding than Method C. | [P9.2] |
| Experimental hypothesis | If the consequent of [P9.2], then students taught with Method E have higher scores than those taught with Method C. | [P9.3] |
| 'Statistical alternative hypothesis' | If the consequent of [P9.3], then H1.* | [P9.4] |
| Sampling distribution of H1 | If H1, then the probability associated with a difference between Methods E and C as extreme as 1.729 standard error units from an unknown mean difference is not known (assuming df = 19). | [P9.5] |
| Sampling distribution of H0 | If H0,† then the probability associated with a difference between Methods E and C as extreme as 1.729 standard error units from a mean difference of zero is 0.05 in the long run (assuming df = 19). | [P9.5'] |

* H1 = mean of Method E > mean of Method C.
† H0 = mean of Method E ≤ mean of Method C.
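The numerical claim in [P9.5'] can be checked with a short sketch. It is offered only as an illustration: the function names are hypothetical, and the upper-tail area of the t distribution is obtained by simple numerical integration (Simpson's rule) rather than from a statistical table.

```python
import math

def t_pdf(x: float, df: int) -> float:
    """Density of Student's t distribution with df degrees of freedom."""
    coef = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return coef * (1 + x * x / df) ** (-(df + 1) / 2)

def upper_tail(t_crit: float, df: int, steps: int = 2000) -> float:
    """P(T >= t_crit) for t_crit >= 0, using Simpson's rule on [0, t_crit]."""
    if steps % 2:
        steps += 1  # Simpson's rule needs an even number of intervals
    h = t_crit / steps
    total = t_pdf(0.0, df) + t_pdf(t_crit, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    # The distribution is symmetric about 0, so the area above 0 is 0.5.
    return 0.5 - total * h / 3

# One-tailed probability of a difference at least 1.729 standard error
# units above a mean difference of zero, with df = 19.
print(round(upper_tail(1.729, 19), 3))
```

Nothing comparable can be computed for [P9.5], because the mean difference under H1 is not specified.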

Additional differences between the utilitarian and the theory-corroboration experiments are shown in Table 10. These differences may be used to understand, as well as to answer, some of the criticisms of NHSTP. To begin with, it has been noted that the impetus of the utilitarian experiment is primarily, if not exclusively, to find the solution to a practical problem (e.g., students' poor understanding of statistics; see Row 1 in Table 10). That is, the role of a theory is minimal, if there is one at all, in this kind of experiment (hence, the 'atheoretical' characterization in Row 4).

Table 10. Some differences between the agricultural (utilitarian) model and theory-corroboration experiments



| | | Agricultural Model (Utilitarian) | Theory Corroboration |
|---|---|---|---|
| 1 | Impetus | To solve a practical problem; reflexive of data collection | To explain a phenomenon; independent of data collection |
| 2 | Subject Matter | The practical problem involving observable events | Unobservable hypothetical entity and its theoretical properties |
| 3 | Consequence of Research | Take a particular course of action; closure of investigation | Accept tentatively, revise or reject the theory; no closure to the investigation |
| 4 | Role of Theory | Atheoretical | To-be-tested theory explicitly stated; used to guide experimental design |
| 5 | Substantive Question | 'Is the treatment effective?' 'How effective is the treatment?' | 'Why does the phenomenon occur?' |
| 6 | Experimental Hypothesis | The practical question itself | Qualitatively different from the to-be-assessed substantive hypothesis |
| 7 | Experimental Manipulation | The to-be-assessed efficient cause itself | Different from the to-be-explained phenomenon |
| 8 | Dependent Measure | The practical problem itself | Different from the to-be-explained phenomenon |
| 9 | Statistical Significance | To indicate that the explanation of data in terms of chance variations can be ruled out at the α level. | To indicate that the explanation of data in terms of chance variations can be ruled out at the α level. |
| 10 | Effect | Substantive efficacy (i.e., the consequence of an efficient cause) | The difference between the means of two conditions (i.e., the consequence of a formal or a material cause) |
| 11 | Ecological Validity | Necessary | Irrelevant, may even be detrimental |

Suggestive of this difference is the fact that, whereas unobservable hypothetical entities or processes (e.g., the language processor) are the concerns of the theoretical endeavor in the theory-corroboration experiment, the subject matters of utilitarian experiments are observable activities or events (e.g., students' test scores; see Row 2). The result of the utilitarian experiment is used to guide a particular course of action (e.g., whether or not to adopt the new method of teaching; see Row 3). Experimental data in the theory-corroboration experiment, on the other hand, are used to assess whether or not there is evidential support for an explanatory substantive hypothesis (see Row 3). No pragmatic course of action follows. Nor is any practical problem solved as a result of the theory-corroboration experiment.

The experimental manipulation in the utilitarian experiment is the to-be-assessed efficient cause itself (e.g., the new method of teaching versus the traditional teaching method; see Row 7). However, the independent variable used in the theory-corroboration experiment is not an efficient cause. For example, the presentation of a kernel or a negative sentence does not shape or constrain subjects' behaviour in the way a teaching method may shape students' learning. In presenting kernel and negative sentences, the experimenter provides the hypothetical linguistic processor different contexts or environments in which to exhibit its theoretical properties. In other words, the independent variable in the theory-corroboration experiment is either a formal or a material cause, not an efficient cause.

22. 'Effect' - Vernacular and Technical Meanings

The contrast between the independent variable as the efficient cause in the utilitarian experiment versus its being the formal (or material) cause in the theory-corroboration experiment has important implications for how 'effect' or 'effective' is understood in the context of NHSTP. 'Effect' is used in its vernacular sense in the ambiguity-anomaly and the insensitivity to effect size criticisms of NHSTP. This is also the sense assumed in (as well as congenial to) the utilitarian experiment (see Rows 5 and 10 in Table 10). This is understandable in view of the fact that the experimental manipulation itself is substantively efficacious (e.g., methods of teaching). However, this does not mean that it is justified to do so when the independent variable is not an efficient cause (e.g., sentence type). What is important is that it is also not justified even when the experimental manipulation consists of two efficient causes, but for a different reason.

To adopt the vernacular meaning of 'effect' is to use a statistically significant result to do something more than rejecting chance influences as an explanation. It is to assert that the research manipulation is the explanation (see also 'The Specificity of H1 and Related Issues' section below). However, this assumption is justified only to the extent that the inductive conclusion validity is assured. In fact, as has been noted earlier in the 'Sample Size-Significance Dependence Problem Revisited' section, questions about a statistically significant result arise because there are doubts about the inductive conclusion validity. More important, these questions are not statistical ones. Consequently, it is doubtful that specifying the effect size or determining the confidence interval estimate would allay the non-statistical concerns that underlie the reservations about the statistically significant result.

Recall that H0 is a statement about the consequence of chance influences on data collection. 'Effect' at this level of discourse refers to the difference between the means of two data collection conditions. The NHSTP concern is whether or not the difference is large enough for the rejection of the explanation in terms of chance influences. This technical meaning of 'effect' is different from its vernacular meaning. It does not implicate any assumption of efficacy. More important, by itself, NHSTP does not identify the reason for the sufficiently large difference that leads to the 'statistically significant' decision. Nor should there be any reason to expect an answer coming from NHSTP when the issues implicated are nonstatistical ones.

In sum, critics' concern about the effect size may be represented by the questions tabulated in the left-hand column of Table 11 (see Rosnow & Rosenthal, 1989). These questions are asked because 'effective' is interpreted in its vernacular sense. However, Question [PV-2] does not directly lead to [PV-3] or [PV-4]. It is necessary to provide an independent set of criteria outside the domain of statistics to justify asking Question [PV-3] or [PV-4] in conjunction with Question [PV-2] (see Section 19, "The 'Effect Informs Impact' Claim Revisited"). Such a set of criteria is not available.

Table 11. Different sets of research questions pertinent to practical validity (PV) and conceptual rigor (CR) for the utilitarian and theory-corroboration experiments, respectively


| Practical Validity Concerns (Utilitarian Research) | Conceptual Rigor Concerns (Theory-corroboration Research) |
|---|---|
| The independent variable is the efficient cause. | The independent variable is the material or formal cause. |
| [PV-1] Is Treatment T effective? | [CR-1] Is Treatment T effective? |
| [PV-2] How effective is Treatment T? | [CR-2] Is the independent variable a valid choice? |
| [PV-3] How impressive is Treatment T? | [CR-3] Do the data warrant the acceptance of Theory K which underlies the choice of the dependent variable? |
| [PV-4] Is Treatment T important? | [CR-4] Is the implementation of the independent variable valid? |
| | [CR-5] Does the study have hypothesis validity? |

Suppose that the technical meaning of 'effect' is adopted in discussing NHSTP. Although Question [CR-1] is literally the same as [PV-1], it leads to an entirely different set of questions relating to the difference between two data-collection conditions brought about by the experimental manipulation. It may be seen readily that, with the exception of Question [CR-5], these are questions about the data-collection conditions, particularly the inductive principle that underlies the experimental design.

23. Power Analysis

The power of a statistical test has recently become an important consideration in the assessment of empirical studies in psychology. Cohen's (1987) power analytic approach to empirical research has the following themes. First, if Phenomenon P exists, its effects must be detectable. Second, the evidence for the truth of a substantive hypothesis about Phenomenon P is the detectability of the effect envisaged in the hypothesis. Third, the substantive hypothesis is represented by H1 in NHSTP. Fourth, to detect the effect is to obtain statistical significance (i.e., to accept H1 by rejecting H0). Hence, statistical significance is indicative of the truth of H1 or the fact that Phenomenon P exists. These four inter-connecting themes may collectively be identified as the existence-detectability-significance thesis. For this reason, it is important for power analysts to know the a priori probability of obtaining statistical significance. That a priori probability is the power of the statistical test (Cohen, 1987; see also Mosteller & Bush, 1954).

The Type II error is assumed by critics to have real-life consequences. Hence, NHSTP users are faulted for ignoring it as a result of their exclusive obsession with the Type I error. With the advent of power analysis, the Type II error can now be controlled by specifying the level of statistical power desired for the investigation. This is possible because the power of a statistical test is (1 - β), where β is the probability of committing the Type II error. The value of β can be controlled by setting the level of the power.
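How power analysts arrive at (1 - β) may be sketched as follows. This is a minimal illustration under stated assumptions, not a procedure taken from the power-analytic literature: a one-tailed two-sample test is treated with the normal approximation, and the function name, the standardized effect d and the group size are all hypothetical.

```python
from statistics import NormalDist

def power_one_tailed_z(d: float, n_per_group: int, alpha: float = 0.05) -> float:
    """Power of a one-tailed two-sample z test, given a numerically specific H1.

    d is an assumed standardized difference between the two population
    means; n_per_group is the (equal) number of subjects in each group.
    """
    std_normal = NormalDist()
    z_crit = std_normal.inv_cdf(1 - alpha)       # decision axis under H0
    shift = d * (n_per_group / 2) ** 0.5         # mean of the statistic under the specific H1
    beta = std_normal.cdf(z_crit - shift)        # p(not rejecting H0 | that H1 is true)
    return 1 - beta

for d in (0.2, 0.8):
    print(d, round(power_one_tailed_z(d, n_per_group=20), 3))
```

Note that the calculation cannot begin until H1 has been given a specific numerical value, and that the result is conditional on that value being true.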

That power analysis is currently well received is understandable in view of the facts that critics are convinced that NHSTP is problematic and that power analysis is presented as a remedy for the difficulties of ambiguity and anomaly attributed to NHSTP. However, if the criticisms of NHSTP themselves are debatable, it should become easier to consider power analysis in a more judicious way. There are good reasons to question the existence-detectability-significance thesis of power analysis.

Consider its first theme, namely, that if H1 is true, there is a detectable effect. This theme is contrary to the fact that the tenability of some hypotheses depends on not rejecting H0 (i.e., not detecting any effect, in the parlance of power analysis). An example is Schneider & Shiffrin's (1977) study of automatic detection. The third theme of the thesis that H1 is the substantive hypothesis is debatable in view of the quartet of hypotheses identified in Table 1 and the discussion in Section 12, 'H0, Data and Chance Influences.' Consequently, all power analytic assertions based on identifying H1 with the substantive hypothesis are questionable.

The detectability of the effect is equated with statistical significance in the second theme of the existence-detectability-significance thesis. This makes explicit an implicit assumption in power analysis, namely, that NHSTP is not different from the procedure of the theory of signal detection (TSD). An examination of this NHSTP-TSD affinity assumption reveals additional conceptual difficulties in power analysis.

24. The NHSTP-TSD Affinity in Power Analysis

Indicative of the NHSTP-TSD affinity envisaged in power analysis are assertions like "Since effects are appraised against a background of random variation" (Cohen, 1987, p. 13), and "[the said appraisal consists of] detecting a difference between the means of populations A and B ... " (Cohen, 1987, p. 6, emphasis in italics added). At the level of rationale, the appeal is made to Neyman & Pearson's (1928) emphasis on the posterior probability. It is believed that researchers first determine what a sample statistic is (e.g., the sample mean). They then ask (or wish to ask) what the probability is that the sample has been selected from Population P with parameter μ (see Cohen, 1994). An appeal to the a posteriori probability in this "from sample statistic to population parameter" manner is also found in a TSD analysis.

It is recognized in TSD that an observer's response bias is a function of the prior odds (viz., the ratio of the probability of the noise event to that of the signal event) and the payoff matrix (i.e., the costs for committing errors and the gains due to making correct detection). Something very similar is suggested in power analysis. Specifically, it is suggested that the placement of the decision axis used to make the statistical decision should reflect a balance struck between statistical power and α (Cohen, 1987, p. 5). This is achieved by taking into account the ratio of the probability of the Type II error to the probability of the Type I error. Researchers are further urged to pay attention to "... the relationship between n and power for [their] situation, taking into account the increase in cost to achieve a given increase in power ..." (Cohen, 1965, p. 98; Cohen's emphasis in italics).
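The dependence of the TSD criterion on the prior odds and the payoff matrix may be illustrated with the following sketch. It uses the expected-value criterion commonly given in TSD textbooks, which is an assumption here because the formula is not stated in the present discussion; the function and parameter names are hypothetical, and costs are entered as positive magnitudes.

```python
def optimal_likelihood_ratio_criterion(
    p_noise: float,
    p_signal: float,
    value_hit: float,
    cost_miss: float,
    value_correct_rejection: float,
    cost_false_alarm: float,
) -> float:
    """Likelihood-ratio criterion that maximizes expected value in TSD.

    The observer is assumed to say "Yes" when the likelihood ratio of the
    observation (signal relative to noise) exceeds the returned value.
    """
    prior_odds = p_noise / p_signal
    payoff_ratio = (value_correct_rejection + cost_false_alarm) / (value_hit + cost_miss)
    return prior_odds * payoff_ratio

# Equal priors and a symmetric payoff matrix give an unbiased criterion.
print(optimal_likelihood_ratio_criterion(0.5, 0.5, 1, 1, 1, 1))
# A rarer signal or a costlier false alarm pushes the criterion upward.
print(optimal_likelihood_ratio_criterion(0.8, 0.2, 1, 1, 1, 3))
```

Both ingredients of the criterion are nonstatistical: neither the prior odds nor the payoffs can be read off the data.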

25. Issues Raised by the NHSTP-TSD Affinity

A correspondence between two sets of descriptive terms becomes obvious if the affinity between NHSTP and TSD is recognized. Of particular interest is that between statistical power and hit rate. It renders questionable the following assertion,

The power of a statistical test is the probability that it will yield statistically significant results. (Cohen, 1987, p. 1, emphasis in italics added) [Quote 2]

26. Statistical Power - A Conditional Probability

The Type I error is made when the researcher rejects a true H0; this is analogous to committing a false alarm in TSD. Power analysts use [H1 True] as a sub-column heading in the upper left panel of Table 12. The Type II error is committed when the researcher fails to reject H0 when H1 is true. The logical complement of the Type II error (viz., rejecting H0 when H1 is true) in NHSTP is equivalent to a hit in TSD (see the upper right panel). Note that a hit in TSD refers to a "Yes" response contingent on the presence of a signal event. That is, a hit is a characterization of the observer's behavior, given that the signal is present. It says nothing about the signal event per se. It follows that the hit rate in TSD is a conditional probability, namely, the probability of an observer's saying "Yes" when a signal event indeed occurs. In other words, the hit rate says nothing about the exact probability of the presence of a signal event.

Table 12. The correspondence between some concepts (upper left) and their probabilities (lower left) in NHSTP and concepts (upper right) and their probabilities (lower right), given the NHSTP-TSD affinity in power analysis.

Upper Panel (left: NHSTP concepts; right: TSD concepts)

| NHSTP Decision | State of Affairs: H0 True | State of Affairs: H0 False [H1 True] | TSD Response | State of Affairs: Noise | State of Affairs: Signal |
|---|---|---|---|---|---|
| "Not Reject" | Correct acceptance | Type II error | "No" | Correct rejection | Miss |
| "Reject" | Type I error | Correct rejection | "Yes" | False alarm | Hit |

Lower Panel (left: NHSTP concepts; right: TSD concepts)

| NHSTP Decision | State of Affairs: H0 True | State of Affairs: [H1 True] | TSD Response | State of Affairs: Noise | State of Affairs: Signal |
|---|---|---|---|---|---|
| "Not Reject" | p(Correct acceptance) | p(Type II error) = β | "No" | Correct rejection rate | Miss rate |
| "Reject" | p(Type I error) = α | Power = (1 - β) | "Yes" | False alarm rate | Hit rate |

At the same time, as may be seen from the two lower panels of Table 12, the TSD analog of statistical power is the hit rate. Hence, the statistical power is a conditional probability (see also Chow, 1991c). That is to say, knowing the power of a test (a conditional probability) is not knowing the probability of obtaining statistical significance (an exact probability). More important, given the NHSTP-TSD affinity, the statistical power index says something about the researcher, not H1, in much the same way the hit rate says something about the observer, not the signal event. In short, statistical power does not (and cannot) enlighten us as to the probability of obtaining statistical significance.
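The gap between the conditional probability and the exact probability may be made concrete with a small sketch. It is only an illustration under the power-analytic framing questioned here: H0 and one numerically specific H1 are treated as the only possible states of affairs, and p_h1_true is a hypothetical prior probability that neither NHSTP nor power analysis supplies.

```python
def prob_significance(alpha: float, power: float, p_h1_true: float) -> float:
    """Unconditional probability of rejecting H0 (law of total probability).

    alpha is p(reject H0 | H0 true), power is p(reject H0 | H1 true),
    and p_h1_true is an assumed prior probability that H1 is true.
    """
    return alpha * (1 - p_h1_true) + power * p_h1_true

# The same power (0.80) is compatible with very different
# probabilities of obtaining statistical significance.
for prior in (0.1, 0.5, 0.9):
    print(prior, prob_significance(alpha=0.05, power=0.80, p_h1_true=prior))
```

The probability of obtaining statistical significance equals the power only in the limiting case in which H1 is assumed to be true with certainty.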

27. Statistical Power - A Misleading Sense of Efficacy

An efficacious capability is attributed to the statistical procedure in [Quote 2]. It suggests that statistical significance is reached by virtue of the numerical index, statistical power (see the emphasis in italics in [Quote 2]). This assertion is misleading because, at the level of statistics, statistical power simply refers to the cumulative probability over a range of parameter values (viz., all values that are as extreme as the critical values of the test statistic). No efficacy of any sort is implicated at this level of discourse. A nonstatistical theoretical justification is required if an efficacious capability is attributed to statistical power. As no such justification is offered, it is only proper not to attach any extra-statistical meaning to the term, statistical power.

There is no a priori reason why the decision to reject H0 in the event that it is false should not simply be called a Type II correct decision. Power analysis might not have been so readily accepted had a non-evocative term like not-β been used instead of power. Perhaps an excess and unwarranted meaning is attributed to a conditional probability as a result of its being labeled with the evocative term, power, a connotative meaning of which is being efficacious. The same is also true of statistical significance.

28. Graphical Representation of Statistical Power, Effect and NHSTP

It is taken for granted in the discussion so far that the concept, statistical power, is valid. The validity of power analysis becomes more questionable if there are reservations about the validity of statistical power itself. That [Quote 2] is inconsistent with statistical power being a conditional probability is one such reservation. There are additional reservations.

29. Two Levels of Abstraction -- Statistical Significance and Statistical Power

Consider the assertion, 'A salutary effect of power analysis is that it draws one forcibly to consider the magnitude of effects' (Cohen, 1990, p. 1309). This assertion is made because of the functional relationship between statistical power and effect size (given n and α) envisaged in power analysis. This functional relationship is readily seen from Panels A and B of Figure 1. Before proceeding any further, it must be noted that Cohen (1965, 1987, 1992a, 1992b) does not use any graphical representation when he discusses statistical power, effect size or the functional relationship between the two. Nonetheless, Figure 1 is used for ease of exposition. Its use is justified by the fact that it is consistent with how d and statistical power are defined in power analysis.

FIGURES UNAVAILABLE IN THIS VERSION

Figure 1. The graphical representation of two effect sizes (Panels A & B), and the corresponding differences between two means in raw-score units (Panels C & D), as well as in standard error units (Panel E).

The x-axis in both panels represents population scores (as stipulated by how d is defined in [Eq. 2] above). The left and right distributions in either panel represent the control and experimental distributions, respectively. The effect size is represented by the distance between the two distributions, and statistical power is represented by the area shaded with slanting lines. To power analysts, Panels A and B represent two situations in which the desired effect is larger in Panel B than in Panel A, and Panel B represents a more powerful test than Panel A. Of interest is whether or not research manipulations that are expected to be differentially efficacious would have different impact on NHSTP.

30. H0 and Research Manipulation Efficacy

The pair of population distributions in the 'small effect' situation (viz., Panel A in Figure 1) gives rise to the lone sampling distribution of the difference depicted in Panel C of Figure 1. Similarly, the pair of population distributions in the 'large effect' situation brings about another lone sampling distribution of the difference (i.e., the one depicted in Panel D of Figure 1). The two sampling distributions of the difference in Panels C and D have the same standard error of the difference in the present example. However, the two sampling distributions cover different parts of the difference between two means continuum in raw-score units (viz., from -2.5 to 4.5 in Panel C versus from -0.5 to 6.5 in Panel D).

Consider the numerator used in calculating the test statistic, t. It is often written as (Mean 1 - Mean 2). However, it is really a short-hand form for [(Mean 1 - Mean 2) - (μ1 - μ2)], in which (μ1 - μ2) = 0. As has been noted before, the (μ1 - μ2) component is left out when it is numerically equal to 0 (see Kirk, 1984). The distribution in the top panel of Figure 2 represents a sampling distribution of the difference for a situation in which (μ1 - μ2) = 0. That is, the mean difference of the sampling distribution of the difference between two means is zero.
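The full form of the numerator can be made explicit in a short sketch. The pooled-variance, independent-samples form of t is assumed here for illustration only; the function name and the null_difference parameter are hypothetical.

```python
import math
from statistics import mean, variance

def t_independent(sample_1, sample_2, null_difference: float = 0.0) -> float:
    """Pooled-variance t for two independent samples.

    null_difference is the value of the population mean difference
    stipulated by H0; it drops out of the numerator when it is 0.
    """
    n1, n2 = len(sample_1), len(sample_2)
    pooled_var = ((n1 - 1) * variance(sample_1) + (n2 - 1) * variance(sample_2)) / (n1 + n2 - 2)
    se_difference = math.sqrt(pooled_var * (1 / n1 + 1 / n2))
    # Full numerator: the observed difference minus the difference under H0.
    numerator = (mean(sample_1) - mean(sample_2)) - null_difference
    return numerator / se_difference

# Made-up scores, used only to exercise the function.
print(t_independent([3, 4, 5], [1, 2, 3]))
```

With null_difference left at its default of 0, the numerator reduces to the familiar short-hand form.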

FIGURES UNAVAILABLE IN THIS VERSION

Figure 2. The sampling distribution of the difference in raw-score units when the mean difference is 0 (top panel), 1 (middle panel) and 3 (bottom panel)

Power analysts suggest that the desired difference, (Mean 1 - Mean 2), may be 3.0 (or any definite value; e.g., 1), rather than 0. The numerator now becomes [(Mean 1 - Mean 2) - (μ1 - μ2)], in which (μ1 - μ2) = 3.0 or (μ1 - μ2) = 1.0, in such an event. That is, the mean difference of the sampling distribution of the difference implicated in NHSTP is 3.0 (or 1.0), and it is graphically represented in the bottom (or middle) panel of Figure 2. The three sampling distributions in the three panels of Figure 2 have the same standard error of differences, but different values for the mean difference (viz., 0, 1 and 3.0). They represent the sampling distribution under H0 in three different situations. Specifically, the bottom panel represents a research manipulation expected to be more efficacious than the one depicted in the middle or top panel.

Represented on the x-axis of the graphical representation in any panel of Figure 2 is the range of possible values of the difference between two means. In other words, the three panels in Figure 2 collectively show that the difference in the expected efficacy of the research manipulation is represented by the spatial displacement of the sampling distribution of the differences between two means along the continuum of all possible values of the difference between two means. This state of affairs is different from the impression conveyed by Panels A and B in Figure 1.

In carrying out NHSTP, only one sampling distribution is used (viz., the one contingent on H0 being true). Moreover, the researcher uses a standardized form of the sampling distribution depicted in either Panel C or D of Figure 1 (viz., the z or t distribution; see Siegel, 1956). That is, regardless of the mean difference in raw-score units, the standardized representation of the to-be-used sampling distribution of the difference remains the same (viz., Panel E in Figure 1). More important, the location of the decision axis vis-à-vis the mean of the sampling distribution of the difference remains unchanged for the same α level. It follows that the outcome of NHSTP is not affected by the desired effect or expected efficacy of the research manipulation.
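The invariance of the decision axis under standardization may be sketched as follows. The sketch assumes the normal (z) form of the standardized distribution and a hypothetical standard error of 1.0; the mean differences of 0, 1 and 3 are those of the three panels of Figure 2, and the function name is illustrative.

```python
from statistics import NormalDist

def decision_axis(mean_difference: float, se_difference: float, alpha: float = 0.05):
    """Return the one-tailed decision axis in raw-score and in standard error units."""
    z_crit = NormalDist().inv_cdf(1 - alpha)
    raw_cutoff = mean_difference + z_crit * se_difference
    # Standardizing removes the raw-score location of the sampling distribution.
    standardized_cutoff = (raw_cutoff - mean_difference) / se_difference
    return raw_cutoff, standardized_cutoff

for mean_difference in (0.0, 1.0, 3.0):
    raw, standardized = decision_axis(mean_difference, se_difference=1.0)
    print(mean_difference, round(raw, 3), round(standardized, 3))
```

The raw-score cutoff moves with the mean difference, whereas the standardized cutoff stays where it is for a given α level.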

Figure 1 shows that two distributions of population scores converge on one standardized distribution via a lone sampling distribution of the test statistic. Panel A or B in Figure 1 shows that it takes two population distributions to depict statistical power, whereas Panel E shows that only one sampling distribution is used to depict NHSTP. Moreover, two different levels of discourse are implicated in Panels A (or B) and E. This demonstrates that it is impossible to represent statistical power graphically without misrepresenting NHSTP. It casts doubt on the validity of the concept, statistical power.

Some important points may now be summarized. First, no distribution based on H1 is implicated in NHSTP (see Panel E of Figure 1). Second, the mean difference in raw-score units of the sampling distribution of difference reflects the theoretical difference between two population means. When expressed in terms of the raw-score unit, this difference is graphically represented by the spatial displacement of the sampling distribution on the difference between two means continuum (see the three panels in Figure 2).

Third, it is not possible to represent graphically the conditional probability, statistical power, if the rationale of NHSTP is properly represented with a single sampling distribution of the difference between two means. Fourth, the desired effect of the research manipulation (in the technical sense of the word) has no impact on NHSTP because the to-be-employed sampling distribution is standardized (e.g., in the form of the appropriate t distribution) before being used to make the 'chance versus non-chance' decision.

31. The Specificity of H1 and Related Issues

For non-power analysts, 'Type II error' in the upper left panel of Table 12 refers to the error committed when a false H0 is not rejected (i.e., ignore the [H1 True] column heading). No mention is made of H1 in this definition. It may be recalled from the lower panel of Table 2 that H0 and H1 are mutually exclusive and exhaustive. This is emphasized in Table 7 by depicting that H0 is the implication of chance influences, and that H1 is the implication of some ill-defined non-chance influences. It follows that, while H0 and H1 are mutually exclusive and exhaustive alternatives, 'H0 False' is not synonymous with 'H1.'

Defining 'Type II error' in terms of 'H0 False' instead of [H1 True] in the upper left panel of Table 12 helps to maintain the distinction between inductive conclusion validity and statistical conclusion validity. Specifically, while NHSTP is used to decide between chance influences and non-chance influences (see Tables 2 and 7), inductive reasoning is employed to identify the non-chance factor involved (see Table 5). Also important is that H1 is numerically non-specific (see [P1.5] in Table 1).

In order to define power, 'Type II error' is defined in power analysis as the error committed in the event that H1 is true. That is, it is necessary to use the [H1 True] heading in the upper left panel of Table 12. Moreover, H1 is given a specific non-zero numerical value in power analysis. This effectively changes the conceptual meaning of H1 from being an implication of non-chance influences to being the consequence of a specific efficient cause. This is reminiscent of the consequence of using 'effect' in its vernacular sense discussed in Section 22, "'Effect' - Vernacular and Technical Meanings." Consequently, H0 and H1 are no longer mutually exclusive and exhaustive in the power analytic account of NHSTP. More important, in making the meaning of H1 numerically specific, power analysts may have eschewed the distinction between the two types of internal validity. NHSTP is given the additional role that should be played by inductive logic.

The power analytic practice of making H1 numerically specific is consistent with the Multiple-H1 Assumption view that there are, in fact, multiple numerical alternatives to H0 (Neyman & Pearson, 1928; Rozeboom, 1960). However, this assumption should have no bearing on NHSTP, as may be recalled from the 'H0 and Research Manipulation Efficacy' discussion in Section 30. Why is there the emphasis on multiple numerically specific H1's? The answer may be the fact that the term 'alternative hypothesis' is also used in another sense, albeit at a different level of discourse.

Given any to-be-explained phenomenon, there are alternative explanatory theories at the conceptual level (Popper, 1968a/1959, 1968b/1962). This state of affairs may be characterized as the Reality of Multiple Explanations view in subsequent discussion. In actual fact, different psychologists often explain the same phenomenon with various substantive hypotheses. Moreover, diverse hypothetical structures or functions are postulated in these competing theories.

For example, some psychologists prefer Fillmore's (1968) case grammar or Yngve's (1960) 'Depth' model to Chomsky's (1957) transformational grammar. These three substantive hypotheses lead to different research and experimental hypotheses (à la the schema depicted in Table 1). As these experimental hypotheses may implicate different independent and dependent variables in diverse experimental situations, they lead to qualitatively different H1's. The distinction between the Multiple-H1 Assumption and the Reality of Multiple Explanations views depicted in Table 13 can be used to defend NHSTP against the Multiple-H1 Assumption critique of NHSTP.

Table 13. The distinction between statistical alternative hypothesis and alternative explanatory hypothesis


Multiple-H1 Assumption


Reality of Multiple Explanations

[a] H1:

(unegative - ukernel) < 0
[i] H1:
(unegative - ukernel) < 0

Ho:

(unegative - ukernel) = 0

Ho:

(unegative - ukernel) = 0

[b] H1':

(unegative - ukernel) = -3
[ii] H1':
(uca - ua) > 0

Ho':

(unegative - ukernel) = 0

Ho':

(uca - ua) = 0

[c] H1'':