The Development of the DISC Pre-School Screen (DPS)
ABSTRACT
The DPS is a first stage screen designed for the
detection of developmental delay in children from 5 to 52
months (and up to 59 with proration of scores). The test
was developed using Item Response Theory techniques. The
development of the test is described and classic
psychometric data are reported along with data on the
Item Response Theory parameter estimates. Each child is
administered 12 items, chosen on the basis of the child's
age in months. Corrected split-half reliability is .77
with a standard error of measurement of 1.2 items. Design
features which maximise discrimination at the decision
point suggest that these are lower bound estimates on
precision. Scores on the DPS correlate -.68 with a single
question asking about parental concern with development
and -.64 with the same question asked of pre-school
teachers.
The Diagnostic Inventory for Screening Children (DISC)
was developed in the early 1970s as a test allowing
"second stage" screening of children referred
for developmental assessment (Amdur, Mainland, &
Parker, 1988). The concept of a second stage screen came
from the demand for preliminary assessment of children
who had already been identified as possibly delayed
and referred for further assessment. At the Child and
Family Clinic at Kitchener-Waterloo Hospital it made
sense to do this second stage screening before a complete
multi-disciplinary assessment because of the number of
pre-schoolers being referred.
The DISC proved popular in South-Western Ontario with a
number of agencies including public health, pre-schools
and infant development programmes. As a growing number of
agencies began to use the test we discovered a somewhat
disconcerting phenomenon. Many agencies were choosing to
use the DISC as a first stage screen for children for
whom there was no reason to suspect delay. While we were
reasonably confident of the appropriateness of the DISC
for the task of second stage screening of referred
children, we were less certain of the appropriateness of
the DISC for first stage screening. Moreover, we were
confident that we could produce a better first stage
screen than the DISC using elements of the DISC.
We interviewed agency staff who were using the DISC as a
first stage screen, and they indicated that they were
uncomfortable with other instruments, usually focusing on
the Denver Developmental Screening Scale (Frankenberg and
Dodds, 1964). Many workers indicated that their agency
policies required them to use the Denver, but that they
rarely saw children who were delayed enough to produce
abnormal results on the Denver. This was true even when
they were confident that many of the children that they
had assessed were truly delayed. They found that the DISC
provided them with data that supported their impressions
of children's delays more frequently than the Denver, and
that DISC data were acceptable to agencies receiving
referrals.
This impression was supported when we examined some data
we collected in a validity study using the DISC (Parker,
Mainland and Amdur, 1990). In this study, 40 children who
had been referred to treatment agencies were assessed
with the DISC, Denver, and either the Stanford-Binet
(Terman & Merrill, 1972) or the Bayley Scales of
Infant Development (Bayley, 1968). These children had
been referred with spina bifida, cerebral palsy, speech
delays, behaviour problems, emotional problems, or
concerns about the family environment. Among these
children, it was clear that the DISC was much more
sensitive to delays than the Denver. The Denver produced
"Abnormal" results for only 8 of the 40
children (20%), and a further 16 were given
"Questionable" results. Seven of the 29
children assessed with the Stanford-Binet had IQs less
than 70, and the Denver was largely identifying these
children with retardation syndromes, and ignoring
children with less profound or more specific delays. The
DISC results showed that 31 of the children had two or
more scales with "Probable Delay", and four
more children had one scale with a "Probable
Delay". The DISC results were quite congruent with
the clinical judgements of the agency workers, suggesting
that the DISC was not only more sensitive to delays than
the Denver, but also at least as specific.
With these results in hand, and the clear local demand
for a primary screen that had properties like the DISC,
we decided to develop a primary screening device based on
the DISC. Like the DISC, the DISC Pre-school Screen (DPS)
is a test developed in clinical settings to meet a
specific clinical need.
Development of the DISC Pre-school Screen
Once we had decided to develop a first stage screen, the
obvious first step was to exploit the items and data
available from the DISC. Access to the DISC item pool
gave us a strong head start in developing a first stage
screening test. We had at our disposal a pool of 216
standardised and normed items covering eight major
domains of child development. We also had easy access to
a network of agency workers who provided suggestions and
support. The strong support of agency staff familiar with
the DISC allowed us to experiment with the structure of
the DISC Pre-school Screen (DPS) and to try out our
successive drafts with minimal delay.
The first draft of the DISC Pre-school Screen (DPS 1.0)
was prepared by selecting seven items from each of the
eight DISC scales (total of 56). We selected items that
required a minimum of equipment for administration. Items
were ordered so that a child would be given one item from
each scale in a total administration of eight items.
Start points were chosen according to the start points
used on the DISC. Minor issues of format and preliminary
estimates of validity were addressed in a study of 40
children who were given both DISC and DPS 1.0
administrations (Butler, 1985). A modified first draft
was prepared and circulated to agencies that were already
using the DISC. We asked workers familiar with the DISC
to examine the DPS 1.0, try it out and offer their
comments. At the same time, we applied for and were given
a grant from the Hospital for Sick Children Foundation to
develop the DPS.
By the fall of 1988 we had data from 412 screen
administrations, courtesy of agencies that had used the
untried screen in trial projects and of master's thesis
research by Sharyn Pope at the University of New
Brunswick. The sample has many unknown characteristics,
because it was a sample of convenience.
Item response theory was the primary psychometric model
for the development of the DPS (e.g. Hambleton and
Swaminathan, 1985). This family of statistical models
allows test development to proceed based on the
relationship between item performance and test scores,
largely independent of the composition of a particular
sample of test takers. Any given test is assumed to be a
sample of items measuring some construct (referred to as
a "latent trait"). The psychometric
characteristics of the test are a function of the
properties of the items included in the test. Item
characteristics can be described using one, two, or three
parameters which define a function relating the
probability of passing an item to the ability of the test
taker. The precision of the function is maximised by
adjusting the three parameters using maximum likelihood
estimation techniques. In practice, a computer programme
adjusts the three parameters in an iterative fashion to
find the values that have the highest probability of
reproducing the sample data.
A three parameter item response model estimates: 1) the
difficulty of an item (the test score that corresponds to
a 50% pass rate), 2) the discrimination of an item (a
measure of how quickly the probability of passing an item
rises from zero to one as a function of the ability of
the test-taker) and 3) the pseudo-guessing level (the
probability of a test taker passing an item beyond his or
her ability). Pseudo-guessing is most relevant to
multiple choice tests where a subject can easily be
correct by chance. For all practical purposes,
pseudo-guessing is zero in a developmental test, so we
chose to set the third parameter equal to zero and
concentrate only on discrimination and difficulty -- a
two parameter model.
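As a minimal sketch of the item function just described,
assuming the common logistic form (the text does not say
whether a logistic or a normal-ogive curve was fitted), the
three parameters can be written as follows. The function and
argument names are hypothetical.

```python
import math

def p_pass(ability, difficulty, discrimination, guessing=0.0):
    """Three-parameter logistic item curve (assumed form).

    With guessing = 0.0 this reduces to the two-parameter model.
    """
    logistic = 1.0 / (1.0 + math.exp(-discrimination * (ability - difficulty)))
    return guessing + (1.0 - guessing) * logistic

# With no guessing, a test taker whose ability equals the item
# difficulty has an even chance of passing the item.
print(p_pass(ability=20.0, difficulty=20.0, discrimination=0.5))

# A more discriminating item rises more steeply around its difficulty.
print(p_pass(22.0, 20.0, 0.3) < p_pass(22.0, 20.0, 1.0))
```

Note that the "50% pass rate" reading of difficulty holds
exactly only when the pseudo-guessing parameter is zero, which
is the case the authors adopt.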
The final item response model will describe the
probability of passing an item in a two-parameter model
as a function in which the difficulty of the item and the
discrimination of the item are constants and the ability
of the subject is the only variable. Ability is a latent
trait that must be estimated concurrently with the
discrimination and difficulty parameters in the iterative
estimation procedure. Rather than use this classic
two-parameter model, we chose a variant of the model. We
chose to substitute age for the latent trait of subject
ability. This modification embeds a validity component
into the model, and reduces the number of estimated
parameters by one.
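The age-based variant can be sketched as an ordinary maximum
likelihood fit of a logistic curve to pass/fail counts by
age. This is an illustration under stated assumptions, not
the authors' programme: the logistic form, the
Newton-Raphson iteration, and all names and data below are
hypothetical. Under this reading, difficulty is the age in
months at which half the children pass the item.

```python
import math

def p_pass_at_age(age, difficulty, discrimination):
    """Two-parameter logistic curve with age in place of ability."""
    return 1.0 / (1.0 + math.exp(-discrimination * (age - difficulty)))

def fit_item(ages, passes, totals, iterations=25):
    """Maximum likelihood fit by Newton-Raphson.

    Fits logit(P) = b0 + b1 * (age - mean_age), then converts to
    (difficulty, discrimination). Age is centred for stability.
    """
    n = sum(totals)
    mean_age = sum(a * t for a, t in zip(ages, totals)) / n
    b0, b1 = 0.0, 0.0
    for _ in range(iterations):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for a, y, t in zip(ages, passes, totals):
            x = a - mean_age
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            resid = y - t * p            # observed minus expected passes
            w = t * p * (1.0 - p)        # binomial variance weight
            g0 += resid
            g1 += resid * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return mean_age - b0 / b1, b1

# Hypothetical counts: children passing one item at each age.
ages = [6, 12, 18, 24, 30, 36, 42]
totals = [100] * 7
passes = [0, 3, 14, 50, 86, 97, 100]
difficulty, discrimination = fit_item(ages, passes, totals)
print(round(difficulty, 1), round(discrimination, 2))
```

Because age is observed rather than estimated, each item can
be fitted on its own, which is one practical consequence of
the substitution described above.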
The use of an eight-item screen for the DPS 1.0 proved to
have been an unfortunate choice. The span of only eight
administered items did not provide enough information for
a detailed statistical analysis. Many items were found to
be misplaced according to the two-parameter model, so the
Trial 1.0 data were inadequate for estimation under the
two-parameter model, and questionable for many classical
analyses.
When the DPS 1.0 was used with children, eight items were
administered to each child. The eight items to be
administered to a child were chosen on the basis of the
child's age in months. In order to test the adequacy of
the start points, the relationship between age and DPS
1.0 score (out of eight) was examined. If the start
points were appropriate there should be no significant
correspondence between age and DPS 1.0 score. This is
because we planned that the use of age-dependent start
points would partial out the effects of age from the
scores and the interpretation of test scores could be
independent of age. In fact there were significant
effects for both the Trial 1.0 (Chi-square = 517.0)
It was apparent from the analyses of the DPS 1.0 that the
items from the Self Help and Social Skills scales had the
lowest discrimination parameters. Of 10 items with
discrimination parameters under 0.30, seven were from the
Self Help and Social Skills scales, and the other three
were from items with difficulty levels outside the range
of measurement of the scale (and hence with suspect
discrimination parameters). This was consistent with
previous findings (Parker, Mainland & Amdur, 1991).
The Self Help and Social Skills scales are
psychometrically sound, but have the lowest reliability,
lowest loading on a single common factor of development
and highest specificity. While these scales have more
than adequate characteristics on their own, they do not
merge completely with the other six scales on the DPS
1.0.
DPS 2.x & 2.1: Development
The DPS 2.0 was developed in two stages. In the first
stage, the DPS 1.0 item set was reduced by deletion of
all Self Help and Social Skills items. Our modified item
response theory two parameter model (i.e. estimating
difficulty and discrimination, but using age instead of a
latent trait) was recalculated based on data derived from
the DISC normative sample for the reduced item set. One
of the advantages of item response theory is that it
provides a means of estimating standard error of
measurement as a continuous function of score. When this
was done, it was possible to identify gaps in the item
coverage and to determine the difficulty level of items
required to fill the gaps.
In the second stage, new items from the six DISC scales
other than Self Help and Social Skills were added to the
draft version of the DPS 2.x and the second stage version
was analysed. Three times as many items as required were
included and assessed in the preparation of the DPS 2.x.
Generally, those items with the best discrimination
parameter were retained. Where differences were small,
items were chosen to minimise clustering of items from
the same DISC scale, and items with little or no
administrative apparatus were chosen when possible. This
produced a 54-item scale which was analysed using Norm
Group data.
Method
Subjects.
All data for this analysis were derived from the DISC
normative sample.
Analyses.
A data set for DPS 2.x items was abstracted from the DISC
normative sample. Following modification, data for DPS
2.0 were obtained in the same way. These data were
subjected to a two parameter analysis using the Norm
Group data. Both DPS 2.x and 2.0 included all the items
from the DPS 1.0 except those from Self Help and Social
Skills, and six new items from the DISC were added, as
described above. They differed only in the start points
for administration. The source scales for the items are
listed in Table 1.
Results
The data for the two-parameter model are listed in Table
1. The mean discrimination parameter for the revised
version is significantly higher than the first version (t
= 2.68, df = 106, p. < .01).
Determination of start points for the DPS 2.x was an
iterative process. It was decided to administer 12 items to each
child in the DPS 2.x. Because the DPS 2.x is intended for
the detection of delay, it was decided to maximise the
discrimination power of the 12-item sets at a level below
the mean for a given age on the Norm Group data, and as
close as possible to the predicted value of the cut-off
criterion. Most scales maximise discrimination at about
the mean of the sample distribution. The design of our
scale requires that discrimination information be a
maximum at the criterion for detection of delay, which
would be substantially lower than the mean score.
In order to adjust start points, we treated the 54 items
of the DPS 2.x as if they were a 54-item scale with all
items administered to each child. Regression equations
were computed for the 54-item mean score for a given age
in months as a quadratic function of age (R = .995, F =
51923.8, df = 2, 564) and for standard deviation as a
function of age, mean score and mean score squared (R² =
.466, F = 127.6, df = 3, 439). This was not a standard
regression analysis as the values were mean scores
weighted by sample size rather than raw scores. This
technique tends to inflate the F and R² statistics and
also stabilises the point estimates of the means (which
is why we used it). Using estimated mean scores and
standard deviations, an estimate of the score
corresponding to z = -1.0 (16th percentile) was
determined for each age in months. Using these scores,
start points were chosen so that for each age group, the
seventh item in the 12 item screen had a difficulty level
that corresponded to the estimated z = -1.
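The start-point rule can be sketched as below. The sketch
assumes that items are numbered in order of difficulty, so
that a 54-item score of s corresponds roughly to item number
s; the text does not state the mapping explicitly. The mean
and standard deviation used in the example are hypothetical,
not the regression estimates.

```python
import math

N_ITEMS = 54    # items in the DPS 2.x pool (from the text)
SET_SIZE = 12   # items administered to each child (from the text)
PIVOT = 7       # position in the set matched to z = -1 (from the text)

def target_score(mean_score, sd_score, z=-1.0):
    """54-item score lying z standard deviations from the age mean."""
    return mean_score + z * sd_score

def start_item(target, n_items=N_ITEMS, set_size=SET_SIZE, pivot=PIVOT):
    """First item to administer so that the pivot item sits at the target.

    Assumption: item number and 54-item score share one scale.
    """
    pivot_item = int(math.floor(target + 0.5))   # nearest item number
    start = pivot_item - (pivot - 1)
    # Keep the 12-item window inside the item pool.
    return max(1, min(start, n_items - set_size + 1))

# Hypothetical age group with a mean score of 33 and an SD of 3.
t = target_score(33.0, 3.0)
print(t, start_item(t))
```

The clamping at the ends of the pool is one way to picture
the floor and ceiling limits on the age range; how the
authors handled those ages in choosing start points is not
stated here.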
This means that a child with an ability one standard
deviation below the mean will have six items easier than
his or her ability level and six items at or above his or
her ability level. This means that the child will get as
many questions as possible that provide information about
ability for that child, and the resultant score will be
as precise as the item set allows. By way of contrast, a
child who is at the mean will have about ten items easier
than his or her ability level (this varies with age) and
only about two at or above his or her level. As a result,
there are not as many items providing good information
about this subject and the estimate of score is less
accurate. Thus the 12 items for each age group were
chosen to maximise discrimination among the children
performing at the 16th percentile, at the cost of lower precision at
other ability levels.
When the distribution of 12-item DPS 2.x scores based on
DISC normative data was examined, we had achieved the
discrimination characteristics we were seeking. These
data were calculated including only children older than 4
months and younger than 51 months to avoid floor and
ceiling effects. However, despite our best efforts to
reduce the relationship between age and screen score by
choosing the correct start points, the relationship was
still significant (Chi-square = 625.9, df = 540, p. <
.01, Cramer's V = .35) albeit substantially reduced.
There was a small but significant correlation between age
and 12-item DPS 2.x score (r. = .197, df = 416, p. <
.001).
On inspection, it could be seen that the regression
equation had produced start points too easy for the
youngest children and too hard for the eldest children (a
regression to the mean phenomenon). Adjustments were made
to the slope of the regression equation (only) to
minimise this effect, and produce the DPS 2.0. When the
revised data were analysed the relationship with age had
been eliminated (Chi-square = 547.8, df = 540, p. is not
significant, Cramer's V = .33). There was no significant
correlation between age and 12-item DPS 2.0 score (r. =
-.044, df = 416, p. is not significant).
Table 2 lists classic psychometric measures for both the
54-item (i.e. as if we had administered every item to
each child) and 12-item-analyses (i.e. as if we
administered only 12 items, with the start point
carefully chosen to match the age of the child) of the
Norm Group data configured as the DPS 2.0. Note the
substantial increase in the reliability of the 12-item
version over the 8-item DPS 1.0 (Spearman-Brown increased
from .58 to .80).
The data in Table 3 indicate the distribution of scores
on the 12-item DPS 2.0 for children from 5 to 50 months.
The cumulative percentages indicate that there is
substantial leeway in the choice of potential cut-off
scores depending on the proportion of children an agency
chooses to refer. If all children with scores of 6 or
less are referred for further testing, these data suggest
that approximately 8% of children from a normal
population would be referred. A criterion of 5 or less
would refer about 6% of tested children. A criterion of 7
or less would refer about 15%. These estimates will
depend on the accuracy of the DISC norm data to predict
DPS 2.0 performance and were to be considered as
estimates until data were collected from a new sample. If
the population of children being screened is a high risk
group of some kind (e.g. low birth weight children), the
expected referral rates will be substantially higher
because the proportion of delayed children is higher.
Given the design of the DPS 2.0, the proportion of
referred children ought to be relatively constant
regardless of age in a uniform population.
Trial of DPS 2.0
We decided to try the DPS 2.0 with large numbers of
children to establish preliminary normative data and some
validity data on the relationship between DPS scores and
two other kinds of measures. Once again, we were given
strong agency support.
Method
Subjects.
Data came from 12 sources throughout Ontario, including a
very large data set from the Elgin-St. Thomas Public
Health Unit, that used the draft screen on a trial basis
for pre-kindergarten screening of about 800 children.
From the perspective of a researcher, the data were
collected in a non-systematic manner. Although the people
doing the data collection were well trained, competent,
and systematic in their approach to using the test, there
was no control over subject selection. One outcome of
this was the odd age distribution of the subjects. There
were only 49 children tested 40 months of age or younger.
There were 592 from 41 to 51 months inclusive, and 300
over 51 months. Thus the youngest group of children was
too sparse for solid statistical analysis.
Analysis.
In analysing the data, the first problem was how to deal
with the very large number of protocols with missing item
data and refused items. We decided to try each of three
strategies and assess the results. When both missing and
refused items were treated as missing data, only 697
cases were available with complete data, and the
corrected split-half reliability was .58. When misses and
refuses were both treated as failures, the corrected
split half reliability was .70 for 941 cases. When misses
were treated as missing data and refuses were treated as
failures, the corrected split-half reliability was .63
for 822 cases.
We chose to treat refused items as if they had been
failed, because doing this increases the internal
consistency. Although the same could be said of items
missed by the examiners, we decided that we could not
justify calling an item not administered by the examiner
a failure by the child, despite the improvement in
reliability. Therefore the scoring convention used in
subsequent analyses is that a missed item is treated as
missing data and a refused item is treated as a failure
to pass the item.
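The scoring convention and the corrected split-half
calculation can be sketched as follows. The odd/even split
and the toy protocols are assumptions made for illustration;
the text does not say how the halves were formed. The
response labels follow the "Yes", "No" and "Refuse" wording
used on the later recording form, with None standing for an
item the examiner did not administer.

```python
import math

def item_value(response):
    """Map a recorded response to 1 (pass), 0 (fail) or None (missing)."""
    if response is None:                     # item missed by the examiner
        return None
    return 1 if response == "Yes" else 0     # "No" and "Refuse" both fail

def pearson(xs, ys):
    """Pearson correlation between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)

def spearman_brown(r_half):
    """Step a half-test correlation up to full test length."""
    return 2.0 * r_half / (1.0 + r_half)

def corrected_split_half(protocols):
    """Corrected split-half reliability over complete protocols only."""
    first, second = [], []
    for responses in protocols:
        values = [item_value(r) for r in responses]
        if None in values:                   # drop cases with missing data
            continue
        first.append(sum(values[0::2]))      # assumed odd-position half
        second.append(sum(values[1::2]))     # assumed even-position half
    return spearman_brown(pearson(first, second))

# Toy protocols of four items each (hypothetical data).
protocols = [
    ["Yes", "Yes", "Yes", "Yes"],
    ["Yes", "Yes", "Refuse", "No"],
    ["No", "No", "No", "No"],
    ["Yes", None, "No", "No"],               # excluded: one item missed
]
print(corrected_split_half(protocols))
```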
The structure of the DPS 2.0 was based on item response
theory. Specifically a two parameter model was used,
assessing item difficulty and item discrimination with
respect to age. Because the sample sizes were relatively
small for such an analysis, it was decided that a
two-parameter model could not be estimated with any
precision. We also noted that the structure of the test
favoured a one-parameter model. The use of a constant
number of items (12) with start points shifting with age
causes the items to be treated as interchangeable units,
differing only in difficulty, not discrimination.
A Rasch single parameter model (i.e. estimating only
difficulty and assuming that discrimination is constant
from item to item) was estimated in each case. Where
numbers of subjects warranted, Rasch difficulty levels
were estimated for each item. In order to complete this
analysis, it was required to break the subjects into
groups with the same start point (who therefore received
the same set of 12 items) and analyse each group
separately. The normal item response theory latent trait
estimate of subject ability had to be used in this case
rather than age, because all children in each group were
of the same age.
Results
The results of the estimates of item difficulty are
listed in Table 4. Note that item difficulty is on a
common scale within a column, but that each column is
scaled differently. It was apparent from the analysis of
item difficulty that some items were consistently
misplaced in order of difficulty on this draft (e.g. item
number 52, which is easier than every item numbered higher
than item 46 and some lower than 46). As a result of this
analysis it was decided to reorder the items as indicated
in Table 4 for use with DPS 3.0.
One impact of this lack of rank order would be to reduce
the internal consistency of the 12-item scale collapsed
over specific items. The intent of the DPS is for any
particular item to function as an effective first
(easiest), second, third, and so on up to the twelfth
(hardest) item, depending on the age of the child being
tested. Thus the same item plays twelve different
difficulty roles in the scale depending on age. This can
only work properly if the items are ordered and spaced
properly. It was apparent that the item ordering for
these older items was not correct. For this reason, the
split-half reliability of the scale was computed
separately for each start point (and therefore each
particular set of twelve items) and the weighted mean was
computed using a technique outlined in Hedges and Olkin
(1985) for use in computing mean correlation. For the
younger ages, the sizes of samples with the start point
were often as small as one or two children, so we only
report data from samples with at least 30 children. The
result was a mean corrected split-half reliability of .66
for 765 subjects who were started with items 36 to 43.
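A common way to average correlations across groups is to
transform each to Fisher's z, weight by n - 3, and transform
the weighted mean back. The text does not give the formula
that was used, so treating this as the technique is an
assumption, as are the names and the example values below.

```python
import math

def weighted_mean_r(groups, min_n=30):
    """Weighted mean correlation over (r, n) pairs.

    Groups smaller than min_n are dropped, mirroring the
    30-child reporting threshold mentioned in the text.
    """
    kept = [(r, n) for r, n in groups if n >= min_n]
    total_weight = sum(n - 3 for _, n in kept)
    mean_z = sum((n - 3) * math.atanh(r) for r, n in kept) / total_weight
    return math.tanh(mean_z)

# Hypothetical per-start-point reliabilities and sample sizes.
groups = [(0.60, 120), (0.70, 200), (0.65, 90), (0.20, 2)]
print(round(weighted_mean_r(groups), 2))
```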
It is convenient to compute a confidence interval that
matches the use of the test. With a standard deviation of
2.0, the standard error of measurement becomes 1.16. The
interval from the centre of one item's range to its edge
would be 0.5. (Note that a test score of 6 occupies a
theoretical range from 5.5 to 6.5, a width of 1.0,
with the assigned score occupying the centre of the
range.) From the centre of one item to the edge of the
range of the next item would be 1.5, which produces a
one-sided 10% confidence interval (given an SEM of 1.16).
Assuming that a score of six or lower is chosen as the
criterion for suspecting delay, if a child produces a
score of six, then the chances that the true score was as
high as eight would be 10% or less under classical test
theory. The confidence interval is one-sided because we
don't care if the score is less than 4 -- it won't change
our interpretation in the least. The same argument can be
made with a criterion of seven or lower: a score of seven
has a 10% or smaller chance of coming from a true score
as high as nine. Thus, whatever criterion is used, a
one-item buffer or "don't know" range should
account for uncertain scores.
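The arithmetic behind this buffer can be checked with a short
sketch. It uses the classical formula SEM = SD * sqrt(1 -
reliability) and a normal error distribution, both standard
under classical test theory. With the rounded inputs quoted
in the text (SD of 2.0, reliability of .66) it gives a value
close to the reported 1.16; the function names are
hypothetical.

```python
import math

def sem(sd, reliability):
    """Standard error of measurement under classical test theory."""
    return sd * math.sqrt(1.0 - reliability)

def upper_tail(z):
    """One-sided normal probability of exceeding z standard errors."""
    return 0.5 * math.erfc(z / math.sqrt(2.0))

error = sem(sd=2.0, reliability=0.66)
# Distance from the centre of one score to the far edge of the next.
tail = upper_tail(1.5 / error)
print(round(error, 2), round(tail, 3))
```

The tail probability comes out at roughly one in ten, which
is the basis for treating a single score point above the
criterion as the "don't know" range.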
The intent for the DPS is that the scores obtained will
be invariant with age. For children from 39 months to 52
months of age, scores were cross-tabulated with age.
These age groups were selected because they have a large
enough sample size, and are within the design range of
ages where there is no need to prorate for age. After
elimination of children with missing items or with
incorrect start points, there were 485 subjects.
Chi-square with 143 degrees of freedom was 198.2, which
is significant at the .01 level. This indicates that
there is a significant relationship of some kind with
age. Cramer's V for this data set is .19, indicating that
the magnitude of the relationship is small, and
Spearman's Rho is -.03, indicating that it is not a
linear trend. Inspection of the table indicates that
performances at three ages (42, 46, and 51 months) were
either better than average (42 months) or worse than
average (46 and 51 months). The distortions were small --
one item or less -- suggesting that examination of the
means would be worthwhile.
Chi-square looks at the distribution of scores using a
non-parametric model. An analysis of variance of these
same data assesses the distribution of scores from a
parametric model and produced an F (13, 469) of 1.88
which is significant at the .05 level. The corresponding
Omega squared is .023.
Both of these analyses make it apparent that the
distribution of scores varies slightly, but significantly
across the 39 to 52 month age span. However, in clinical
use of the DPS the entire range of the score distribution
(0-12) is not used. The DPS was designed to make a
dichotomous decision (i.e. delayed/not delayed) around a
criterion score, with a narrow range of declared
uncertainty. The DPS was also designed to maximise
precision of measurement at the cut-off point, but
sacrificed precision at higher and lower scores. The
correspondences between age and score, based on the
entire distribution of scores, are therefore possibly
irrelevant to the test as it is to be used.
We decided to analyse the data in the form that they
would be used. The data were divided into three
categories: possible indication of a delay, uncertain
indication, and no indication of delay. The
correspondence between these three categories and age was
examined in a separate analysis for each of two cut-off
criteria. In one analysis, all scores below seven (11.49%
of scores) were labelled "possible delay", all
scores equal to seven (6.26%) were labelled
"maybe" and scores above seven were labelled
"no delay" (82.25%). The analysis was repeated
using a "maybe" value of eight with percentages
of 17.75 for "possible delay", 12.53 for
"maybe" and 69.72 for "no delay". The
selection of a "maybe" category only one item
wide was governed by the one-sided 10% confidence
interval of 1.5. For both the seven and eight criteria
there was no significant impact of age (Chi-square with
26 degrees of freedom was 29.6 for a "maybe" of
seven and 26.5 for a "maybe" of eight). Both
values are clearly non-significant, and are, in fact, very
close to the expected value of Chi-square in a random
situation. Note that the change from 13 categories
(scores from 0 to 12) to three categories (possible
delay, maybe, no delay) caused a reduction in degrees of
freedom from 143 to 26 for Chi-square analyses with
corresponding increases in average cell size and in the
power of the test to detect a difference.
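The three-way recoding used in this analysis can be sketched
as follows. The category labels come from the text; the
function name, the (age, score) pairs and the tally are
hypothetical.

```python
from collections import Counter

def classify(score, maybe_value):
    """Recode a 12-item score around a one-item-wide 'maybe' band."""
    if score < maybe_value:
        return "possible delay"
    if score == maybe_value:
        return "maybe"
    return "no delay"

# Hypothetical (age in months, score) pairs tallied by age and category.
records = [(39, 5), (39, 9), (40, 7), (40, 11), (41, 8)]
table = Counter((age, classify(score, maybe_value=7)) for age, score in records)
for (age, category), count in sorted(table.items()):
    print(age, category, count)
```

A table of this shape, with ages as rows and the three
categories as columns, is what the Chi-square statistics
above summarise.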
The distribution of test scores in the age range 39 to 52
is given in Table 5 along with the expected distributions
based on the data from the DISC norms. The differences
between expected and observed distributions are small
enough to be ignored for clinical purposes. Different
agencies may choose to use different criteria for
referral for more detailed testing. These data indicate
that a criterion of six or lower will refer 11.5% of
children from 39 to 52 months of age, and leave about
6.3% of children in an uncertain category. A criterion of
seven or lower will lead to the referral of 17.8% of
children and leave 12.5% of children in an uncertain
category. These distributions are very close to the
expected distributions, suggesting that they will
generalise well to the (untested) younger age groups as
well.
There remained the problem of dealing with children older
than 52 months. These are children for whom we were
unable to find items difficult enough to form the most
difficult items of the 12-item set. However, the
structure of the test is such that this should not be a
crippling problem. The most discriminating items on each
testing are the ones of middle difficulty (i.e. roughly
the fifth, sixth, seventh and eighth items). For older
children, there will not be enough of the most difficult
items, but the most discriminating items are available --
albeit as the last few (most difficult) rather than the
middle items.
While new items are being assessed in an attempt to find
difficult items for these children, a strategy of
proration has been attempted as an interim measure. The
appropriate proration values were sought by using the
data from the younger age group to establish criterion
percentiles. One reasonable cut-off score for referral is
a value of 6 (the 11.5th percentile). A score of 7 or
lower will include 17.74% of children, and a score of 8 or
lower will include 30.3% of children. Across the age
range from 53 months to 63 months, the 11.5th percentile
changes with age from a score of 6 to a score of 9, the
17.75th percentile moves from 7 to 9, and the 30th
percentile moves from 8 to 9. The curve plotting these
percentiles approximates a decelerating quadratic curve
asymptotic at 9.
Given the decelerating change in score with age it
appeared legitimate to collapse the norms across
increasingly larger age ranges. We decided to move from
the two month intervals at the end of the DPS to a three
month interval for 53 to 55 months and a four month
interval from 56 to 59 months. Given the asymptotic
nature of the curve, it was decided not to prorate after
59 months. The appropriate proration procedure is as
follows: subtract one from the obtained score for
children 53 to 55 months and subtract 2 from the obtained
score for children 56 to 59 months. The resultant score
can be interpreted much as the scores for younger
children, but the percentiles are not exact. A score of 6
or lower is at the 7th rather than the 11th percentile, a
score of 7 is at the 14th rather than the 18th
percentile, and a score of 8 is at the 33rd rather than
the 30th percentile. Given a sample size of 126 children
in this age range, it would be inappropriate to attempt
to refine the proration any more than this.
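The proration rule can be written as a small function. The
age bands and adjustments come from the text; raising an
error beyond 59 months is an assumption about how to signal
that no proration is defined there, and the function name is
hypothetical.

```python
def prorated_score(score, age_months):
    """Apply the interim proration for children older than 52 months."""
    if age_months <= 52:
        return score            # within the design range: no adjustment
    if age_months <= 55:
        return score - 1        # 53 to 55 months
    if age_months <= 59:
        return score - 2        # 56 to 59 months
    # Assumption: the text only says scores are not prorated after 59 months.
    raise ValueError("no proration is defined beyond 59 months")

print(prorated_score(8, 50), prorated_score(8, 54), prorated_score(8, 57))
```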
The correlation between age and the 54-item score
(assuming passing of items before the start point and
failure of items past the end point) was computed. Unlike
the 12-item score, we expect this to be strongly
correlated with age. The relationship is non-linear, and
this is reflected in a significant quadratic component.
Age and squared age have a multiple correlation
(including the linear and quadratic components) with the
54-item score of .95 for 822 children with complete data
ranging in age from 1 to 78 months.
Discussion
We have outlined the main psychometric analyses of the
data collected using the DPS 2.0. The scale performed
much as was predicted using the data from the normative
data set of the DISC. Item order was incorrect, but this
is reasonable to expect on a first attempt, given
problems with item ceilings in the DISC data set.
Although the internal consistency was lower than expected
(.68 instead of .80), the standard error of measurement
was not far from the value expected (1.5 as opposed
to 1.1). Moreover, the changes made in item order should
work to improve internal consistency. Because the change
in item order does not change more than one or two of the
particular set of items administered to a child, the
summed scores are unlikely to change very much.
Without recourse to statistical analysis, it was apparent
that a number of minor revisions would have to be made to
the DPS 2.0. These revisions were primarily aimed at
reducing the likelihood of administrative errors.
A frequent problem was the use of the wrong start point
in administering the test. The most common error of this
type was the use of an item number as if it were the age
starting point for administration (e.g. item #37 for a 37-month-old
child instead of item #33 -- the proper start point).
This was easy to fix by changing the administration form.
Item numbers were eliminated. Another common error was to
fail to administer all of the 12 age-appropriate items to
a child. The revision aimed at this problem was to change
the recording of a response. Instead of inserting checks
into three columns, the tester was asked to circle one of
three words for each item, depending on the response of
the child ("Yes", "No", and
"Refuse"). The third common administration
error was the failure to test every element of multiple
element items (e.g. identifies 6 of 8 colours). To
correct this, the response recording was organised in a
way that made omissions more salient.
Having completed these analyses and attendant changes we
produced the DPS 3.0. We were comfortable in accepting
the DPS 2.0 as a standardised, face-valid scale with
reliabilities as reported in the text for the age range
from 40 to 52 months. The success of the DISC normative
data in predicting the performance of the items for older
children was good with respect to the distribution of
scores, and weaker with respect to item order and
reliability. It was reasonable to expect that the DPS 3.0
would be similar for the younger age groups, producing
equally reliable data. The proration to 59 months will
probably be accurate, but a new set of more difficult
items is currently being assessed, so that the proration
is likely to be only a transitional measure.
To this point we have reported minimal data assessing the
validity of the scale (good age correlations). Validity
depended on the construction of the scale (which was a
classically face valid test based on both expert
nomination and statistical selection among nominated
items), and the high item-wise correlation with age. The
high correlation of age with performance is evident in
the high multiple correlation of the 54-item score with
age and squared age.
Reliability and validity of DPS 3.0
As part of the development of the DISC Pre-school Screen,
it was considered important to collect data assessing
validity. With a small budget and the primary use of data
volunteered by co-operating agencies, the scope for
validation instruments was rather limited. It was decided
to produce a brief checklist-type developmental
questionnaire to be completed by a parent as the primary
validation instrument.
A review of the paediatric literature on high risk
pregnancies and deliveries produced a number of potential
questionnaire topics. Other topics were culled from the
questionnaires used by a large number of agencies that
kindly sent us copies of the forms that they use
routinely. Items were drafted with attention to use of
the simplest language possible. Responses were organised
as a Yes/No forced choice with spaces for clarification
or amplification. While the bulk of the items are worded
so that a Yes response is associated with increased risk,
a number are also quite clearly inverted. The criterion
for the valence of the question was clarity of
expression. The final product was organised into six
sections: demographic data, prenatal history of the
mother, birth history of the child, infant history,
childhood history, and family history. Embedded in the
childhood question was an eight-part section that asked
the parent to indicate any specific concerns she might
have with the development of the child. This section was
organised to correspond with the eight DISC scales.
Method
Subjects.
A list of agencies using the DISC is maintained for
purposes of sharing new information about the DISC.
Agencies on this list were sent a letter asking if they
would be interested in participating in the development
of the DISC Pre-school Screen. Twenty-four agencies
volunteered to collaborate in the development of the DISC
Pre-school Screen.
In the Fall of 1989, they were sent copies of the DPS 2.0
and (somewhat belatedly) copies of the developmental
questionnaire for parents and for pre-school teachers.
Five of the volunteer agencies sent in completed copies
of the questionnaire with the DPS 2.0 scores that they
had collected, for a total of 62 subjects with
questionnaires. Not every questionnaire was complete, and
not every child had a score on the DPS 2.0. As a result,
the sample size is different for almost every statistic.
These data are identified as the DPS 2.0 data.
Revisions were made to the DPS and the Parent
Questionnaire following the DPS 2.0 data collection. The
revised versions (including DPS 3.0) were sent to
agencies in late 1989. Eleven agencies supplied data
based on these revised materials. These data are labelled
the DPS 3.0 data.
Although fathers were invited to complete the Parent
questionnaire, all of the submitted forms were completed
by mothers.
Results
Reliability of DPS 3.0
The corrected split-half reliability of the DPS 3.0 was
measured based on 87 children from the DPS 3.0 data,
between 5 and 52 months inclusive, collapsed across age
groups. The value was .77 -- a substantial increase from
the value of .66 found in the DPS 2.0, and very close to
the value predicted from the data derived from the DISC
norms (.80). The standard error of measurement was 1.22,
very close to the DPS 2.0 value of 1.16 and to the value
of 1.1 predicted from the DISC norm data.
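The relation between a corrected split-half reliability and a standard error of measurement can be sketched in Python as follows. This assumes that "corrected" refers to the usual Spearman-Brown step-up of the half-test correlation and that the standard error of measurement is the score standard deviation times the square root of one minus the reliability. Neither formula is spelled out in the text, and the function names are illustrative.

```python
import math

def spearman_brown(r_half: float) -> float:
    """Step a correlation between two half-tests up to full-test length."""
    return 2 * r_half / (1 + r_half)

def standard_error_of_measurement(sd: float, reliability: float) -> float:
    """Classical SEM: score standard deviation times sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

def implied_sd(sem: float, reliability: float) -> float:
    """Invert the SEM formula to recover the score standard deviation."""
    return sem / math.sqrt(1 - reliability)

# Reported DPS 3.0 values: reliability .77, SEM 1.22 items.
dps3_sd = implied_sd(1.22, 0.77)
```

Under these assumptions, the reported pair of values would correspond to a score standard deviation of roughly 2.5 items. That figure is implied by the formula rather than reported in the study.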
Properties of validation instruments.
The parent questionnaire was organised into clusters of
related items. Each cluster of items was treated as a
separate measurement scale by scoring each item as 1 when
the answer indicated risk and as 0 when the item
indicated no risk. The values for reliability are
reported in Table 6 for both the DPS 2.0 and DPS 3.0 data
collections. Reliabilities were computed using a
corrected split-half correlation, because of the
substantial skew in the distributions.
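The cluster scoring described above, including the inverted items, might look like the following Python sketch. The item names and the answer key are invented for illustration and are not taken from the actual questionnaire. Treating an unanswered item as contributing 0 is also an assumption.

```python
def score_cluster(answers, risk_key):
    """Count the items in one cluster whose answer indicates risk.

    answers:  dict mapping item id -> "Yes" or "No" (items may be missing)
    risk_key: dict mapping item id -> the answer that indicates risk
    """
    return sum(
        1
        for item, risk_answer in risk_key.items()
        if answers.get(item) == risk_answer
    )

# Hypothetical birth-history cluster; the item names are invented.
BIRTH_HISTORY_KEY = {
    "premature_birth": "Yes",       # Yes indicates risk
    "needed_oxygen": "Yes",         # Yes indicates risk
    "went_home_with_mother": "No",  # inverted item: No indicates risk
}

example_answers = {
    "premature_birth": "Yes",
    "needed_oxygen": "No",
    "went_home_with_mother": "No",
}
birth_history_score = score_cluster(example_answers, BIRTH_HISTORY_KEY)
```

Keeping the risk-indicating answer in a per-item key lets inverted items be scored without special cases.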
The cluster of items assessing specific developmental
concern was dropped from the second analysis (i.e. using
the DPS 3.0) for two reasons. First, the correlation with
the DPS 2.0 was found to be very low (-.04); second,
the reliability was too high (.95). The latter reason
seems paradoxical until it is recalled that if the eight
different developmental concerns were indeed specific,
there ought to be a low internal consistency.
The reliabilities were generally modest to low. Prenatal
History (r. = .62, .58 in DPS 2.0 and DPS 3.0 data
collections respectively) and Birth History (.63, .52)
values are low enough to suggest caution in trying to
interpret the scale clinically. Infant history is quite
respectable (.80, .86). The drop in reliability for
Childhood History (.82, .43) probably reflects a
typographical error on the second version of the scale
that misaligned the question and answer columns, leading to
some confusion among those answering and scoring. Family
History was respectable, but not high (.78, .75).
The parent scales can be compared to the teacher
questionnaire, and the results for relationships between
some single teacher items and the parent scales are
reported in Table 7. The teacher's indication of concern
about the child's delay correlates with both Infant (r. =
.28) and Child History (r. = .37). The teacher's
indication of a medical problem that would tend to
indicate delay correlates with Birth (r. = .41) and
Infant History (r. = .41). A question about social or
family concerns correlates with the Family scale on the
Parent Questionnaire (r. = .49). These results suggest
that the teachers' perceptions of delay are compatible
with parental reports of infant and childhood history
(i.e. more recent history) while perceptions of medical
problems implying delay are more congruent with the
parental Birth and Infant history (i.e. more distant
history). Perceptions of social or family problems tally
reasonably well with parental report of family issues,
and little else.
Validity measures.
DPS 3.0 scores show no correlation with age or age
squared between 4 and 53 months (R = .164, df = 2, 84, p.
is n.s.). The distribution of classifications (delay,
maybe, no delay) is not associated with age (Chi-square =
63.0, df = 80, p. is n.s., N = 114), although the data
are sparse for such a large table. Nevertheless, the data
suggest that the DPS 3.0 score is independent of age for
these data (as we hoped it would be).
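The non-significance of the multiple correlation can be checked with the standard F test for a multiple R, sketched here in Python. The function name is illustrative. The reading of "df = 2, 84" as two predictors (age and age squared) and 84 residual degrees of freedom is an assumption, although it is consistent with a sample of 87 children.

```python
def f_from_multiple_r(r: float, df_model: int, df_resid: int) -> float:
    """F statistic for testing a multiple correlation R against zero."""
    r_squared = r ** 2
    return (r_squared / df_model) / ((1 - r_squared) / df_resid)

# Reported values: R = .164 with df = 2, 84.
f_age = f_from_multiple_r(0.164, 2, 84)
```

The resulting F is a little under 1.2, well short of the critical value of about 3.1 for 2 and 84 degrees of freedom at the .05 level, which agrees with the reported result.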
Correlations of each parent scale with the raw score on
the screen (DPS 2.0 and DPS 3.0, separately) were computed,
but only for those children who fell between the ages of
4 and 53 months. Other children show ceiling effects
because of the limits of the scale in the current
administration. All correlations were tested as
one-tailed hypotheses. One item answer was also tested
because of its specific interest: the question asking if
the parent believes her child is developing normally.
Results are listed in Table 8.
Parental concern about the child's development (a single
question) is strongly correlated with both DPS 2.0 and
DPS 3.0 scores (r. = -.63, -.68). The small increase is
consistent with the increased reliability of the DPS 3.0
over the DPS 2.0. Childhood history also shows a
consistent correlation with DPS scores (r. = -.54, -.47).
The other values are somewhat inconsistent, as might be
expected based on low reliability, and poor replication
between teacher and parent report. Family History does
replicate well from parent to teacher report, but does not
replicate from DPS 2.0 to 3.0.
Data on Teacher items are reported for DPS 3.0 data only.
The teacher's concern for the child's development
correlates highly with DPS 3.0 score (r. = -.64), and
moderately with parental concern (r. = .48).
Parental judgement of delay corresponds well with DPS 3.0
classification (Chi-square = 16.3, df = 2, p. < .0001,
Spearman rho = .58), as does teacher concern (Chi-square
= 26.3, df = 2, p. < .0001, Spearman rho = .52). In 10
of 11 cases where parents expressed concern about
development, the DPS 3.0 indicated delay (Sensitivity =
.91). In 26 of 35 cases where parents indicated no delay,
the DPS 3.0 indicated no delay (Specificity = .76), and
one case fell into the uncertain category. For teachers,
in 18 of 34 cases where teachers indicated delay, DPS 3.0
scores indicated delay (Sensitivity = .62), with 5 cases
falling in the "maybe" range. In 40 of 46 cases
where teachers did not indicate delay, the DPS 3.0
indicated no delay (Specificity = .87).
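The sensitivity and specificity figures above can be reproduced from the reported counts with the Python sketch below. The text does not say how cases in the uncertain ("maybe") category were handled. Dropping them from the denominator is an assumption, adopted here because it reproduces the reported values. The function name is illustrative.

```python
def agreement_rate(agreeing: int, total: int, uncertain: int = 0) -> float:
    """Proportion of judged cases on which the DPS 3.0 agrees.

    Cases the screen placed in the uncertain ("maybe") category are
    removed from the denominator (an assumption, see the note above).
    """
    return agreeing / (total - uncertain)

# Parent judgement
parent_sensitivity = agreement_rate(10, 11)               # concern, DPS delay
parent_specificity = agreement_rate(26, 35, uncertain=1)  # no concern, DPS no delay

# Teacher judgement
teacher_sensitivity = agreement_rate(18, 34, uncertain=5)
teacher_specificity = agreement_rate(40, 46)
```

Rounded to two places these come to .91, .76, .62 and .87, matching the figures in the text. If the uncertain cases were instead kept in the denominator, the parent specificity and teacher sensitivity would be somewhat lower.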
Discussion
The reliability of the six validity scales derived from
the parent questionnaire is generally respectable,
although the prenatal and birth history scales show
somewhat lower reliability than might be desired in a
clinical scale. This is reasonable, given the variety of
potentially unconnected events associated in these
scales. The correlation between the DPS and the scales is
lower for the more distant history scales and higher for
the more recent history scales with the exception of
Family history.
Relationships between the teacher and parent
questionnaire items allow assessment of their validity.
Not all teacher items support all parent scales. SES and
adjustment to pre-school seem to be unrelated to
developmental issues. Social stressors seem to correspond
from parent to teacher ratings, but also seem to be
unrelated to development.
The three parent scales with the highest reliability and
the most recent history of the child show the highest
correlation with scores on the DPS. Moreover, the
correlation between parent's expressed concern with a
child's development and score on the screen is a
remarkable -.63 (DPS 2.0) and -.68 (DPS 3.0) on the two
editions. The teacher judgements of child development are
also supportive of validity. Specificity of the DPS 3.0
may be better with respect to parental judgement of delay
than with teacher judgement, but sample sizes are too
small to test this possibility.
The data reported in these studies provide quite
reasonable support for the contention that the DPS is a
valid measure of developmental delay in the children we
tested. The children tested were a good cross-section of
children in south-western Ontario who attend pre-schools
with a therapeutic orientation as well as a very good
cross-section of children coming to pre-kindergarten
screening clinics. The data were disproportionately
representative of children over 39 months of age. While
we have data on younger children, they are sparse. The
goodness of fit between data on older children and what
was predicted from DISC norms suggests that the DPS for
the younger group will perform as predicted, too.
However, the younger group has not been well evaluated in
these studies.
The validity studies are reasonably persuasive. DPS 3.0
scores are independent of age, but strongly related to
parent and teacher judgements of delay and risk factors
collected from developmental history questionnaires.
A number of tasks lie ahead in the development of the
DPS. It will be important to evaluate the test using
larger numbers of children under three years of age. We
would also benefit from a few more items at both the
youngest and eldest extremes of the test. At present we
can assert with reasonable confidence that the DPS 3.0 is
a first stage screen that has proven to be a valid
predictor of possible developmental delay for children
between 39 and 53 months of age. We can also infer with
reasonable confidence that it will prove to be equally
valid for children as young as 4 months and as old as
five years, and that it will be possible to extend the
test both to younger ages and older ages.
It is important not to use this test to rule out delay.
It is intended to be a first stage screen of children
with no other indication of delay. If the DPS shows no
indication of delay, but a parent, teacher or other
professional is concerned about development, a more
detailed assessment is indicated.
References.
Amdur, J. A. & Mainland, M. K. (1984).
The Diagnostic Inventory for Screening Children.
Kitchener-Waterloo Hospital.
Amdur, J. A., Mainland, M. K., & Parker, K. C. H.
(1988).
The Diagnostic Inventory for Screening Children, Second
Edition.
Kitchener-Waterloo Hospital.
Amdur, J. A., Mainland, M. K., Parker, K. C. H. &
Portelance, F. (in press). Méthode d'évaluation
diagnostique du développement des enfants.
Kitchener-Waterloo Hospital.
Butler, Janice. (1986). Validation of the Screen for the
Diagnostic Inventory for Screening Children. Unpublished
bachelor's thesis: University of Waterloo.
Bayley, N. (1969). Bayley Scales of Infant Development.
New York: Psychological Corporation.
Cadman, D., Walter, S. D., Chambers, L. W., Ferguson, R.,
Szatmari, P., Johnson, N., & McNamee, J. (1988).
Predicting problems in school performance from pre-school
health, developmental and behavioural assessments.
Canadian Medical Association Journal, 139, 31-36.
Frankenburg, W. K., & Dodds, J. E. (1969). Denver
Developmental Screening Test.
Colorado: University of Colorado Medical Centre.
Hedges, L. V., & Olkin, I. (1985). Statistical
methods for meta-analysis. Orlando, Fl.: Academic Press.
Illman, C. (1987). A validity study of the Diagnostic
Inventory for Screening Children (DISC) using teacher
observations. Unpublished bachelor's thesis. Waterloo,
Ontario: Wilfrid Laurier University.
Parker, Kevin C. H., Mainland, Marian and Amdur,
Jeanette. (1990). The Diagnostic Inventory for Screening
Children: Psychometric, factor and validity analyses.
Canadian Journal of Behavioural Science, 22, 361-376.
Hambleton, Ronald K., & Swaminathan, H. (1985). Item
Response Theory: Principles and Applications.
Kluwer-Nijhoff: Boston.
Terman, Lewis M., & Merrill, Maud A. (1972).
Stanford-Binet Intelligence Scale.
Manual for the Third Revision Form L-M. Boston: Houghton
Mifflin Co.