Posted by bluesky, Thursday, July 11, 2013

Measurement And Testing

As discussed earlier, measurement is an essential component of the evaluation
process. It is a critical part because the resulting decisions are only as good
as the data upon which they are based. In a general sense, data collection is
involved in all phases of evaluation – the planning phase, the process phase and
the product phase. Measurement, however – the process of quantifying the degree
to which someone or something possesses a given trait – normally occurs in the
process and product phases.
Testing is necessary at certain points and useful at others. Testing can be conducted
at the end of an instruction cycle – semester, term or unit. Such posttesting is
for the purpose of determining the degree to which objectives (formal or informal)
have been achieved, be they instructional objectives or program objectives.
Frequently, pretest or baseline data are collected at the beginning of the cycle.
Pretests serve several purposes, the most important being that knowledge of the
current status of a group may provide guidance for future activities as well as a
basis of comparison for posttest results.
There are a variety of situations where testing is useful. A teacher may administer
tests of entry behaviour to determine whether assumed prerequisites have indeed
been achieved. A special project designed to reduce dropouts may administer
attitude tests and tests of personality variables such as introversion, aggression
and anxiety in an effort to identify potential dropouts or to better understand
students having difficulties. A school may administer tests of scholastic aptitude
in order to determine realistic achievement goals for students and to assist in the
guidance process.

Data Collection

There are three major ways to collect data:
i) administer a standardized instrument
ii) administer a locally developed instrument
iii) record naturally available data (such as grade point averages and absenteeism)
Depending upon the situation, one of these ways may be most appropriate, or a
combination may be required. Collection of available data, requiring minimum
effort, sounds very attractive. There are not very many situations, however,
for which this type of data is appropriate. Even when it is appropriate – that is,
when it will facilitate the intended decision making – there are problems inherent
in this type of data. For example, the same letter grade does not necessarily
represent the same level of achievement, even in two different classes in the same
school or two different schools in the same system. Further, the records from which
the data are taken may be incomplete and disorganized. Developing an instrument
for a particular purpose also has several major drawbacks. The development of
a ‘good’ instrument requires considerable time, effort and skill. Training at least
equivalent to a course in testing and evaluation is necessary in order to acquire
the skills for good instrument development.
In contrast, the time it takes to select an appropriate instrument (usually from
among standardized, commercially available instruments) is almost always less than
the time it takes to develop an instrument which measures the same thing. Further,
standardized instruments are typically developed by experts who possess the
necessary skills. Thousands of standardized instruments are available which yield
a wide variety of data for a wide variety of purposes. Major areas for which numerous
measuring instruments have been developed include achievement, personality and
aptitude. Each of these can in turn be further divided into many subcategories. In
general, it is usually a good idea to find out whether a suitable instrument is already
available before jumping into instrument development. There are situations,
however, for which use of available instruments is impractical or inappropriate.
A teacher-made test, for example, is more appropriate or valid for assessing the
degree to which students have achieved the objectives of a given unit.

Classification Schemes

At this point it must be emphasized that a test is not necessarily a written set of
questions to which an individual responds in order to determine whether he/she
passes. A more inclusive definition of a test is a means of measuring the knowledge,
skills, feelings, intelligence or aptitude of an individual or a group. Tests produce
numerical scores which can be used to identify, classify or otherwise evaluate test
takers. While in practice most tests are in paper-and-pencil form, there are
many different kinds of tests and many different ways to classify them. The various
classification schemes overlap considerably, and categories are by no means
mutually exclusive. Any test can be classified on more than one dimension.
Response Behaviours

The term response behaviours refers to the way in which the behaviours to be
measured are exhibited. While in some cases responses to questions or other stimuli
are given orally, usually they are either written or take the form of an actual
performance.

A. Written responses

Written tests can be classified as either essays (subjective) or objective and standardized
or locally-developed.

Essay vs. Objective Tests

i) Essays
An essay test is one in which the number of questions is limited and
responders must compose answers, typically lengthy, e.g., ‘identify and
discuss the major reforms in African education during the colonial
period.’ Determining the goodness or correctness of answers to such
questions involves some degree of subjectivity.

ii) Objective tests
An objective test is one for which subjectivity in scoring is eliminated,
at least theoretically. In other words, anyone scoring a given test should
come up with the same score. Examples of objective tests are multiple-choice
tests, true-false tests, matching tests, and short-answer tests.

Standardized vs. Locally-Developed Tests

i) Standardized Tests

A standardized test is one that is developed by subject matter and
measurement specialists, that is field-tested under uniform administration
procedures, that is revised to meet certain criteria and scored and
interpreted using uniform procedures and standards. Standardization
permeates all aspects of the test to the degree that it can be administered
and scored exactly the same way every time it is given. Although other
measurement instruments can be standardized, most standardized tests
are objective, written tests requiring written responses. Although exceptions
occur, the vast majority of standardized tests have been administered
to a group referred to as the norm group. The performance of a norm
group on a given test serves as the basis of comparison and interpretation
for other groups to whom the test is administered. The scores of a
norm group are called norms. Ideally, the norm group is a large, well-defined
group which is representative of the group and subgroups for
whom the test is intended.

ii) Locally-developed test

The opposite of a standardized test is obviously a non-standardized test.
Such tests are usually developed locally for a specific purpose. The tests
used by teachers in the classroom are examples of locally-developed tests.
Such tests do not have the characteristics of standardized tests. A
locally-developed test may be as good as a standardized test, though this is
not often the case. Even so, use of a locally-developed test is usually more
practical and more appropriate: it is likely to reflect what was actually
taught in the classroom to a greater degree than a standardized test.

B. Performance Tests

For many objectives and areas of learning, use of written tests is an inappropriate
way of measuring behaviour; you cannot determine how well a student can type a
letter, for example, with a multiple-choice test or an open question. Performance is
one of these areas. Performance can take the form of a procedure or a product. A
procedure is a series of steps, usually in a definite order, executed in performing
an act or a task. Examples include adjusting a microscope, passing a football,
setting margins on a typewriter, drawing geometric figures, and calculating a sum
of figures in Excel. A product is a tangible outcome or result. Examples of a product
include a typed letter, a painting, a poem, and a science project. In either case the
performance is observed and rated in some way. Thus, a performance test is one which
requires the execution of an act or the development of a product in order to determine
whether, or to what degree, a given ability or trait exists.

Data Collection Methods

There are many different ways to collect data and classifying data collection
methods is not easy. However, a logical way to categorize them initially is in terms
of whether the data are obtained through self-report or observation.

a) Self Report

Self-report data consist of oral or written responses from individuals. An
obvious type of self-report data is that resulting from the administration
of standardized or locally-developed written tests, including certain
achievement, personality and aptitude tests. Another type of self-report
measure used in certain evaluation efforts is the questionnaire, an
instrument with which you are probably familiar. Also, interviews are
sometimes used.

b) Observation

When observation is used, data are collected not by asking but by
observing. A person being observed usually does not write anything;
he or she does something and that behaviour is observed and recorded.
For certain evaluation questions, observation is clearly the most appropriate
approach. To use an example, you could ask students about their sportsmanship
and you could ask teachers how they handle behaviour
problems, but more objective information would probably be
obtained by actually observing students at sporting events and teachers
in their classrooms. Two types of observation which are used in evaluation
efforts are natural observation and observation of simulation. Certain
kinds of behaviours can only be observed as they occur naturally. In
such situations, the observer does not control or manipulate anything,
and in fact works very hard at not affecting the observed situation in
any way. As an example, classroom behaviour can best be addressed
through observation. In simulation observation, the evaluator creates
the situation to be observed and tells participants what activities they
are to engage in. This technique allows the evaluator to observe behaviour
which occurs infrequently in natural situations or not at all.

c) Rating scale

A rating scale is an instrument with a number of items related to a given
variable, each item representing a continuum of categories between two
extremes; persons responding to the items place a mark to indicate their
position on each item. Rating scales can be used as self-report or as
observation instruments, depending on the purpose for which they are used.
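The scoring of such an instrument can be sketched in code. This is a minimal illustration, not taken from the text: the item values, the 1–5 continuum, and the reverse-keyed item are all invented for the example.

```python
# Minimal sketch of scoring a rating-scale instrument (illustrative data).
# Each item is a 1-5 continuum between two extremes; a respondent marks one
# position per item, and the instrument score is the sum of the positions.

def score_rating_scale(responses, reverse_items=()):
    """Sum 1-5 responses, flipping reverse-keyed (negatively worded) items."""
    total = 0
    for i, value in enumerate(responses):
        if not 1 <= value <= 5:
            raise ValueError(f"item {i}: response {value} outside 1-5 continuum")
        total += 6 - value if i in reverse_items else value
    return total

# A hypothetical 4-item attitude scale where item 2 is negatively worded:
print(score_rating_scale([4, 5, 2, 3], reverse_items={2}))  # 4 + 5 + (6-2) + 3 = 16
```

Reverse-keying is included because attitude scales commonly mix positively and negatively worded items to discourage uniform responding.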

Behaviours Measured

Virtually all behaviours that can be measured fall into one of three major
categories: achievement, character and personality, and aptitude. Tests in all
of these categories can be standardized or locally-developed. The categories
also apply equally well to the three domains of educational outcome, namely
cognitive, affective and psychomotor.

1. Achievement

Achievement tests measure the current status of individuals with respect to
proficiency in given areas of knowledge or skill. Achievement tests are appropriate
for many types of evaluation besides individual student evaluation. An achievement
test can be standardized (designed to cover content common to many classes of the
same type) or locally-developed (designed to measure a particular set of learning
outcomes set by a specific teacher). Standardized tests are, in turn, available for
individual curriculum areas or in the form of batteries, which measure achievement
in several different areas.
A diagnostic test is a type of achievement test which yields multiple scores for
each area of achievement; these scores facilitate identification of specific areas
of deficiency or learning difficulty. Items in a diagnostic test are intended to
identify skills and knowledge that students must have achieved before they proceed
to another level. Ideally, diagnosis should be an ongoing process, and the teacher
should design such tests so that they help him/her find out the problems that
students encounter as they proceed in the learning process.

2. Character and Personality

Tests of character and personality are designed to measure characteristics of individuals
along a number of dimensions and to assess feelings and attitudes toward
self, others, and a variety of activities, institutions and situations. Most of the
tests of character and personality are self-report measures and ask an individual
to respond to a series of questions or statements. There are instruments in this
category which are designed to measure personality, attitudes, creativity, and
interest of students.

3. Aptitude

Aptitude tests are measures of potential. They are used to predict how well
someone is likely to perform in a future situation. Tests of general aptitude are
variously referred to as scholastic aptitude tests, intelligence tests, and tests of
general mental ability. Aptitude tests are also available to predict a person’s likely
level of performance after following some specific future instruction or training.
Aptitude tests are available in the form of individual tests on specific subjects
or content, or in the form of batteries. While virtually all aptitude tests are
standardized and administered as part of a school testing program, the results are
useful to teachers, counsellors and administrators. Readiness tests (or prognostic
tests) are administered prior to instruction or training in a specific area in order
to determine whether, and to what degree, a student is ready for, or will profit
from, the instruction. Readiness tests, which are a type of aptitude test, typically
include measurement of variables such as auditory discrimination, visual
discrimination and motor ability.

Performance Standards

Performance standards are the criteria to which the results of measurement are
compared in order to interpret them. A test score in and of itself means nothing.
If I tell you that Ahmed got 18 correct, what does that tell you about Ahmed’s
performance? Absolutely nothing. Now if I tell you that the average score for the
test was 15, at least you know that he did better than average. If instead I tell
you that a score of 17 was required for a mastery classification, you don’t know
anything about the performance of the rest of the class, but you know that Ahmed
attained mastery. These are the two ways in which we can interpret the results
of a test: by comparing a score to those of other students in the class (that is,
norm-referenced measurement) or by comparing it to a pre-determined criterion
(that is, criterion-referenced measurement).
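The Ahmed example can be worked through in code. Only the score of 18, the class average of 15 and the mastery cutoff of 17 come from the text; the individual class scores below are invented so that the average comes out to 15.

```python
# Interpreting the same raw score under the two performance standards.
# Class scores are hypothetical, chosen so the average is 15 as in the text.
scores = [12, 13, 15, 15, 16, 16, 18]
ahmed = 18
mastery_cutoff = 17  # pre-determined criterion for a mastery classification

# Norm-referenced: compare Ahmed's score with the rest of the class.
average = sum(scores) / len(scores)
print(f"class average: {average}")                                   # 15.0
print("above average" if ahmed > average else "not above average")   # above average

# Criterion-referenced: compare Ahmed's score with the absolute
# standard only; the other scores play no role in the interpretation.
print("mastery attained" if ahmed >= mastery_cutoff else "mastery not attained")
```

Note that the two judgements are independent: changing the rest of the class would alter the norm-referenced verdict but leave the criterion-referenced one untouched.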

Norm-referenced standards

Any test, standardized or locally-developed, which reports and interprets each
score in terms of its relative position with respect to other scores on the same test,
is norm-referenced. If your total IQ score is 100, for example, the interpretation
is that your measured intelligence is average – average compared to the scores of a
norm group. The raw scores resulting from administration of a standardized test
are converted to some other index which indicates relative position. One such
equivalent technique familiar to you is the percentile. A given percentile indicates
the percentage of scores that were lower than the percentile’s corresponding
score. For example, Mr. Ahmed might have scored at the 42nd percentile on a
standardized math test, which means that 42% of the students who took that test
scored below Ahmed. In this way of communicating student scores, there is no
indication of what Mr. Ahmed knows or does not know. The only interpretation is in
terms of Ahmed’s achievement compared to the achievement of others.
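The percentile rank described above is a simple calculation. The sketch below uses invented scores; only the 42nd-percentile figure comes from the example in the text.

```python
# Percentile rank: the percentage of scores on the test that fall
# below a given score. All raw data here are invented for illustration.

def percentile_rank(score, all_scores):
    below = sum(1 for s in all_scores if s < score)
    return 100 * below / len(all_scores)

# 50 hypothetical test takers; 21 of them scored below Ahmed's raw score,
# so Ahmed sits at the 42nd percentile, as in the example above.
all_scores = [55] * 21 + [62] + [70] * 28
print(percentile_rank(62, all_scores))  # 42.0
```

This uses the "strict" definition (strictly lower scores only); some scoring services instead count half of the ties at the given score, which shifts the result slightly.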
Norm-referenced tests are based on the assumption that measured traits involve
normal curve properties. The idea of the normal curve is that measured traits
exist in different amounts in different people. Some people have a lot of it, some
people have little of it, and most have some amount called the ‘average’ amount.
For example, if you administer a math test to a class of 100 students, and the
test is of appropriate level – that is, neither too easy nor too difficult – a
small portion of the class will perform high and another, equal portion will
perform low, while the majority will perform around the average score.


Criterion-referenced standards

Any test which reports and interprets each score in terms of an absolute standard
is criterion-referenced. In other words, the interpretation of one person’s score
has nothing to do with anybody else’s score. A score is compared with a standard
of performance, not with the scores of other people. When criterion-referenced
tests are used, everyone taking the test may do well or everyone may do poorly.
In this context, a criterion can be defined as a domain of behaviours measuring
an objective.