Using latent semantic indexing for literature based discovery
Michael D. Gordon
Computer and Information Systems, School of Business, University of Michigan, Ann Arbor, MI 48109-1234. E-mail: [email protected]

Susan Dumais
Microsoft Research, Redmond, WA 98052. E-mail: [email protected]

Latent semantic indexing (LSI) is a statistical technique for improving information retrieval effectiveness. Here, we use LSI to assist in literature-based discoveries. The idea behind literature-based discoveries is that different authors have already published certain underlying scientific ideas that, when taken together, can be connected to hypothesize a new discovery, and that these connections can be made by exploring the scientific literature. We explore latent semantic indexing’s effectiveness on two discovery processes: uncovering “nearby” relationships that are necessary to initiate the literature-based discovery process; and discovering more distant relationships that may genuinely generate new discovery hypotheses.

Received January 31, 1996; revised April 30, 1997; accepted April 30,
© 1998 John Wiley & Sons, Inc.
JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE. 49(8):674–685, 1998

Introduction

Literature-based discovery uses the published scientific literature as a source of new discovery. First discussed by Swanson (1986) in connection with Raynaud’s disease, the problem can be characterized in this way: Beginning with the literature, R (for Raynaud’s), on some subject, can you identify the literature on another subject that helps in better understanding R, even though no one has ever thought that these two subjects were related?¹ In a series of papers, Swanson (1986a, 1986b, 1987, 1988a, 1988b, 1989a, 1989b, 1989c, 1990a, 1990b, 1991, 1993) showed this could be done, both by intensive reading and study, and by semiautomatic methods involving text analysis. Subsequently, Gordon and Lindsay (1996) replicated Swanson’s results and used other statistical methods to help automate the literature discovery process.

¹ Literature based discoveries generate scientific hypotheses; conventional scientific research must be conducted if the hypothesis is to be confirmed.

As described by Swanson, there are two basic literature discovery processes. The first leads from the literature (R) associated with an initial topic to the literatures (I) of one or more related, intermediate topics. The second leads from one of these related topics to the literature (PD) associated with a potential discovery. Figure 1 illustrates these two steps (left to right). We call these two processes identifying intermediate literatures and identifying potential discovery literatures, respectively (Fig. 1). Our interest is learning if latent semantic indexing (Deerwester et al., 1990), a statistical technique used with success in information retrieval, can help with either or both of these processes.

Identifying Intermediate Literatures

By definition, if we start with Raynaud’s and discover a brand new concept (cure, cause, treatment, or physiological process) never before reported, there will be no document that discusses both Raynaud’s and this new concept. But there may be a topic that is discussed along with Raynaud’s and is also discussed along with the new concept, even though no single article on this topic discusses both. A literature that serves as such a bridge is an intermediate literature.

Finding intermediate literatures, then, is a central problem in literature-based discovery. Of course, one can read about Raynaud’s and form impressions on that basis, but a systematic approach for identifying intermediate literatures would be more efficient and possibly more effective.

The following is an example of a MEDLINE record containing the term Raynaud’s (with slight cosmetic modifications to illustrate more plainly the record’s structure):

TITLE: Localized real-time blood flow measurements.
AUTHOR: van As H; Brouwers AA; Snaar JE
CITE: Arch Int Physiol Biochim 1985 Dec; 93(5): 87–
ABSTRACT: A novel method for real time, localized, flow measurements is applied to blood flow in human fingers. Results for arterial and venous flow in normal subjects and patients with abnormal blood circulation are presented. Effects of blood flow regulation by the autonomic nervous system have been observed. Stricture of the digital arteries could be clearly demonstrated in a patient with Raynaud’s phenomenon. Experimental signals due to pulsatile flow in a model system can be simulated in a quantitative way. The calibration, however, depends on the actual spin–spin relaxation time and the shape of the pulsatile flow vs. time curve. Due to these limitations, the volume flow rate can be measured with a relative error of approximately +/-25%. (AUTHOR)
MINOR TERMS: Fingers BS. Human. Nuclear Magnetic

Fig. 1. The two steps in literature-based discovery.

Table 1. Four statistics used to identify intermediate literatures.
  Token frequency: number of tokens(b) of X within R
  Record frequency: number of records in R containing X
  For a term or phrase, X, these four statistics may be calculated in relation to the Raynaud’s literature, R.
  (a) Strictly, frequencies should be ratios, but the normalizing denominators in these statistics may be dropped since what is important is term (or phrase) rank orderings, which are identical with and without them.
  (b) Token frequencies count each distinct occurrence of a word (or phrase). For instance, in the sentence “Row, row, row your boat gently down the stream,” the token frequency for row is 3; for gently it is 1. On the other hand, the record frequency of both of these items is incremented by 1.

Among other non-“noise” words, this record contains blood and flow (from the title), flow, blood, fingers, etc. (from the abstract), plus other words from the remaining MEDLINE record fields. Similarly, the two-word adjacency phrases in this MEDLINE record include localized real, real time, time blood, blood flow, flow measurements (from the title), novel method, real time, time localized, localized flow, flow measurements, blood flow, and blood circulation (from the abstract). Standard information retrieval techniques can eliminate from consideration nonsubstantive words, such as a, for, and is, and can use sentence punctuation to prevent the inclusion of false phrases such as fingers results (from the abstract).

Gordon and Lindsay (1996) have investigated automated processes for supporting the identification of intermediate literatures from MEDLINE records such as these that are based on descriptive statistics similar to those used in information retrieval. Specifically, to identify intermediate literatures related to the topic Raynaud’s, they downloaded the full MEDLINE records for all 1983–1985² documents that mention Raynaud’s, parsed them as described for the sample record, and then computed the statistics shown in Table 1 for every term and two-word adjacency phrase.

² This date range was the same one that Swanson used and supported Gordon and Lindsay’s replication of Swanson’s results by new methods.

For the MEDLINE record shown above, the word time had a token frequency of 4; localized had a token frequency of 2; and Raynaud’s had a token frequency of 1. Similarly, the phrase blood flow had a token frequency of 4, whereas blood circulation had a token frequency of 1. For this single MEDLINE record, the record frequency for each of these words and phrases is 1.

Table 2 gives an example of the four statistics that would be computed for the term (or phrase) X, which occurs both within and outside the Raynaud’s subset of MEDLINE.

Gordon and Lindsay (1996) used these statistics to try to identify intermediate literatures for further exploration. After calculating each of the four statistics for every term or two-word adjacency phrase in a downloaded literature (such as Raynaud’s), they identified the twenty (or thirty) items with the highest values for each statistic. They then considered each of these items to be a query that could be used to identify a different intermediate literature. Though the methods used were highly automated, the intended use of these methods was to provide support for a qualified medical researcher who could most effectively interpret and act upon the data provided.

In examining the four separate lists of highest-ranked items, Gordon and Lindsay concluded that three of the statistics (token frequency, record frequency, and token frequency * inverse global record frequency (igf)) were extremely predictive of each other. If a particular term or phrase, such as blood, was among the top 20 positions on one of the lists, very likely it was among the top 20 of another list as well. As a specific example, in analyzing the Raynaud’s literature the four statistics were computed for each of the approximately 2,000 single-word terms that occurred at least four times in that literature. If the terms on the top 20 list for one statistic were statistically independent of those on another, a fractional number (about 20 × 20 / 2,000 = 0.2 terms in expectation) should appear on both lists. What was observed, instead, was that the token frequency and record frequency lists
had fifteen (of twenty) items in common; the token frequency and token frequency * igf lists had seventeen; and the record frequency and token frequency * igf had fifteen. In other words, an item’s appearance on the top 20 list for one statistic was highly correlated with its appearance on the top 20 list of the other two. The same conclusion held when the number of items per list was increased; when two-word adjacency phrases were considered rather than single-word terms; and when literatures other than Raynaud’s were examined.

Table 2. Fictitious numerical example showing calculation of four statistics for term
  Document collection characteristics (number of documents): R = documents mentioning Raynaud’s; subset of R mentioning term (phrase) X
  Token characteristics (number of token occurrences)
  Statistics (value of statistic): tf * inverse global record frequency (tf * igf)

There was not nearly the same degree of correlation between a term’s occurrence on the top 20 list for relative frequency and its occurrence on the top 20 list of another statistic. Again considering Raynaud’s as an example, the top 20 items sorted by relative frequency included one item in common with the top 20 token frequency items; one item in common with the top 20 record frequency items; and one in common with the top 20 token frequency * igf items. This pattern held for single words and two-word adjacency phrases, when the top n size was adjusted (to values other than 20), and when different literatures were examined.

Not only were the token frequency, record frequency, and tf * igf lists quite similar, but they were effective in uncovering intermediate literatures on a discovery path from Raynaud’s to fish oil. By looking at the very top items on any of the three lists, one was led from Raynaud’s (the starting point) to the topic blood. Then, by downloading and analyzing the literature on the topic blood AND Raynaud’s, one was led directly by any of the three statistics to the topic blood viscosity (see Fig. 2). Blood viscosity is indeed an intermediate, or “bridge,” literature: It is mentioned in the Raynaud’s literature and is clearly accepted scientifically as being related to Raynaud’s. It is also mentioned in the fish oil literature, and is scientifically related to that as well. Indeed, there are physiological connections implicating fish oil as a treatment for Raynaud’s, including that fish oil reduces blood viscosity and that increased blood viscosity is one of the reasons Raynaud’s patients suffer symptoms associated with peripheral blood deficiency. Despite this, the hypothesis that Raynaud’s might be treated by fish oil lay dormant in the literature until Swanson (1986a, 1986b, 1987) uncovered it by methods of literature-based discovery.

Fig. 2. Raynaud’s and two intermediate literatures.

To summarize, Gordon and Lindsay (1996) demonstrated three statistics that were useful for uncovering intermediate literatures to support literature-based discovery: token frequency, record frequency, and token frequency * inverse global record frequency. Each of them separately rank-ordered large lists of terms (and phrases) in quite similar ways. And from the starting point (Raynaud’s in this example), a medical researcher using these statistics could be led first to blood, and then to blood viscosity, by any of these three statistics (Fig. 2). Gordon and Lindsay argued that an effective method for identifying an intermediate literature is finding one with strong conceptual similarity to the starting point and that each of the three correlated statistics can serve this purpose, since each has lexical prominence in the Raynaud’s literature.

Latent semantic indexing (Deerwester et al., 1990) offers an entirely different way potentially to identify intermediate literatures and, thus, to support literature-based discovery. A standard term by document matrix, D, is mathematically equivalent to the product of three other matrices, as shown in Figure 3. M is a matrix of singular values computed by a “factoring” process (singular value decomposition; Forsythe et al., 1977)
that expresses each of the t original indexing terms and also each of the d original documents as a vector of m factors (where m is the number of linearly independent rows, and columns, in D). Technically and intuitively, each of the original indexing terms is now expressed as a vector of statistically independent factors (and represented by a row of the Terms matrix); each document is similarly represented by a column of the Docs^T matrix. In other words, by means of singular value decomposition, terms and documents are represented in the same space.

Fig. 3. Decomposition of term by document matrix.

The great benefit of representing D as a product of three matrices is that we can consider a representational space containing just the k < m most important of these dimensions, for k of any size. We can then approximate D as

$$D \approx \hat{D} = \mathit{Terms} \times M \times \mathit{Docs}^{T}$$

where Terms is t × k; M is k × k; and Docs^T is k × d.

The result is an optimal reduced dimensional approximation of D (by a criterion of least squares). Practically, this means that two documents that use strongly overlapping vocabulary may both be retrieved even if a particular query only uses the terms that index one of them. Similarly, terms will be considered “close” to each other if they occur in overlapping sets of documents.

Figure 4 suggests the way latent semantic indexing assists in information retrieval, using term co-occurrences to give support for document similarity. Pretend that the three documents shown are part of a larger collection where term-a and term-b tend to be used together in indexing documents, as do term-b and term-c. Then, the query term-b may still retrieve Doc-1, even though Doc-1 is not indexed by that term. Similarly, the query term-c may retrieve Doc-1 by virtue of “transitive” co-occurrence. In other words, term-c co-occurs often with term-b, which co-occurs with term-a. This gives support for retrieving Doc-1 for the query term-c. This is the ordinary spirit in which latent semantic indexing is used: to find similarity among documents based on their indexing, and thus retrieve documents that do not exactly match a query.

Fig. 4. Doc-x indexed by term-y is represented by X → Y.

However, with equal applicability, latent semantic indexing can uncover relationships among terms. For instance, the terms term-a and term-b demonstrate semantic similarity by occurring together in Doc-2. Similarly, term-a and term-c will bear a transitive, but measurable, similarity to each other when a collection like the above is represented by means of latent semantic indexing.

This latter perspective suggests that, perhaps, latent semantic indexing provides an alternative approach to uncovering intermediate literatures. Specifically, if terms such as Raynaud’s are thought to stand for underlying concepts (the concept Raynaud’s disease), then we can see which terms lie near each other in LSI-space and, thus, make inferences about conceptual similarity.

To test the usefulness of this approach, we began with the 560 documents published during the years 1983–1985 containing mention of the term Raynaud’s, the same documents used by Gordon and Lindsay and by Swanson. LSI scaling was then performed on this set of documents, and the top 100 factors were retained (k = 100). Each document, as well as each term used in any document, was thus represented as a vector in the same 100 dimensional space.

A central interest of ours was to determine if this method produced substantially different (possibly better) results than Gordon and Lindsay’s method of selecting intermediate literatures on the basis of token counts, record counts, and tf * igf statistics.

Table 3. The 50 nearest neighbors to Raynaud’s by LSI that were also identified by Gordon and Lindsay’s statistical methods.

A fairly crude measure of the similarity between the two methods of generating items associated with Raynaud’s is to consider their overlap. To do this, a single list of items representing the “best” intermediate items by Gordon and Lindsay’s method was developed by taking the union of:

• the top 40 terms, according to record counts
• the top 40 terms, according to token counts
• the top 40 terms, according to tf * igf
• the top 40 two-word phrases, according to record counts
• the top 40 two-word phrases, according to token counts
• the top 40 two-word phrases, according to tf * igf

This union contained 136 unique items (an item being either a term or a two-word phrase).
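The counting behind such lists can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not Gordon and Lindsay's implementation: the noise-word list, the tokenizer, and the helper names (`terms_of`, `phrases_of`, `frequencies`, `top_n`) are hypothetical, and tf * igf is left out because its exact definition is not given in the text above.

```python
import re
from collections import Counter

# Hypothetical noise-word list; a real system would use a fuller one.
STOPWORDS = {"a", "for", "and", "is", "the", "your", "in", "to", "of"}

def _sentences(text):
    # Sentence punctuation (not commas) bounds phrases.
    return [s for s in re.split(r"[.;:!?]", text.lower()) if s.strip()]

def terms_of(text):
    """Single-word terms of one record, noise words removed."""
    return [w for s in _sentences(text)
            for w in re.findall(r"[a-z']+", s) if w not in STOPWORDS]

def phrases_of(text):
    """Two-word adjacency phrases that contain no noise word."""
    out = []
    for s in _sentences(text):
        words = re.findall(r"[a-z']+", s)
        out += [f"{a} {b}" for a, b in zip(words, words[1:])
                if a not in STOPWORDS and b not in STOPWORDS]
    return out

def frequencies(records, pick):
    """Token and record frequency of every item that pick() extracts."""
    token_freq, record_freq = Counter(), Counter()
    for text in records:
        found = pick(text)
        token_freq.update(found)        # every occurrence counts
        record_freq.update(set(found))  # at most once per record
    return token_freq, record_freq

def top_n(counter, n):
    return [item for item, _ in counter.most_common(n)]

records = ["Row, row, row your boat gently down the stream.",
           "Localized real-time blood flow measurements."]
term_tf, term_rf = frequencies(records, terms_of)
phrase_tf, phrase_rf = frequencies(records, phrases_of)

# Union of the top-n lists, in the spirit of the union described above.
union = (set(top_n(term_tf, 40)) | set(top_n(term_rf, 40))
         | set(top_n(phrase_tf, 40)) | set(top_n(phrase_rf, 40)))
```

On the first toy record, row has token frequency 3 and record frequency 1, matching the example in the note to Table 1. Splitting on sentence punctuation keeps false phrases such as fingers results from being formed.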
The 136 nearest neighboring terms to the term Raynaud’s according to the LSI analysis were then identified. This was done by rank-ordering all terms by their cosine to the term Raynaud’s. The size of the intersection of the two lists of 136 items was 57 items (approximately 42%).
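A rough sketch of this ranking step, assuming NumPy is available: the term-by-document matrix is factored with a singular value decomposition, the k largest factors are kept, and terms are ordered by cosine to a target term. The toy matrix, the labels, and the choice to scale term vectors by the singular values are assumptions made for illustration, not details confirmed by the text.

```python
import numpy as np

def lsi_term_vectors(D, k):
    """Rows are terms placed in a k-dimensional LSI space (D is t x d)."""
    U, s, _ = np.linalg.svd(np.asarray(D, dtype=float), full_matrices=False)
    return U[:, :k] * s[:k]  # scale factors by singular values (assumption)

def neighbors_by_cosine(vectors, labels, target):
    """Other terms, ordered from largest to smallest cosine to target."""
    norms = np.maximum(np.linalg.norm(vectors, axis=1, keepdims=True), 1e-12)
    unit = vectors / norms
    cos = unit @ unit[labels.index(target)]
    return [labels[i] for i in np.argsort(-cos) if labels[i] != target]

# Toy term-by-document matrix (hypothetical values, 4 terms x 4 documents).
labels = ["term-a", "term-b", "term-c", "term-d"]
D = [[1, 1, 0, 0],
     [1, 1, 0, 0],
     [0, 0, 1, 1],
     [0, 0, 1, 0]]
ranked = neighbors_by_cosine(lsi_term_vectors(D, k=2), labels, "term-a")

# Overlap between two item lists, as in the comparison described above.
other_list = ["term-b", "term-x"]
overlap = set(ranked[:2]) & set(other_list)
share = 57 / 136  # the counts reported in the text; about 0.42
```

In the toy matrix term-a and term-b are used in exactly the same documents, so term-b is the nearest neighbor of term-a. With real data the two lists being intersected would be the 136 LSI neighbors and the 136-item union.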
The best LSI-ranked items (i.e., those with lowest ranks) were most likely to be in the Gordon and Lindsay list. Table 3 shows all of the top 50 LSI-ranked terms that also appeared in the Gordon and Lindsay list. Practically every item very close to the term Raynaud’s in LSI space was identified by the Gordon and Lindsay methods. In particular, of the top 10 items nearest Raynaud’s according to LSI, Gordon and Lindsay’s methods identified nine. Of the top 20 nearest items (by LSI methods), 15 were identified by Gordon and Lindsay’s methods; of the top 30, 21; of the top 40, 27; and of the top 50, 31 (Fig. 5). Further, in just examining the very highest-ranked items (those that each method recommends most strongly as an intermediate literature), we find that each of the top 10 from Gordon and Lindsay is among the top 12 items by LSI. In other words, these two lists’ top 10 items are nearly permutations of each other. In addition, the very highest items in one list tend to be right at the top of the other.

Fig. 5. Percentage of top N LSI items identified by Gordon and Lindsay.

In fact, a more sensitive analysis was conducted to test for a correlation between the top LSI rank positions and the top Gordon and Lindsay ranks. Since the Gordon and Lindsay list was the union of six separate lists and an item could come from one or more of them, it would not have a unique rank across different lists. Arbitrarily, then, we selected the Gordon and Lindsay two-word phrase list ranked by record counts to provide ranks for use in our analysis. These 40 items were Spearman rank-correlated with the 40 highest-ranking two-word phrases identified by LSI scaling (retaining k = 200 factors). A two-word phrase that occurred in one list but not in the other was assigned a rank of 41 in the list in which it did not appear. The null hypothesis tested was that the top 40 ranks of the Gordon and Lindsay and the LSI lists were uncorrelated. Data and results are shown in Tables 4 and 5.

By this analysis, we can conclude that the top 40 Gordon and Lindsay two-word phrases (by record counts) are rank-correlated with those found by LSI (even if assigning ranks of 41 to items not appearing on a list may suppress slightly the effects of outlying ranks). More simply, an item’s approximate position on the Gordon and Lindsay list (whether near the top, in the middle, or near the bottom) will predict its approximate position on the LSI list.

Of course, other methodological approaches could be taken to compute rank correlations, including forming an “average rank” for each Gordon and Lindsay two-word phrase (based on its three separate ranks). However, because the three statistics Gordon and Lindsay used to determine intermediate literatures were so strongly correlated, this is unlikely to affect our finding in any appreciable way.

One surprising observation from Table 4 deserves a comment. The phrase double blind is the best-ranked phrase in LSI but is not among the top 40 items from the Gordon and Lindsay analysis (it had rank 45, occurring in 11 records). A possible explanation is that the term Raynaud’s lies near the phrase double blind in MEDLINE. More likely, the prominence of double blind may be somewhat coincidental and actually result from the fact that the phrase occurred in just 11 of the 560 Raynaud’s documents analyzed (14 times in total), but was near Raynaud’s by tending to co-occur with all of its chief
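The rank-correlation analysis described above (two top-40 lists, with rank 41 assigned to an item wherever it is missing) can be sketched in plain Python. How the original analysis treated tied ranks is not stated; the version below gives tied items the mean of their positions, which is an assumption, and the function names are hypothetical.

```python
def average_ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for idx in order[i:j + 1]:
            ranks[idx] = mean_rank
        i = j + 1
    return ranks

def pearson(x, y):
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def spearman_top_n(list_a, list_b, n=40):
    """Spearman correlation over the union of two top-n lists.

    An item missing from one list is given rank n + 1 in that list.
    """
    a, b = list_a[:n], list_b[:n]
    items = sorted(set(a) | set(b))
    ranks_a = [a.index(x) + 1 if x in a else n + 1 for x in items]
    ranks_b = [b.index(x) + 1 if x in b else n + 1 for x in items]
    return pearson(average_ranks(ranks_a), average_ranks(ranks_b))

# Hypothetical three-item lists; "y" and "w" each appear in only one list.
rho = spearman_top_n(["x", "y", "z"], ["x", "z", "w"], n=3)
```

Because every missing item receives the same rank, long lists with little overlap produce many ties at rank n + 1, which is one way such an assignment can dampen the effect of outlying ranks.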
What conclusions do these various analyses suggest? Principally, that there is a strong overlap among the terms uncovered by LSI scaling and by the Gordon and Lindsay techniques, and that this overlap is strongest among the very-top-ranked items by each method. Gordon and Lindsay have argued that the best terms for identifying intermediate literatures are those very close (semantically and statistically) to the starting point. By this argument, the two methods may provide similar, but complementary, approaches for identifying intermediate concepts.

In the next section, we change our focus and discuss the use of LSI for identifying potential discovery literatures.

Identifying Potential Discovery Literatures

A connotation underlying the phrase latent semantic indexing is that hidden relationships among concepts exist and, further, that they may be teased out statistically. Figure 4 has already illustrated how the concept identified by term-c may bear a latent relationship to the concept identified by term-a because both terms co-occur with term-b.

Is it possible, then, that LSI can form a bridge that connects two bibliographically isolated literatures? From Swanson’s work (1986a, 1986b, 1987), for example, we know that the concept blood viscosity is scientifically related to both Raynaud’s and to fish oil, but that neither the Raynaud’s nor the fish oil literature refers to each other, nor are they mentioned together by other documents.

Suppose, however, that one had conjectured that blood viscosity is an important intermediate literature linking Raynaud’s and some unknown cause, cure, or treatment for this condition. This conjecture may have been stimulated by using a literature-based discovery tool, or it may have arisen simply by reading and thinking about Raynaud’s. Figure 6 may help make this clearer. The suggestion is that a concept³ that is related to blood viscosity but not directly to Raynaud’s may be uncovered through LSI.

³ Implicitly, we are assuming that a term used in text represents the concept with the same name. Accordingly, the term Raynaud’s would

Since blood viscosity is conjectured to be a bridge to some unknown discovery, it can be the focus for LSI scaling. Selecting the blood viscosity literature to perform LSI processing on would certainly appear to be an advantage in finding hidden connections to Raynaud’s since blood viscosity is, in fact, a bridge to a hidden treatment (fish oil). To test the effectiveness of this use of latent semantic indexing, we proceeded as follows. The 809 MEDLINE records published between 1983 and 1985 and mentioning blood viscosity were downloaded and LSI-processed (retaining the k = 100 most important factors). A list of closest neighbors to the term Raynaud’s was then constructed according to their cosine to Raynaud’s, but no element on the list could appear in any of the 560 Raynaud’s documents from the same period. In other words, we constructed a list of terms that were “near” Raynaud’s (from the perspective of blood viscosity) but were nonetheless bibliographically disjoint from it. The items on this list would certainly seem worthy of further study.

A specific interest was whether the phrase fish oil would appear prominently on this list. More generally, we wanted to see which terminal concepts contained in
the list might suggest a new discovery about Raynaud’s. By way of an example, a substance such as aspirin was considered a terminal in the sense that it can be considered a possible cause, cure, or treatment for Raynaud’s. A topic such as tissue hypoxia (hypoxia means a decrease in normal levels of oxygen) might be related to Raynaud’s but cannot be considered a terminal concept by the same lines of reasoning; thus it was excluded from our list. The list of all Raynaud’s neighbors that were bibliographically disjoint from Raynaud’s and had a cosine (to Raynaud’s) of 0.005 or more was examined by hand to remove nonterminals.⁴ Table 6 shows the items that remained.

⁴ Currently, automatic processing of text is incapable of determining terminal concepts. Thus, identification of terminals must be conducted by hand. This manual step does not diminish our approach, whose objective is to support hypothesis discovery, not automate it. Terminals can rapidly be selected from lists of terms and phrases, especially by

Table 4. Top 40 phrases and ranks by LSI and Gordon and Lindsay

Table 5. Spearman rank correlation for LSI and Gordon and Lindsay

Ignoring the final column, each row in the table shows the value of cosine (Raynaud’s, terminal term); a terminal term’s “Rank in LSI, non-Raynaud Terms,” which only considers terms appearing in blood viscosity documents but not appearing in a Raynaud’s document; and a terminal term’s “Rank in LSI space, all Terms,” which tells how many terms had a larger cosine to Raynaud’s, including all blood viscosity terms and two-word phrases (in any of the 809 blood viscosity documents). For instance, 149 terms had a larger cosine than the term hydroxychloroquine, but hydroxychloroquine’s rank of six among non-Raynaud’s items means that there were only five higher-ranked items, each judged a nonterminal, that appeared in blood viscosity documents but not in Raynaud’s documents, including the items viscosities and motor activity. Notice that hydroxychloroquine is the only terminal term in Table 6 that has a cosine value above 0.10.

By definition, none of the terms in Table 6 appeared in any of the 560 Raynaud’s documents from 1983–1985; the three-year time span was chosen to correspond as closely as possible to the documents Swanson (1986a, 1986b, 1987) examined in his Raynaud’s studies. It is possible, of course, that some of the terms in Table 6 occurred along with the term Raynaud’s before 1983. Because we are, in effect, investigating Swanson’s literature-based discovery of the Raynaud’s–fish oil connection, we can ignore co-occurrences after 1985. So we queried MEDLINE to determine the number of documents containing both the term Raynaud’s and each one of the terms in Table 6 in any year before 1986. Results are shown in the last column of the table. This column indicates which terminal terms we can rule out as possible discoveries by virtue of having been already discussed with Raynaud’s (those with a nonempty intersection).

Table 6. Terminal concepts identified as Raynaud’s neighbors plus their MEDLINE intersections with Raynaud’s

In regard to our effort to discover directly the Raynaud’s–fish oil connection, the results are disappointing. Fish oil is on the list of nonintersecting items, but is nowhere near the term Raynaud’s (being its 1961st closest neighbor) and still behind almost 600 other terms that appear in the blood viscosity, but not Raynaud’s, literature. However, eicosapentaenoic acid, the active agent in fish oil, fares much better, being the 208th-ranked newly uncovered item in relation to Raynaud’s, and the fifth-ranked terminal when additional MEDLINE searching has ruled out terms and phrases that were mentioned along with Raynaud’s in articles written before 1983. But its cosine, at 0.016, is a faint signal.

Two other items with null intersection to Raynaud’s may deserve further study if experts in the field should confirm their merit. Calcium dobesilate has been used for vascular diseases and diabetic retinopathy, among other conditions. It has been shown to reduce blood viscosity, improve venous insufficiency, and reduce platelet deposition. Niceritrol has been used to treat hyperlipidemia, and, in addition, it has beneficial effects on blood viscosity and platelet aggregation. The effects of these drugs are related to treating Raynaud’s.

Fig. 6. Blood viscosity as intermediate literature.

It is interesting to note, too, that among the items in Table 6 which have nonempty intersection with Raynaud’s are substances such as isoxsuprine and dextran, which have been used to treat Raynaud’s. In addition, some of the nonterminal, but nonintersecting, items produced by the analysis suggest possible avenues to examine in connection with Raynaud’s. For instance, lysolecithin, an acid formed by an enzymatic process in the blood, is capable of breaking up red blood cells and thus may prove useful in treating Raynaud’s. This conjecture, too, can be appropriately evaluated by medical experts.

Table 7. Poisson approximations of size of topic in sample.

A variation on this approach to finding new connections to Raynaud’s is to find the nearest neighbor to the centroid of all documents comprising the Raynaud’s literature, instead of to the term Raynaud’s. The centroid of a cluster of items is its central value, and is computable in a variety of ways. Experiments showed that considering Raynaud’s to be a document centroid, rather than a term,
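Both variants discussed in this section, ranking bibliographically disjoint terms by cosine to a target term and using a document centroid as the target instead, can be illustrated with a small pure-Python sketch. The vectors and term labels are hypothetical, the centroid is taken as the component-wise mean (one of several possible definitions), and the 0.005 threshold mirrors the cutoff mentioned in the text.

```python
import math

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def centroid(vectors):
    """Component-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def disjoint_neighbors(target, term_vectors, seen_with_target, min_cosine=0.005):
    """Terms ranked by cosine to target, skipping terms in seen_with_target."""
    scored = [(cosine(target, vec), term)
              for term, vec in term_vectors.items()
              if term not in seen_with_target]
    kept = [(c, t) for c, t in scored if c >= min_cosine]
    return [t for c, t in sorted(kept, reverse=True)]

# Hypothetical 2-dimensional LSI vectors for four terms.
term_vectors = {"term-p": [1.0, 0.0], "term-q": [0.6, 0.8],
                "term-r": [0.0, 1.0], "term-s": [-1.0, 0.0]}
seen = {"term-p"}  # terms that already occur in the target literature

by_term = disjoint_neighbors([1.0, 0.0], term_vectors, seen)
by_centroid = disjoint_neighbors(centroid([[1.0, 0.0], [0.0, 1.0]]),
                                 term_vectors, seen)
```

In this toy case the term target keeps only term-q, while the centroid target keeps term-q and term-r, so the choice of target can change which items surface. Hand selection of terminals would still follow either ranking.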
Directly Identifying Potential Discovery Literatures
It is interesting to note that, since Raynaud’s and fish oil truly are medically related, and since this relationship can be detected by other methods, LSI does not directly uncover this latent association, especially since LSI scaling was performed on the blood viscosity literature, which

The problem may be one of scale. By analogy, a glance at a globe suggests that New York City and Boston are near each other. But they are anything but neighbors when considering only the northeast seaboard of the United States. The same may be true of the Raynaud’s–fish oil association. In the broad context of medicine, these concepts clearly are linked by the bridge of blood viscosity. Nevertheless, blood viscosity (used as the focus for LSI processing) may be an improper vantage from which to detect the association. We may need to “back up” to gain some perspective, just as we can only see that Boston and New York City are near each other when our perspective is the globe.

Table 7 note: Size of MEDLINE (1980–1985) = 780,000; S = sample size = 18,499; p = topic base rate in MEDLINE = n/780,000.
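The quantities in the Table 7 note lend themselves to a short calculation. The sketch below assumes the number of sampled documents on a topic of size n is approximately Poisson with mean S × p; the function names and the topic sizes 50 and 200 used in the usage lines are illustrative choices, not values taken from the table itself.

```python
import math

MEDLINE_SIZE = 780_000  # records, 1980-1985 (from the table note)
SAMPLE_SIZE = 18_499    # S, records in the sample

def expected_in_sample(n):
    """Poisson mean: S times the topic base rate p = n / MEDLINE_SIZE."""
    return SAMPLE_SIZE * n / MEDLINE_SIZE

def prob_absent(n):
    """Chance that a topic of size n has no document in the sample."""
    return math.exp(-expected_in_sample(n))

def prob_at_least(n, m):
    """Chance that the sample holds at least m documents on the topic."""
    lam = expected_in_sample(n)
    below = sum(math.exp(-lam) * lam ** i / math.factorial(i) for i in range(m))
    return 1.0 - below

absent_50 = prob_absent(50)            # about 0.305
two_plus_200 = prob_at_least(200, 2)   # about 0.95
```

Under this approximation a topic with 50 MEDLINE records would be missing from the sample roughly 30.5% of the time, while a topic with 200 records would contribute at least two sampled documents about 95% of the time.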
An experiment was conducted to explore this possibility by attempting to identify a potential discovery literature without first selecting an intermediate literature. In principle, we desired to analyze all of MEDLINE from the period 1980–1985 (nearly 780,000 records). For practical and computational reasons, this was not possible. Instead, we tried to obtain an approximately random sample of MEDLINE from the given date range. We did this by obtaining all MEDLINE records written in English, containing an abstract and at least one reference, and with a publication date between 1980 and 1985. By doing so, 18,499 records were identified and downloaded for processing (a sample of about 2.5% for the period). It is possible that including only English-language items in the sample may have introduced some bias, for instance in the area of pharmacology, where different areas of the world have approved different drugs for the same illness. This concern is reduced by noting that research performed in Europe and elsewhere around the world has a significant representation in the sample, since much scientific publication is in English. It is also possible, though unlikely, that the constraint that all records contain an abstract and reference(s) distorted the sample in some unintended fashion. Of course, the size of this sample means that some very small topics were likely excluded from it. For instance (see Table 7), if there are only 50 documents about a given topic in MEDLINE during the period 1980–1985, there is an approximately 30.5% chance that it will not be represented in the sample of 18,499 documents drawn. On the other hand, by the time a topic is of size 200, there is a 95% chance that the sample will contain at least two documents on that topic. All told, the method of sampling used likely provided a fair approximation to a genuinely random sample.

LSI processing proceeded along the lines already described: The set of documents and terms was represented by k = 300 orthogonal factors (as opposed to 100 in the previous experiment) to adjust for the larger collection size. In this space, there were just over 36,000 terms or phrases that were not among those mentioned in the 1983–1985 Raynaud’s document collection. From these new items, a list of the 1,000 closest neighbors to Raynaud’s was generated. When we then hand-selected terminals from this list, we obtained a list of 37 items.

To ensure that an item did not occur in a Raynaud’s document earlier than 1983, we consulted the entire MEDLINE document collection to find the number of documents published any time before 1986 that used both that item and the term Raynaud’s. The cosine, rank, and intersection data for the hand-selected terminal items are shown in Table 8.

[Table 8: Terminal concepts identified as Raynaud’s neighbors.]
* Items with two values, like 0 (13) for diltiazem hydrochloride, show (1) the size of the intersection with Raynaud’s of the entire phrase (diltiazem hydrochloride ∩ Raynaud’s = 0); and (2) the size of the intersection of its chief chemical constituent (diltiazem ∩ Raynaud’s = 13).

Among the list of items in Table 8 are those with already known connections to Raynaud’s, including methysergide, hydralazine, and isoxsuprine. Although these cannot be considered discoveries, their inclusion reinforces the idea that LSI processing can help detect possible treatments for Raynaud’s when a broad, unfocused literature (a random subset of MEDLINE) is processed without the benefit of a predefined connection, such as blood viscosity, to link them. Among the items in Table 8 are also substances never used before to treat Raynaud’s that may deserve exploration as Raynaud’s treatments if they were to pass the review of experts in medical therapeutics. These include vasodilating agents, such as perhexiline, diltiazem hydrochloride, nylidrin, and lidoflazine; drugs for treating ischemia, i.e., insufficient blood flow, such as dihydropyridine derivatives, including nitrendipine, gallopamil, nisoldipine, and bepridil; and antihypertensive drugs such as diazoxide, captopril, and

We emphasize again that these analyses and all others we have shown support, but do not automate, discovery, and that their appropriate interpretation should come from medical researchers familiar with the topic. For instance, several of the drugs mentioned in Table 8 are calcium channel blockers; and the nonterminal phrase calcium blocking has a very high cosine (0.573). So, without additional evidence to the contrary, a possibility is that calcium channel blockers may be effective in the treatment of Raynaud’s, and the nonterminal concept, calcium channel blocking, could itself be analyzed as an intermediate literature (its literature downloaded, parsed, and statistics computed) in the search for terminals disjoint from Raynaud’s. On the other hand, research pharmacologists familiar with calcium channel blockers might know, for example, that those that affect peripheral blood flow (such as nifedipine) have already been tested as treatments for
Raynaud’s, whereas the calcium channel blockers in Table 8 affect the heart, thus making them ineffective as Raynaud’s treatments. So for those with the requisite background knowledge, the computed statistics should help stimulate useful conjectures that may lead to the

None of the terms in Table 8 with empty intersection with Raynaud’s was among the list of nonintersecting terms in Table 6 (the equivalent table for the previous experiment, where the blood viscosity literature was LSI-processed). In fact, only one term, isoxsuprine, was common to both tables even when we consider both terms with empty and nonempty intersection with the term Raynaud’s. As we suspected, a MEDLINE focus for LSI has certainly produced a different set of Raynaud’s near neighbors than did a blood viscosity focus. In this sense, LSI processing of MEDLINE to search for potential discoveries directly is another tack to consider in attempting to uncover latent medical discoveries.

We considered a variation of this method in an attempt to adopt the MEDLINE focus for LSI processing while retaining some of the advantages of considering blood viscosity an intermediate literature: We restricted the list of items in Table 8 to those that were both bibliographically disjoint from Raynaud’s and present in the blood viscosity literature. Only three items met these criteria: methyldopa ad, methoxamine, and captopril ad. However, in looking at articles on methyldopa and captopril, we learned that both had been studied as a treatment for Raynaud’s. The reason for this apparent contradiction is that the phrases identified, methyldopa ad and captopril ad, where ad is a MEDLINE subheading meaning “administration and dosage,” were not used in the Raynaud’s literature, even though both of these drugs had been written about without the ad subheading. Methoxamine causes vasoconstriction and, as such, would be contraindicated.

Summary and Discussion

Our investigation suggests that latent semantic indexing might be a useful tool in literature-based discovery. Because of the difficulty of the task, literature-based discovery may be totally unsuccessful for certain problems, or by certain methods. LSI provides another technique that can be considered in looking to uncover hidden

We have shown that latent semantic indexing might be a useful technique in either of the two phases of literature-based discovery. During the search for intermediate literatures, it fairly closely reproduces (but extends) the same set of highly ranked terms and phrases that Gordon and Lindsay (1996) have shown are a useful starting point for literature-based discovery. In helping identify potential discovery literatures, LSI can be used in either of two ways: by factoring a set of documents associated with a suspected intermediate literature, or by analyzing the larger literature (MEDLINE, in this study) that forms the universe of discourse. In studying new discoveries in connection to Raynaud’s disease, the first method was able to identify fairly prominently a chief chemical constituent (eicosapentaenoic acid) in fish oil using the literature from a time when the healthful effects of fish oil on Raynaud’s were unknown. The phrase fish oil was not nearly as prominent. The second method revealed a very

It is important to remember that tools and analyses like those we have described in this paper support, but do not in any way replace, scientists. The skilled scientist may see patterns in data like those we report that derive from his or her knowledge of the field. One scientist may see, for example, that a particular class of drugs is prominently represented in the data and begin to form hypotheses about this drug class’s ability to treat Raynaud’s. A pharmacologist with a more complete background in the area may know that certain of these drugs are primarily known for their effects on the heart, rather than the peripheral vascular system. This type of knowledge could help isolate the drugs that truly merit scientific investigation by suggesting a more focused analysis.

The premise behind literature-based discovery support is that medical specialization makes it virtually impossible for a scientist to stay abreast of developments in areas outside his or her area of direct interest. As a consequence, important connections crossing disciplinary boundaries may never be noticed. Literature-based discovery support tools can help organize the knowledge of scientific fields that lie outside a scientist’s direct specialization, thus improving his or her ability to organize and make use of

LSI is one tool that may help in this effort. Additional research is needed to provide a broader array of tools. Among other tools that we are investigating are those for: (1) reporting data at several levels of abstraction (e.g., counting as statistical evidence for calcium channel blockers any drug that is in this drug family); (2) looking for evidence suggestive of “causal” relationships in the literature (which may be revealed independently of their statistical prominence); and (3) using semantic and category knowledge to improve the step of identifying terminal concepts, which is now a completely intellectual process. Through these efforts, we hope to provide scientists methods that support their efforts to generate discovery hypotheses that lie latent in the published literature.

Acknowledgments

This research was conducted while Michael Gordon was on sabbatical at Bellcore. He thanks Tom Landauer and Michael Lesk for that opportunity. This work benefitted from discussions with Tom Landauer, George Furnas, Jeff Zacks, Robert Lindsay (University of Michigan), and Don Swanson (University of Chicago). The authors also thank the anonymous referees for their careful reviews and their help in strengthening the medical and methodological arguments contained in the paper.

References

Deerwester, S., et al. (1990). Indexing by latent semantic analysis. Journal of the American Society for Information Science, 41, 391–
Forsythe, G. E., Malcolm, M. A., & Moler, C. B. (1977). Computer methods for mathematical computations (chapt. 9). Englewood
Gordon, M. D., & Lindsay, R. K. (1996). Toward discovery support systems: A replication, re-examination, and extension of Swanson’s work on literature-based discovery of a connection between Raynaud’s and fish oil. Journal of the American Society for Information
Swanson, D. R. (1986a). Fish oil, Raynaud’s syndrome, and undiscovered public knowledge. Perspectives in Biology and Medicine, 30, 7–
Swanson, D. R. (1986b). Undiscovered public knowledge. Library
Swanson, D. R. (1987). Two medical literatures that are logically but not bibliographically connected. Journal of the American Society for Information Science, 38, 228–233.
Swanson, D. R. (1988a). Migraine and magnesium: Eleven neglected connections. Perspectives in Biology and Medicine, 31, 526–557.
Swanson, D. R. (1988b). Unnoticed connections in the literature of medicine: Implications for knowledge representation and natural-language searching. 1988 ASIS Mid-Year Meeting, Ann Arbor, MI.
Swanson, D. R. (1989a). A second example of mutually isolated medical literatures related by implicit unnoticed connections. Journal of the American Society for Information Science, 40, 432–435.
Swanson, D. R. (1989b). Online search for logically related noninteractive medical literatures: A systematic trial and error strategy. Journal of the American Society for Information Science, 40, 356–358.
Swanson, D. R. (1989c). Medical literatures as a source of new knowledge. USDE Final Report, Dec. 1989.
Swanson, D. R. (1990a). Somatomedin C and Arginine: Implicit connections between mutually isolated literatures. Perspectives in Biology and Medicine, 33, 157–186.
Swanson, D. R. (1990b). Medical literature as a potential source of new knowledge. Bulletin of the Medical Library Association, 78, 29–37.
Swanson, D. R. (1991). Complementary structures in disjoint science literatures. Proceedings of the Fourteenth Annual International ACM SIGIR Conference (pp. 280–289).
Swanson, D. R. (1993). Intervening in the life cycles of scientific knowledge. Library Trends, 41, 606–631.
The Starting Dose of Levothyroxine in Primary Hypothyroidism Treatment A Prospective, Randomized, Double-blind Trial Annemieke Roos, MD; Suzanne P. Linn-Rasker, MD; Ron T. van Domburg, PhD;Jan P. Tijssen, PhD; Arie Berghout, MD, PhD, FRCP Background: The treatment of hypothyroidism with le- parable in the full-dose (n = 25) vs the low-dose groupvothyroxine is effective and simple; ho
QUALIFICATIONS SUMMARY 23 years of leadership, problem solving, portfolio and project management experience with a focus on execution and streamlining operations. PROBLEM SOLVING AND ANALYSIS Created and implemented Portfolio Management Office (PMO) metrics, reporting, trend analysis and forecasting. Led three productivity and cost-reduction engineering projects at GE with m