THE GROSSMAN-CORMACK GLOSSARY OF
TECHNOLOGY ASSISTED REVIEW
(Version 1.0, October 2012)
Copyright 2012 Maura R. Grossman and Gordon V. Cormack
“Disruptive technology” is a term that was coined by Harvard Business School professor Clayton M. Christensen, in his 1997 book The Innovator’s Dilemma
, to describe a new technology that unexpectedly displaces an established technology. The term is used in the business and technology literature to describe innovations that improve a product or service in ways that the market did not expect, typically by designing for a different set of consumers in the new market and later, by lowering prices in the existing market. Products based on disruptive technologies are typically cheaper to produce, simpler, smaller, better performing, more reliable, and often more convenient to use. Technology assisted review (“TAR”) is such a disruptive technology. Because disruptive technologies differ from sustaining technologies – ones that rely on incremental improvements to established technologies – they bring with them new features, new vernaculars, and other challenges.
The introduction of TAR into the legal community has brought with it much confusion because different terms are being used to refer to the same thing (e.g.
, “technology assisted review,” “computer-assisted review,” “computer-aided review,” “predictive coding,” and “content based advanced analytics,” to name but a few), and the same terms are also being used to refer to different things (e.g.
, “seed sets” and “control sample”). Moreover, the introduction of complex statistical concepts, and terms-of-art from the science of information retrieval, have resulted in widespread misunderstanding and sometimes perversion of their actual meanings.
This glossary is written in an effort to bring order to chaos by introducing a common framework and set of definitions for use by the bar, the bench, and service providers. The glossary endeavors to be comprehensive, but its definitions are necessarily brief. Interested readers may look elsewhere for detailed information concerning any of these topics. The terms in the glossary are presented in alphabetical order, with all defined terms in capital letters.
In the future, we plan to create an electronic version of this glossary that will contain live links, cross references, and annotations. We also envision this glossary to be a living, breathing work that will evolve over time. Towards that end, we invite our colleagues in the industry to send us their comments on our definitions, as well as any additional terms they would like to see included in the glossary, so that we can reach a consensus on a consistent, common language relating to technology assisted review. Comments can be sent to us ata
We hope you will find this glossary useful.
Accept on Zero Error
: A technique in which the training of a Machine Learning method is
gauged by taking a sample after each training step, and deeming the training process complete
when the learning method codes a sample with 0% Error (i.e.
, 100% Accuracy).
: The fraction of documents that are correctly coded by a search or review effort.
Note that Accuracy + Error = 100%, and that Accuracy = 100% – Error. While high Accuracy is
commonly advanced as evidence of an effective search or review effort, its use can be
misleading because it is heavily influenced by Prevalence. Consider, for example, a document
collection containing one million documents, of which ten thousand (or 1%) are Relevant. A
search or review effort that identified 100% of the documents as non-Relevant and therefore,
of the Relevant documents, would have 99% Accuracy, belying the failure of that
search or review effort.
: An Iterative Training regimen in which the Training Set is repeatedly
augmented by additional documents chosen by the Machine Learning Algorithm, and coded by
one or more Subject Matter Expert(s).
: The fraction of all documents that two reviewers code the same way. While high
Agreement is commonly advanced as evidence of an effective review effort, its use can be
misleading, for the same reason that the use of Accuracy can be misleading. When the vast
majority of documents in a collection are Not Relevant, a high level of Agreement will be
achieved when the reviewers agree that these documents are Not Relevant, irrespective of
whether or not they agree that any of the Relevant documents are Relevant.
: A formally specified series of computations that, when executed, accomplishes a
particular goal. The Algorithms used in E-Discovery are implemented as computer software.
Area Under the ROC Curve (“AUC”)
: From Signal Detection Theory, a summary measure
used to assess the quality of Prioritization. AUC is the Probability that a randomly chosen
Relevant document is given a higher priority than a randomly chosen Non-Relevant document.
An AUC score of 100% indicates a perfect ranking, in which all Relevant documents have higher
priority than all Non-Relevant documents. An AUC score of 50% means the Prioritization is no
better than random.
: An umbrella term for computer methods that emulate human judgment.
These include Machine Learning and Knowledge Engineering, as well as Pattern Matching (e.g.
voice, face, and handwriting recognition), robotics, and game playing.
Bag of Words
: A Feature Engineering method in which the Features of each document
comprise the set of words contained in that document. Documents are determined to be Relevant
or Not Relevant depending on what words they contain. Elementary Keyword Search and
Boolean Search methods, as well as some Machine Learning methods, use the Bag of Words
Bayes / Bayesian / Bayes’ Theorem
: A general term used to describe Algorithms and other
methods that estimate the overall Probability of some eventuality (e.g
., that a document is
Relevant), based on the combination of evidence gleaned from separate observations. In
Electronic Discovery, the most common evidence that is combined is the occurrence of particular
words in a document. For example, a Bayesian Algorithm might combine the evidence gleaned
from the fact that a document contains the words “credit,” “default,” and “swap” to indicate that
there is a 99% Probability that the document concerns financial derivatives, but only a 40%
Probability if the words “credit” and “default,” but not “swap,” are present. The most
elementary Bayesian Algorithm is Naïve Bayes, however most Algorithms dubbed “Bayesian”
are more complex. Bayesian Algorithms are named after Bayes’ Theorem, coined by the 18th
century mathematician, Thomas Bayes. Bayes’ Theorem derives the Probability of an outcome,
given the evidence, from the Probability of the outcome absent the evidence, plus the Probability
of the evidence, considering the outcome.
Bayesian Classifier / Bayesian Learning / Bayesian Filter
: A colloquial term used to describe
a Machine Learning Algorithm that uses a Bayesian Algorithm resembling Naïve Bayes.
The Probability that a random sample from a large population will
contain any particular number of Relevant documents, given the Prevalence of Relevant
documents in the population. Used as the basis for Binomial Estimation.
Binomial Estimation / Binomial Calculator
: A statistical method used to calculate Confidence
Intervals, based on the Binomial Distribution, that models the random selection of documents
from a large population. Binomial Estimation is generally more accurate, but less well known,
than Gaussian Estimation. A Binomial Estimate is substantially better than a Gaussian Estimate
(which, in contrast, relies on the Gaussian or Normal Distribution) when there are few (or no)
Relevant documents in the sample. When there are many Relevant and many Non-Relevant
documents in the sample, Binomial and Gaussian Estimates are nearly identical.
Blair and Maron:
Authors of an influential 1985 study that showed that skilled paralegals,
when instructed to find 75% or more of the Relevant documents from a document collection
using search terms and iterative search, found only 20%. That is, the searchers believed they had
achieved 75% Recall, but had in fact achieved only 20%. In the Blair and Maron study, the
paralegals used an iterative approach, examining the retrieved documents and refining their
search terms until they believed they were done. Many current commentaries incorrectly
distinguish the Blair and Maron study from current iterative approaches, failing to note that the
Blair and Maron subjects did in fact refine their search terms based on their review of the
documents that were returned from the searches.
: A Keyword Search in which the Keywords are combined using operators such
as “AND,” “OR,” and “[BUT] NOT.” The result of a Boolean Search is precisely determined by
the words contained in the documents. (See also
Bag of Words method.)
: The process of coding all members of a group of documents (identified, for
example, by De-duplication, Near-deduplication, Email Threading, or Clustering) based on the
review of only one or a few members of the group. Also referred to as Bulk Tagging.
Classical or Gaussian Estimation / Classical, Normal, or Gaussian Calculator
: A method of
calculating Confidence Intervals based on the assumption that the quantities to be measured
follow a Gaussian (Normal) Distribution. This method is most commonly taught in introductory
statistics courses, but yields unreasonably large Confidence Intervals when the proportion of
items with the characteristic being measured is small. (C.f.
Binomial Estimation / Binomial
: An Unsupervised Learning method in which documents are segregated into
categories or groups so that the documents in any group are more similar to one another than to
those in other groups. Clustering involves no human intervention, and the resulting categories
may or may not reflect distinctions that are valuable for the purpose of search and review.
Technology Assisted Review.
Technology Assisted Review.
: A marketing term used to describe Keyword expansion techniques, which
allow search methods to return documents beyond those that would be returned by a simple
Keyword or Boolean Search. Methods range from simple techniques like Stemming, Thesaurus
Expansion, and Ontology search, through statistical Algorithms like Latent Semantic Indexing.
: In Statistics, a range of values estimated to contain the true value, with a
particular Confidence Level.
: In statistics, the chance that a Confidence Interval derived from a Random
Sample will include the true value. For example, “95% Confidence” means that if we were to
draw 100 independent Random Samples of the same size, and compute the Confidence Interval
from each sample, 95 of the 100 Confidence Intervals would contain the true value. It is
important to note that the Confidence Level is not
the Probability that the true value is contained
in any particular Confidence Interval; it is the Probability that the method of estimation will yield
a Confidence Interval that contains the true value.
: A two by two table listing values for the number of True Negatives (“TN”),
False Negatives (“FN”), True Positives (“TP”), and False Positives (“RP”) resulting from a
search or review effort. As shown below, all of the standard evaluation measures are algebraic
combinations of the four values in the Confusion Matrix. Also referred to as a Contingency
Table. An example of a Confusion Matrix (or Contingency Table) is provided immediately
Coded Non-Relevant True Negatives (“TN”)
Accuracy = 100% – Error = (TP+TN) / (TP+TN+FP+FN)
Error = 100% – Accuracy = (FP+FN) / (TP+TN+FP+FN)
Elusion = 100% – Negative Predictive Value = FN / (FN+TN)
Fallout = False Positive Rate = 100% – True Negative Rate = FP / (FP+TN)
Negative Predictive Value = 100% – Elusion = TN / (TN+FN)
Precision = Positive Predictive Value = TP / (TP+FP)
Prevalence = Yield = Richness = (TP+FN) / (TP+TN+FP+FN)
Recall = True Positive Rate = 100% – False Negative Rate = TP / (TP+FN) Content Based Advanced Analytics (“CBAA”)
Technology Assisted Review.
: A Random Sample of documents coded at the outset of a search or review process,
that is separate from and independent of the Training Set. Control Sets are used in some
Technology Assisted Review processes. They are typically used to measure the effectiveness of
the Machine Learning Algorithm at various stages of training, and to determine when training
: The practice of narrowing a large data set to a smaller data set for purposes of review,
based on objective or subjective criteria, such as file types, date restrictors, or Keyword Search
terms. Documents that do not match the criteria are excluded from the search and any further
: A given score or rank in a prioritized list, resulting from a Relevance Ranking search or
Machine Learning Algorithm, such that the documents above the Cutoff are deemed to be
Relevant and documents below the Cutoff are deemed to be Non-Relevant. In general, a higher
cutoff will yield higher Precision and lower Recall, while a lower cutoff will yield lower
Precision and higher Recall. Also referred to as a Threshold.
Da Silva Moore
: Da Silva Moore
v. Publicis Groupe
, Case No. 11 Civ. 1279 (ALC) (AJP),
2012 WL 607412 (S.D.N.Y. Feb. 24, 2012), aff’d
2012 WL 1446534 (S.D.N.Y. Apr. 26, 2012).
The first federal case to recognize Computer Assisted Review as “an acceptable way to search
for relevant ESI in appropriate cases.” The opinion was written by Magistrate Judge Andrew J.
Peck and affirmed by District Judge Andrew L. Carter.
: A step-by-step method of distinguishing between Relevant and Non-Relevant
documents, depending on what combination of words (or other Features) they contain. A Decision Tree to identify documents pertaining to financial derivatives might first determine whether or not a document contained the word “swap.” If it did, the Decision Tree might then determine whether or not the document contained “credit,” and so on. A Decision Tree may be created either through Knowledge Engineering or Machine Learning.
: A method of replacing multiple identical copies of a document by a single
instance of that document. De-duplication can occur within the data of a single custodian (also
referred to as Vertical De-Duplication), or across all custodians (also referred to as Horizontal
: The collection of documents or electronically stored information about
which a Statistical Estimation may be made.
Electronic Discovery / E-Discovery
: The process of identifying, preserving, collecting,
processing, searching, reviewing and producing electronically stored information (“ESI”) that
may be Relevant to a civil, criminal, or regulatory matter.
: The fraction of documents identified as Non-Relevant by a search or review effort, that
are in fact Relevant. Elusion is computed by taking a Random Sample from the Null Set and
determining how many (or what Proportion of) documents are actually Relevant. A low Elusion
value has commonly been advanced as evidence of an effective search or review effort (see
), but that can be misleading because absent an estimate of Prevalence, it conveys no
information about the search or review effort. Consider, for example, a document collection
containing one million documents, of which ten thousand (or 1%) are Relevant. A search or
review effort that found none
of the Relevant documents would have 1% Elusion, belying the
failure of the search. Elusion = 100% – Negative Predictive Value.
: Grouping together email messages that are part of the same discourse, so that
they may be understood, reviewed, and coded consistently, as a unit.
v. HOA Holdings LLC
, Civ. Action No. 7409-VCL, tr. and slip op. (Del. Ch.
Oct. 19, 2012). The first case in which a court sua sponte
directed the parties to use Predictive
Coding as a replacement for Manual Review (or to show cause why this was not an appropriate
case for Predictive Coding), absent any party’s request to do so. Vice Chancellor J. Travis
Laster also ordered the parties to use the same e-discovery vendor and to share a document
Error / Error Rate
: The fraction of all documents that are incorrectly coded by a search or
review effort. Note that Accuracy + Error = 100%, and that 100% – Accuracy = Error. While a
low Error Rate is commonly advanced as evidence of an effective search or review effort, its use
can be misleading because it is heavily influenced by Prevalence. Consider, for example, a
document collection containing one million documents, of which ten thousand (or 1%) are
relevant. A search or review effort that found none
of the relevant documents would have 1% Error, belying the failure of the search or review effort.
: The Harmonic Mean of Recall and Precision, often used in Information Retrieval studies as
a measure of the effectiveness of a search or review effort, which accounts for the tradeoff
between Recall and Precision. In order or achieve a high F1 score, a search or review effort must
achieve both good Recall and good Precision.
False Positive Rate.
False Negative (“FN”)
: A Relevant document that is missed (i.e.
, incorrectly identified as Non-
Relevant) by a search or review effort. Also known as a Miss.
False Negative Rate (“FNR”)
: The fraction (or Proportion) of Relevant documents that are
, incorrectly identified as Non-Relevant) by a search or review effort. Note that False
Negative Rate + Recall = 100%, and that 100% – Recall = False Negative Rate.
False Positive (“FP”)
: A Non-Relevant document that is incorrectly identified as Relevant by a
search or review effort.
False Positive Rate (“FPR”)
: The fraction (or Proportion) of Non-Relevant documents that are
incorrectly identified as Relevant by a search or review effort. Note that False Positive Rate +
True Negative Rate = 100%, and that 100% – True Negative Rate = False Positive Rate. In
Information Retrieval, also known as Fallout.
: The process of identifying Features of a document that are used as input
to a Machine Learning Algorithm. Typical Features include words and phrases, as well as
metadata such as subjects, titles, dates, and file formats. One of the simplest and most common
Feature Engineering techniques is Bag of Words. More complex Feature Engineering techniques
include Ontologies and Latent Semantic Indexing.
: The units of information used by a Machine Learning Algorithm to classify
documents. Typical Features include text fragments such as words or phrases, and metadata
such as sender, recipient, and sent date. See
also Feature Engineering.
: A search method that identifies documents that are similar to a particular
exemplar. Find Similar is commonly misconstrued to be the mechanism behind Technology-
: A graph that shows the Recall that would be achieved for a particular Cutoff. The
Gain Curve directly relates the Recall that can be achieved to the effort that must be expended to
achieve it, as measured by the number of documents that must be reviewed and coded.
Gaussian Estimation / Gaussian Calculator
: A method of Statistical Estimatation based on the
: Global Aerospace Inc.
v. Landow Aviation
, Consol. Case No. CL 61040,
2012 WL 1431215 (Va. Cir. Ct. Apr. 23, 2012). The first State Court Order approving the use
of Predictive Coding by the producing party, over the objection of the requesting party, without
prejudice to the requesting party raising an issue with the Court as to the completeness or the
contents of the production, or the ongoing use of Predictive Coding. The Order was issued by
Loudoun County Circuit Judge James H. Chamblin.
: The reciprocal of the average of the reciprocals of two or more quantities. If
the quantities are named a
, their Harmonic Mean is 1 . In information Retrieval, F1 is the
Harmonic Mean of Recall and Precision. The Harmonic Mean, unlike the more common arithmetic mean (i.e.
, average), falls closer to the lower of the two quantities.
: De-duplication of documents across multiple custodians. (Cf.
: A list of Keywords in which each Keyword is accompanied by a list of the documents
(and sometimes the positions within the documents) where it occurs. Manual indices have been
used in books for centuries; automatic indices are used in Information Retrieval systems to
identify Relevant documents.
: The manual or automatic process of creating an Index. In Electronic Discovery,
Indexing typically refers to the automatic constriction of an electronic Index for use in an
Information Retrieval System.
In Re: Actos
: In Re: Actos (Pioglitazone) Products Liability Litigation
, MDL No. 6:11-md-
2299 (W.D. La July 27, 2012). A product liability action with a Case Management Order
(“CMO”) that memorializes the parties’ agreement on a “search methodology proof of concept to
evaluate the potential utility of advanced analytics as a document identification mechanism for
the review and production” of ESI. The search protocol provides for the use of a technology
assisted review tool on the email of four key custodians. The CMO was signed by District Judge
Rebecca F. Doherty.
: In Information Retrieval, the information being sought in a search or review
effort. In E-Discovery, the Information Need is typically to identify documents responsive to a
request for production, or to identify documents that are subject to privilege or work-product
: The science of how to find information to meet an Information Need.
While modern Information Retrieval relies heavily on computers, the discipline predates the
invention of computes.
Internal Response Curve:
From Signal Detection Theory, a method of estimating the number
of Relevant and Non-Relevant documents in a population, or above and below a particular
Cutoff, assuming that the scores yielded by a Learning Algorithm for Relevant documents obey a Gaussian Distribution, and the scores for Non-Relevant documents obey a different Gaussian Distribution.
Subcategories of the overall Information Need to be identified in a search or
: The process of repeatedly augmenting the Training Set with additional
examples of coded documents until the effectiveness of the Machine Learning Algorithm reaches
an acceptable level. The additional examples may be identified though Judgmental Sampling,
Random Sampling, or by the Machine Learning Algorithm, as in Active Learning.
: The number of documents identified as Relevant by two reviewers, divided by
the number of documents identified as Relevant by one or both. Used as a measure of
consistency among review efforts. Also referred to as Overlap or Mutual F1. Empirical studies
have shown that expert reviewers commonly achieve Jaccard Index scores of about 50%, and
scores exceeding 60% are very rare.
Judgmental Sample / Judgmental Sampling
: A method in which a sample of the document
collection is drawn subjectively, so as to include the “most interesting” documents by some
criterion. Unlike a Random Sample, the statistical properties of a Judgmental Sample may not
be extrapolated to the entire collection. However, an individual (such as a quality assurance
auditor or an adversary) may use judgmental sampling to attempt to uncover defects. The failure
to identify defects may be taken as evidence (albeit not statistical evidence, and certainly not
proof) of the absence of defects.
A word that is used as part of a query for a Keyword Search.
: A search in which all documents that contain one or more specific Keywords
: Kleen Products LLC
v. Packaging Corp. of Am
., Case No. 1:10-cv-05711 (N.D. Ill.)
(Nolan, M.J.). A federal case in which plaintiffs sought to compel defendants to use Content
Based Advanced Analytics (“CBAA”) for their production, after defendants had already
employed a complex Boolean Search to identify Responsive documents. Defendants advanced
Elusion scores of 5%, based on a Judgmental Sample of custodians, to defend their use of the
Boolean Search. After two days of evidentiary hearings before, and many conferences with,
Magistrate Judge Nan R. Nolan, plaintiffs withdrew their request for CBAA, without prejudice.
: The process of capturing the expertise of a Subject Matter Expert in a
form (typically a Rule Base) that can be executed by a computer to emulate the human's
Latent Semantic Analysis
Latent Semantic Indexing.
Latent Semantic Analysis (“LSA”) / Latent Semantic Indexing (“LSI”):
Engineering Algorithm that uses linear algebra to group together correlated Features. For
example, "Windows, Gates, Ballmer" might be one group, while "Windows, Gates, Doors"
might be another. Latent Semantic Indexing underlies many Concept Search tools. While Latent
Semantic Indexing is used for Feature Engineering in some Technology Assisted Review tools, it
is not, per se
, a Technology Assisted Review method.
: A document-by-document Manual Review in which the documents are
examined in a prescribed order, typically chronological.
A state-of-the-art Supervised Learning Algorithm for Machine Learning
that estimates the Probability that a document is Relevant, based on the Features that it contains.
In contrast to the Naïve Bayes, algorithm, Logistic Regression identifies Features that
discriminate between Relevant and Non-Relevant documents.
The use of a computer Algorithm to organize or classify documents by
analyzing their Features. In the context of Technology Assisted Review, Supervised Learning
, Support Vector Machines, Logistic Regression, Nearest Neighbor, and
Bayesian Classifiers) are used to infer Relevance or Non-Relevance of documents based on the
Coding of documents in a Training Set. In Electronic Discovery generally, Unsupervised
Learning Algorithms are used for Clustering, Near-Duplicate Detection, and Concept Search.
The practice of having human reviewers individually read and code a
collection of documents for Responsiveness, particular issue codes, privilege, and/or
Margin of Error:
The maximum amount by which a Statistical Estimate might likely deviate
from the true value, typically expressed as “plus or minus” a percentage, with a particular
Confidence Level. For example, one might express a Statistical Estimate as “30% of the
documents in the population are Relevant, plus or minus 3%, with 95% confidence.” This means
that the Statistical Estimate is 30%, the Confidence Interval is 27% to 33%, and the Confidence
Level is 95%. Using Gaussian Estimation, the Margin of Error is one-half of the size of the
Confidence Interval. It is important to note that when the Margin of Error is expressed as a
percentage, it refers to a percentage of the Population, not a percentage of the Statistical
Estimate. In the current example, if there are one million documents in the Population, the
Statistical Estimate may be restated as “300,000 documents in the Population are Relevant, plus
or minus 30,000 documents, with 95% confidence.” The Margin of Error is commonly
misconstrued to be a percentage of the Statistical Estimate. However, it is incorrect to interpret
the Confidence Interval in this example to mean that “300,000 documents in the Population are
Relevant, plus or minus 9,000 documents.” The fact that a Margin of Error of “plus or minus
3%” has been achieved is not, by itself, evidence of an accurate Statistical Estimate when the
Prevalence of Relevant documents is low.
A Relevant document that is not identified by a search or review effort. Also referred to
as a False Negative.
: The fraction (or proportion) of truly Relevant documents that are not identified by a
search or review effort. Miss Rate = 100% – Recall. Also referred to as the False Negative Rate.
A Supervised Machine Learning Algorithm in which the relative frequency of
words (or other Features) in Relevant and Non-Relevant training examples is used to estimate the
likelihood that a new document containing those words (or other Features) is Relevant. Naïve
Bayes relies on the simplistic assumption that the words in a document occur with independent
Probabilities, with the consequence that it tends to yield extremely low or extremely high
estimates. For this reason, Naïve Bayes is rarely used in practice.
: National Day Laborer Organizing Network
v. U.S. Immigration and Customs
, Case No. 10-Civ-3488 (SAS), 2012 WL 2878130 (S.D.N.Y.), a Freedom
of Information Act (“FOIA”) case in which District Judge Shira A. Scheindlin held that “most
custodians cannot be ‘trusted’ to run effective searches because designing legally sufficient
electronic searches in the discovery or FOIA contexts is not part of their daily responsibilities,”
and stated (in dicta
) that “beyond the use of keyword search, parties can (and frequently should)
rely on latent semantic indexing, statistical probability models, and machine learning to find
responsive documents. Through iterative learning, these methods (known as ‘computer-assisted’
or ‘predictive’ coding) allow humans to teach computers what documents are and are not
responsive to a particular FOIA or discovery request and they can significantly increase the
effectiveness and efficiency of searches.”
A method of grouping together “nearly identical” documents. A
variant of Clustering in which the similarity among documents in the same group is very strong.
Near-Duplicate Detection is typically used to reduce review costs, and to ensure consistent
A Supervised Machine Learning Algorithm in which a new document is
classified by finding the most similar document in the Training Set, and assuming that the correct
coding for the new document is the same as the most similar one in the Training Set.
Negative Predictive Value (“NPV”)
: The fraction (Proportion) of documents in a collection
that are not identified as Relevant by a search or review effort, and which are indeed Not
Relevant. Akin to Recall, when the meanings of Relevant and Non-Relevant are transposed.
Non-Relevant / Not Relevant
: In Information Retrieval, a document is considered Non-
Relevant (or Not Relevant) if it does not meet the Information Need of the search or review
effort. The synonym “irrelevant” is rarely used in Information Retrieval.
Normal / Gaussian Distribution:
The “bell curve” of classical statistics. The number of
Relevant documents in a sample tends to obey a normal distribution, provided the sample is large
enough to contain a substantial number of Relevant and Non-Relevant documents. In this
situation, Gaussian Estimation is reasonably accurate. If there are few Relevant documents in
the sample, the Binomial Distribution better characterizes the number of Relevant documents in
the sample, and Binomial Estimation is more appropriate.
The set of documents that are not identified to be potentially Relevant by a search or
review process; the set of documents that have been labeled as Not Relevant by a search or
A representation of the relationships among words and their meanings that is richer
than a Taxonomy. For example, an Ontology can represent the fact that a wheel is a part of a
bicycle, or that gold is yellow, and so on.
: The science of designing computer Algorithms to recognize natural
phenomena like parts of speech, faces, or spoken words.
: The Probability that, if one reviewer codes a document as Relevant, a
second independent reviewer will also code the document as Relevant. Empirical studies show
that Positive Agreement Rates of 70% are typical, and Positive Agreement rates of 80% are rare.
Positive Agreement should not be confused with Agreement (which is a less informative
measure) or Overlap (which is a numerically smaller but similarly informative measure). Under
the assumption that the two reviewers are equally likely to err, Overlap is roughly equal to the
square of Positive Agreement. That is, if Positive Agreement is 70%, Overlap is roughly 70% x
70% = 49%.
Positive Predictive Value (“PPV”)
Precision. Positive Predictive Value is a term of art in
Signal Detection Theory; Precision is the equivalent term in Information Retrieval.
The fraction of documents identified by a search or review effort that are Relevant to
the information request. Precision must be balanced against Recall. Also referred to as Positive
A Technology Assisted Review process involving the use of a Machine
Learning Algorithm to distinguish Relevant from Non-Relevant documents, based on Subject
Matter Expert(s) coding of a Training Set of documents. See
Supervised Learning and Active
The fraction of documents in a collection that are Relevant to an Information Need.
Also known as Yield or Richness.
Probabilistic Latent Semantic Analysis:
A variant on Latent Semantic Analysis based on
conditional Probability rather than correlation.
: The fraction (proportion) of times that a particular outcome would occur,
should the same action be repeated under the same conditions an infinite number of times.
For example, if one were to flip a fair coin, the Probability of it landing “heads”
one repeats this action indefinitely, the fraction
of times that the coin lands “heads” will become indistinguishable from 50%. If one were to flip two fair coins, the Probability of both landing “heads” is one-quarter, or 25%.
: The fraction of a set of documents having some particular property (typically
: Pursuant to Federal Rule of Civil Procedure 26(b)(2)(C), the legal doctrine
that electronically stored information may be withheld from production if the cost and burden of
producing it exceeds its potential value to the resolution of the matter.
A method to ensure, after the fact, that a search or review effort has
achieved reasonable results.
Ongoing methods to ensure, during a search or review effort, that reasonable
results are being achieved.
Random Sample / Random Sampling
: A subset of the Document Population selected by a
method that is equally likely to select any document from the population for inclusion in the
sample. Random Sampling is the basis of Statistical Estimation.
The fraction of Relevant documents that are identified by a search or review effort.
Recall must be balanced against Precision.
: The curve representing the tradeoff between Recall and Precision for a
given search or review effort, depending on the chosen Cutoff value.
Receiver Operating Characteristic Curve (“ROC”)
: In Signal Detection Theory, a graph of
the tradeoff between True Positive Rate and False Positive Rate, as the Cutoff is varied.
An Active Learning process in which the documents with the highest
likelihood of Relevance are coded by a human, and added to the training set.
A search method in which the results are ranked from the most likely
through the least likely to be Relevant to an Information Need. Google search is an example of
Relevant / Relevance:
In Information Retrieval, a document is considered Relevant if it meets
the Information Need of the search or review effort.
A document that is Relevant to the Information Need expressed by a particular
request for production or subpoena in a civil, criminal, or regulatory matter.
The fraction (or Proportion) of documents in a collection that are Relevant. Also
known as Prevalence or Yield.
A set of rules created by an expert to emulate the human decision-making process
for the purposes of classifying documents in the context of e-discovery.
The number of documents drawn at random that are used to calculate a Statistical
Search Term: See
The initial Training Set provided to the learning Algorithm in an Active Learning
process. The documents in the Seed Set may be selected based on Random Sampling or
Judgmental Sampling. Some commentators use the term more restrictively to refer only to
documents chosen using Judgmental Sampling. Other commentators use the term generally to
mean any Training Set, including the final Training Set in Iterative Training, or the only Training
Set in non-iterative training.
Signal Detection Theory:
Co-invented at the same time and in conjunction radar, the science of
distinguishing true observations from spurious ones. Signal Detection Theory is widely used in
radio engineering and medical diagnostic testing. The terms True Positive, True Negative, False
Positive, False Negative, and Area Under the ROC Curve (“AUC”) all arise from Signal
Significance [of a statistical test]:
The confirmation, with a given Confidence Level, of a prior
hypothesis, using a Statistical Estimate. The Statistical Estimate is said to be “statistically
significant” if all values within the Confidence Interval for the desired Confidence Level
(typically 95%) are consistent with the hypothesis being true, and inconsistent with it being false.
For example, if our hypothesis is that fewer than 300,000 documents are Relevant, and a
Statistical Estimate shows that, 290,000 documents are Relevant, plus or minus 5,000
documents, we say that the result is significant. On the other hand, if the Statistical Estimate
shows that 290,000 documents are Relevant, plus or minus 15,000 documents, we say that the
result is not significant, because the Confidence Interval includes values (i.e
., the values between
300,000 and 305,000) that contradict the hypothesis.
Statistical Estimate / Statistical Estimation:
The act of estimating the Proportion of a
Document Population that have a particular characteristic, based on the Proportion of a Random
Sample that have the same characteristic. Methods of Statistical Estimation include Gaussian
Estimation and Binomial Estimation.
A method in which a sample of the Document Population is drawn at
random, so that statistical properties of the sample may be extrapolated to the collection.
In Keyword or Boolean Search, or Feature Engineering, the process of equating all
forms of the same root word. For example, the words “stem,” “stemming,” “stemmed,” and
“stemmable” would all be treated as equivalent, and would each yield the same result when used
as a Search Terms In some search systems, stemming is implicit and in others, it must be made
explicit through particular search syntax.
Subject Matter Expert(s):
An individual (typically, but not necessarily, a senior attorney) who
is familiar with the Information Need and can render an authoritative opinion as to whether a
document is Relevant or not.
A Machine Learning method in which the learning Algorithm infers how
to distinguish between Relevant and Non-Relevant documents using a Training Set. Supervised
Learning can be a stand-alone process, or the final step in Active Learning,
Support Vector Machine:
A state-of-the-art Supervised Learning Algorithm for machine
learning that separates Relevant from Non-Relevant documents using geometric methods (i.e.
geometry). Each document is considered to be a point in [hyper]space, whose coordinates are
determined from the Features contained in the document. The Support Vector Machine finds a
[hyper]plane that best separates the Relevant from Non-Relevant Training documents.
Documents outside the Training Set (i.e.
, uncoded documents from the Document Population) are then classified as Relevant or not, depending on which side of the [hyper]plane they fall on. Although a Support Vector Machine does not calculate a Probability of Relevance, one may infer that the classification of documents closer to the [hyper]plane is more doubtful than those that are far from the [hyper]plane. In contrast the Naïve Bayes, and similar to Logistic Regression, Support Vector Machines identify Features that discriminate between Relevant and Non-Relevant documents.
A hierarchical organization scheme that arranges the meanings of words into
classes and subclasses. For example, Fords and Chryslers are cars; cars, trucks, and bicycles are
vehicles, and vehicles, aircraft, and ships are modes of transportation.
Technology Assisted Review (“TAR”):
A process for prioritizing or coding a collection of
electronic documents using a computerized system that harnesses human judgments of one or
more Subject Matter Expert(s) on a smaller set of documents and then extrapolates those
judgments to the remaining Document Population. Some TAR methods use Algorithms that
determine how similar (or dissimilar) each of the remaining documents is to those coded as
Relevant (or Non-Relevant, respectively) by the Subject Matter Experts(s), while other TAR
methods derive systematic rules that emulate the expert(s) decision-making process. TAR
systems generally incorporate statistical models and/or sampling techniques to guide the process
and to measure overall system effectiveness.
Term Frequency and Inverse Document Frequency (“TF-IDF”)
: An enhancement to the
Bag of Words model in which each word has a weight based on term frequency – the number of
times the word appears in the document – and inverse document frequency – the number of
documents in which the word appears.
In Keyword or Boolean Search, replacing a single Search Term by a list
of its synonyms, as listed in a thesaurus.
A sample of documents coded by one or more Subject Matter Expert(s) as
Relevant or Non-Relevant, from which a Machine Learning Algorithm then infers how to
distinguish between Relevant and Non-Relevant documents beyond those in the Training Set.
The Text REtrieval Conference, sponsored by the National Institute of Standards and
Technology (“NIST”), which has run since 1992 to “support research within the information
retrieval community by providing the infrastructure necessary for large-scale evaluation of text
retrieval methodologies. In particular, the TREC workshop series has the following goals: to
encourage research in information retrieval based on large test collections; to increase
communication among industry, academia, and government by creating an open forum for the
exchange of research ideas; to speed the transfer of technology from research labs into
commercial products by demonstrating substantial improvements in retrieval methodologies on
real-world problems; and to increase the availability of appropriate evaluation techniques for use
by industry and academia, including development of new evaluation techniques more applicable
to current systems.”
TREC Legal Track:
From 2006 through 2011, TREC included a Legal Track, which sought
“to assess the ability of information retrieval techniques to meet the needs of the legal profession
for tools and methods capable of helping with the retrieval of electronic business records,
principally for use as evidence in civil litigation.”
True Negative (“TN”):
A Non-Relevant document that is correctly identified as Non-Relevant
by a search or review effort.
True Negative Rate (“TNR”):
The fraction (or Proportion) of Non-Relevant documents that
are correctly identified as Non-Relevant by a search or review effort.
True Positive (“TP”):
A Relevant document that is correctly identified as Relevant by a search
or review effort.
True Positive Rate (“TPR”):
The fraction (or Proportion) of Relevant documents that are
correctly identified as Relevant by a search or review effort.
An Active Learning approach in which the Machine Learning
Algorithm selects the documents as to which it is least certain about Relevance, for coding by the Subject Matter Expert(s), and addition to the Training Set.
A Machine Learning method in which the learning Algorithm infers
categories of similar documents without any training by a Subject Matter Expert. Examples of
Unsupervised Learning methods include Clustering and Near-Duplicate Detection.
The act of confirming that a process has achieved its intended purpose. Validation
may be statistical or judgmental.
: De-duplication within a custodian; identical copies of a document
held by different custodians are not De-duplicated. (Cf.
The fraction (or Proportion) of a Document Population that are Relevant. Also known as
Prevalence or Richness.
The American Journal of Drug and Alcohol Abuse , 37:1–11, 2011Copyright © Informa Healthcare USA, Inc. ISSN: 0095-2990 print / 1097-9891 onlineDOI: 10.3109/00952990.2010.540279 Pharmacokinetic drug interactions and adverse consequences between psychotropic medications and pharmacotherapy for the treatment of opioid dependence Ali S. Saber-Tehrani, M.D., Robert Douglas Bruce, M.D., M.A., M
Chapter 4 Conventional Medical Therapies “Today’s standard, AMA-approved medicine is rooted in treating symptoms, rather than causes. Its dependence on drugs and surgery is ruinously expensive to patients, insurance companies, “Why I Left Orthodox Medicine” Conventional medical treatments for FMS and CFS is a controversial topic. Consider the following statements