What is Text? Content-based Structure
ATHENS, Greece (AP) A strong earthquake shook the Aegean Sea island of Crete on Sunday but caused no injuries or damage. The quake had a preliminary magnitude of 5.2 and occurred at 5:28 am (0328 GMT) on the sea floor 70 kilometers (44 miles) south of the Cretan port of Chania. The Athens seismological institute said the temblor’s epicenter was located 380 kilometers (238 miles) south of the capital. No injuries or damage were reported.
• Describe the strength and the impact of an
What is Text? Domain-dependent Text Structures
A product of structural relations (coherence)
S1: A strong earthquake shook the Aegean Sea island of Crete on Sunday
Regina Barzilay
S2: but caused no injuries or damage.
S3: The quake had a preliminary magnitude of 5.2
March 1, 2003
Analogy with Syntax
Domain-independent Theory of Sentence Structure
• Fixed set of word categories (nouns, verbs, . . .)
• Fixed set of relations (subject, object, . . .)
Motivation
Extract a representative subsequence from a set of sentences
Find an answer to a question in natural language
Order a set of information-bearing items into a coherent
Find the best translation taking context into account
Rhetorical Structure
Two Approaches to Text Structure
– Content-based models
– Rhetorical models
– Rhetorical Structure Theory (Next Class)
Motivation
• Scientific articles exhibit (consistent across
– BACKGROUND
– OWN CONTRIBUTION
– RELATION TO OTHER WORK
• Automatic structure analysis can benefit:
– Q&A
– summarization
– citation analysis
Argumentative Zoning
Many of the recent advances in Question Answering have followed from the insight that systems can benefit by exploiting the redundancy in large corpora.
Brill et al. (2001) describe using the vast amount of data available on the WWW to achieve impressive performance . . .
The Web, while nearly infinite in content, is not a complete repository of useful information . . .
In order to combat these inadequacies, we propose a strategy in which information is extracted from . . .
Today: Domain-Specific Models
– Argumentative Zoning of Scientific Articles
– Supervised (Duboue & McKeown, 2001)
– Unsupervised (Barzilay & Lee, 2004)
Argumentative Zoning
BACKGROUND: Many of the recent advances in Question Answering have followed from the insight that systems can benefit by exploiting the redundancy in large corpora.
Brill et al. (2001) describe using the vast amount of data available on the WWW to achieve impressive performance . . .
The Web, while nearly infinite in content, is not a complete repository of useful information . . .
OWN CONTRIBUTION: In order to combat these inadequacies, we propose a strategy in which information is extracted from . . .
Examples
We have proposed a method of clustering words
Section 2 describes three parsers which are . . .
Contrast: However, no method for extracting the relationship from superficial linguistic expressions was
Features
• Lexical Features (“other researchers claim that”)
Kappa Statistics
(Siegal & Castellan, 1998; Carletta, 1999)
Kappa corrects the observed agreement P(A) for chance agreement P(E): K = (P(A) − P(E)) / (1 − P(E))
Approach
• Goal: Rhetorical segmentation with labeling
– Own work: aim, own, textual
– Background
– Other Work: contrast, basis, other
Supervised Content Modeling
• Goal: Find types of semantic information characteristic of a domain and ordering constraints
• Approach: find patterns in a set of transcripts
Semantic Sequence
age, gender, pmh, pmh, pmh, pmh, med-preop, med-preop, med-preop, drip-preop, med-preop,
ekg-preop, echo-preop, hct-preop, procedure, . . .
Annotated Transcript
He is 58-year-old male. History is significant for Hodgkin’s disease, treated with . . . to his neck, back and chest. Hyperspadias, BPH, hiatal hernia and proliferative lymph edema in his right arm. No IV’s or blood pressure down in the left arm. [pmh]
Medications: Inderal, Lopid, Pepcid, nitroglycerine and heparin. [med-preop, drip-preop]
EKG has PAC’s. . . .
Example of Learned Pattern
Content Models
• Content models represent topics and their ordering
Topics: “strength”, “location”, “casualties”, . . .
Order: “casualties” prior to “rescue efforts”
• Assumption: Patterns in content organization are
Evaluation
Pattern Detection
Analogous to motif detection
T1: A B C D F A A B F D
T2: F C A B D D F F
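The motif-detection analogy can be illustrated with a minimal sketch that treats each transcript as a sequence of semantic tags and looks for short contiguous patterns shared across sequences. The function names, the restriction to contiguous n-grams, and the `min_support` parameter are assumptions made for illustration; the slides do not say which motif-detection algorithm is used.

```python
from collections import Counter

def ngrams(seq, n):
    """Return all contiguous n-grams of a sequence as tuples."""
    return [tuple(seq[i:i + n]) for i in range(len(seq) - n + 1)]

def shared_patterns(sequences, n, min_support):
    """Find n-grams that occur in at least `min_support` sequences."""
    support = Counter()
    for seq in sequences:
        # Count each pattern once per sequence, so support = number of sequences.
        support.update(set(ngrams(seq, n)))
    return {p for p, count in support.items() if count >= min_support}

# The two tag sequences from the slide.
t1 = "A B C D F A A B F D".split()
t2 = "F C A B D D F F".split()

patterns = shared_patterns([t1, t2], n=2, min_support=2)
print(sorted(patterns))  # [('A', 'B'), ('D', 'F')]
```

Real motif detection usually also allows gaps and approximate matches; this sketch only covers the exact, contiguous case.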
Computing Content Model
• State-transitions represent ordering constraints
Similarity in Domain Texts
TOKYO (AP) A moderately strong earthquake with a preliminary magnitude reading of 5.1 rattled northern Japan early Wednesday, the Central Meteorological Agency said. There were no immediate reports of casualties or damage. The quake struck at 6:06 am (2106 GMT) 60 kilometers (36 miles) beneath the Pacific Ocean near the northern tip of the main island of Honshu. . . .
ATHENS, Greece (AP) A strong earthquake shook the Aegean Sea island of Crete on Sunday but caused no injuries or damage. The quake had
a preliminary magnitude of 5.2 and occurred at 5:28 am (0328 GMT) on the sea floor 70 kilometers (44 miles) south of the Cretan port of Chania. The Athens seismological institute said the temblor’s epicenter was located 380 kilometers (238 miles) south of the capital. . . .
Narrative Grammars
• Propp (1928): fairy tales follow a “story grammar”
• Bartlett (1932): formulaic text structure facilitates
• Wray (2002): texts in multiple domains exhibit
Similarity in Domain Texts
TOKYO (AP) A moderately strong earthquake with a preliminary magnitude reading of 5.1 rattled northern Japan early Wednesday, the Central Meteorological Agency said. There were no immediate reports of casualties or damage. The quake struck at 6:06 am (2106 GMT) 60 kilometers (36 miles) beneath the Pacific Ocean near the northern tip of the main island of Honshu. . . .
ATHENS, Greece (AP) A strong earthquake shook the Aegean Sea island of Crete on Sunday but caused no injuries or damage. The quake had a preliminary magnitude of 5.2 and occurred at 5:28 am (0328 GMT) on the sea floor 70 kilometers (44 miles) south of the Cretan port of Chania. The Athens seismological institute said the temblor’s epicenter
was located 380 kilometers (238 miles) south of the capital. No injuries or damage were reported.
Estimating Emission Probabilities
• Estimation for the “insertion” state:
Initial Topic Induction
Agglomerative clustering with cosine similarity measure
(Iyer & Ostendorf, 1996; Florian & Yarowsky, 1999; Barzilay & Elhadad, 2003)
The Athens seismological institute said the temblor’s epicenter was located 380 kilometers (238 miles) south of the capital.
Seismologists in Pakistan’s Northwest Frontier Province said the temblor’s epicenter was about 250 kilometers (155 miles) north of the provincial capital Peshawar.
The temblor was centered 60 kilometers (35 miles) northwest of the provincial capital of Kunming, about 2,200 kilometers (1,300 miles)
southwest of Beijing, a bureau seismologist said.
Model Construction
• Determining states, emission and transition
From Clusters to States
• Each large cluster constitutes a state
• Agglomerate small clusters into an “insert” state
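The initial topic induction and the cluster-to-state step can be sketched as follows. This is a minimal illustration under stated assumptions: sentences are bags of lowercased words, clustering is complete-link and stops at a similarity `threshold`, and a cluster counts as "large" when it has at least `min_size` members. The slides only specify agglomerative clustering with cosine similarity, so the linkage, the threshold, and all function names here are hypothetical.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def cluster_sentences(sentences, threshold):
    """Complete-link agglomerative clustering over sentence indices."""
    vectors = [Counter(s.lower().split()) for s in sentences]
    clusters = [[i] for i in range(len(sentences))]
    while len(clusters) > 1:
        best, best_pair = -1.0, None
        for x in range(len(clusters)):
            for y in range(x + 1, len(clusters)):
                # Complete link: similarity of the least similar pair of members.
                sim = min(cosine(vectors[i], vectors[j])
                          for i in clusters[x] for j in clusters[y])
                if sim > best:
                    best, best_pair = sim, (x, y)
        if best < threshold:
            break  # no remaining pair of clusters is similar enough
        x, y = best_pair
        clusters[x] += clusters.pop(y)
    return clusters

def clusters_to_states(clusters, min_size):
    """Keep each large cluster as a state; pool small ones into an 'insert' state."""
    states = [c for c in clusters if len(c) >= min_size]
    insert_state = [i for c in clusters if len(c) < min_size for i in c]
    return states, insert_state

# Toy sentences (not from the corpus used in the lecture).
sentences = [
    "the epicenter was located south of the capital",
    "the epicenter was located north of the capital",
    "no injuries or damage were reported",
    "no casualties or damage were reported",
    "officials held a press conference",
]
states, insert_state = clusters_to_states(
    cluster_sentences(sentences, threshold=0.5), min_size=2)
print(states, insert_state)  # [[0, 1], [2, 3]] [4]
```

The two location sentences and the two casualty sentences each form a state, while the unrelated sentence falls into the insert state.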
Information Ordering: Algorithm
Viterbi re-estimation
• Decode the training data with Viterbi decoding
• Use the new clustering as the input to the parameter
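The decoding step of Viterbi re-estimation can be sketched with a standard log-space Viterbi decoder, where each observation is one sentence and each state is one topic. The toy parameters below, and the use of a per-state unigram word distribution for emissions, are assumptions made for illustration only; they are not the emission model from the lecture.

```python
import math

def viterbi(observations, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence for `observations` under an HMM.

    log_start[s] and log_trans[s][t] are log-probabilities;
    log_emit(s, obs) returns the log-probability of `obs` in state s.
    """
    # best[k][s] = best log-score of any path ending in state s at position k
    best = [{s: log_start[s] + log_emit(s, observations[0]) for s in states}]
    back = [{}]
    for obs in observations[1:]:
        scores, pointers = {}, {}
        for t in states:
            prev = max(states, key=lambda s: best[-1][s] + log_trans[s][t])
            scores[t] = best[-1][prev] + log_trans[prev][t] + log_emit(t, obs)
            pointers[t] = prev
        best.append(scores)
        back.append(pointers)
    # Trace back from the best final state.
    state = max(states, key=lambda s: best[-1][s])
    path = [state]
    for pointers in reversed(back[1:]):
        state = pointers[state]
        path.append(state)
    return path[::-1]

# Hypothetical two-topic model with unigram emissions.
states = ["strength", "casualties"]
word_probs = {
    "strength": {"quake": 0.4, "magnitude": 0.5, "injuries": 0.1},
    "casualties": {"quake": 0.2, "magnitude": 0.1, "injuries": 0.7},
}
log_start = {"strength": math.log(0.8), "casualties": math.log(0.2)}
log_trans = {
    "strength": {"strength": math.log(0.3), "casualties": math.log(0.7)},
    "casualties": {"strength": math.log(0.4), "casualties": math.log(0.6)},
}

def log_emit(state, sentence):
    # Unseen words get a small floor probability instead of zero.
    return sum(math.log(word_probs[state].get(w, 1e-6)) for w in sentence)

path = viterbi([["quake", "magnitude"], ["injuries"]],
               states, log_start, log_trans, log_emit)
print(path)  # ['strength', 'casualties']
```

In re-estimation, the decoded state of each training sentence defines a new clustering, from which the parameters are estimated again.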
Application: Information Ordering
– Text summarization
– Natural Language Generation
“get married” prior to “give birth” (in some domains)
Estimating Transition Probabilities
g(ci, cj) is the number of adjacent sentences (ci, cj)
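One way to turn the adjacency counts g(ci, cj) into transition probabilities is a smoothed relative-frequency estimate. The sketch below assumes add-delta smoothing and counts adjacent sentence pairs within each document; the slide text does not show the estimator, so the formula, the `delta` value, and the function name are assumptions.

```python
from collections import Counter

def transition_probs(state_sequences, states, delta=0.1):
    """Estimate p(cj | ci) from adjacent-sentence counts with add-delta smoothing."""
    pair_counts = Counter()  # g(ci, cj): a ci sentence immediately followed by a cj sentence
    from_counts = Counter()  # how often a ci sentence is followed by any sentence
    for seq in state_sequences:
        for ci, cj in zip(seq, seq[1:]):
            pair_counts[(ci, cj)] += 1
            from_counts[ci] += 1
    m = len(states)
    return {
        ci: {cj: (pair_counts[(ci, cj)] + delta) / (from_counts[ci] + delta * m)
             for cj in states}
        for ci in states
    }

# Each inner list is the state sequence of one training document (toy data).
docs = [["strength", "location", "casualties"], ["strength", "casualties"]]
states = ["strength", "location", "casualties"]
probs = transition_probs(docs, states)
print(probs["strength"])
```

With smoothing, transitions never seen in training keep a small non-zero probability, and each row still sums to one.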
Baselines for Ordering
• “Straw” baseline: Bigram Language model
• “State-of-the-art” baseline: (Lapata, 2003)
– represent a sentence using lexico-syntactic
– compute pairwise ordering preferences
– find the optimal global order
Summarization: Algorithm
Input: source text
Training data: parallel corpus of summaries and source texts (aligned)
• Employ Viterbi on source texts and summaries
• Compute state likelihood to generate summary
• Given a new text, decode it and extract sentences
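The summarization steps can be sketched once every sentence has been assigned a state by Viterbi decoding. The sketch assumes that a state's likelihood of generating summary text is the fraction of training pairs in which the state appears in the summary, given that it appears in the source, and that the `k` sentences with the highest-scoring states are extracted. Both choices, and all names below, are illustrative assumptions.

```python
from collections import Counter

def summary_likelihood(aligned_pairs):
    """Estimate, per state, how often it appears in the summary when it appears in the source."""
    in_source, in_summary = Counter(), Counter()
    for source_states, summary_states in aligned_pairs:
        for state in set(source_states):
            in_source[state] += 1
            if state in summary_states:
                in_summary[state] += 1
    return {state: in_summary[state] / in_source[state] for state in in_source}

def extract_summary(sentences, decoded_states, likelihood, k):
    """Pick the k sentences whose states are most summary-like, kept in source order."""
    ranked = sorted(range(len(sentences)),
                    key=lambda i: -likelihood.get(decoded_states[i], 0.0))
    return [sentences[i] for i in sorted(ranked[:k])]

# Toy training data: (decoded states of a source text, decoded states of its summary).
pairs = [
    (["strength", "location", "history"], ["strength", "location"]),
    (["strength", "casualties", "history"], ["strength", "casualties"]),
    (["location", "history"], ["location"]),
]
likelihood = summary_likelihood(pairs)

new_text = ["The region has seen quakes before.",
            "The quake had a magnitude of 5.2.",
            "It struck south of the capital."]
new_states = ["history", "strength", "location"]  # assumed output of decoding
summary = extract_summary(new_text, new_states, likelihood, k=2)
print(summary)
```

Here the "history" state never appears in a training summary, so its sentence is left out.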
Application: Summarization
– specify types of important information
– use information extraction to identify this
• Domain-independent summarization: (Kupiec et
– represent a sentence using shallow features
– use a classifier
Evaluation: Data
Baselines for Summarization
• “Straw” baseline: n leading sentences
• “State-of-the-art” Kupiec-style classifier:
– Sentence representation: lexical features and
– Classifier: BoosTexter
Results: Summarization
Results: Ordering
Ordering: Learning Curve
Summarization: Learning Curve