Improving Morphology Induction by Learning Spelling Rules

Abstract

Unsupervised learning of morphology is an important task for human learners and in natural language processing systems. Previous systems focus on segmenting words into substrings (taking ⇒ tak.ing), but sometimes a segmentation-only analysis is insufficient (e.g., taking may be more appropriately analyzed as take+ing, with a spelling rule accounting for the deletion of the stem-final e). In this paper, we develop a Bayesian model for simultaneously inducing both morphology and spelling rules, and show that the addition of spelling rules improves performance over the baseline segmentation-only model.

1 Introduction

In natural language, words are often constructed from multiple morphemes, or meaning-bearing units, such as stems and suffixes. Identifying the morphemes within words is an important task both for human learners and in natural language processing (NLP) systems, where it can improve performance on a variety of tasks by reducing data sparsity [Goldwater and McClosky, 2005; Larkey et al., 2002]. Unsupervised learning of morphology is particularly interesting, both from a cognitive standpoint (because developing unsupervised systems may shed light on how humans perform this task) and for NLP (because morphological annotation is scarce or nonexistent in many languages). Existing systems, such as [Goldsmith, 2001] and [Creutz and Lagus, 2005], are relatively successful in segmenting words into constituent morphs (essentially, substrings), e.g. reporters ⇒ report.er.s. However, strategies based purely on segmentation of observed forms make systematic errors in identifying morphological relationships, because many of these relationships are obscured by spelling rules that alter the observed forms of words.1 For example, most English verbs take -ing as the present continuous tense ending (walking), but after stems ending in e, the e is deleted (taking), while for some verbs the final stem consonant is doubled (shutting, digging). A purely segmenting system will be forced to segment shutting as either shut.ting or shutt.ing. In the first case, shutting will be correctly identified as sharing a stem with words such as shut and shuts, but will not share a suffix with words such as walking and running. In the second case, the opposite will be true. In this paper, we present a Bayesian model of morphology that identifies the latent underlying morphological analysis of each word (shut+ing)2 along with spelling rules that generate the observed surface forms.

Most current systems for unsupervised morphological analysis in NLP are based on various heuristic methods and perform segmentation only [Monson et al., 2004; Freitag, 2005; Dasgupta and Ng, 2006]; [Dasgupta and Ng, 2007] also infers some spelling rules. Although these can be effective, our goal is to investigate methods which can eventually be built into larger joint inference systems for learning multiple aspects of language (such as morphology, phonology, and syntax), in order to examine the kinds of structures and biases that are needed for successful learning in such a system. For this reason, we focus on probabilistic models rather than heuristic approaches.

Previously, [Goldsmith, 2006] and [Goldwater and Johnson, 2004] have described model-based morphology induction systems that can account for some variations in morphs caused by spelling rules. Both systems are based on the Minimum Description Length principle and share certain weaknesses that we address here. In particular, due to their complex MDL objective functions, these systems incorporate special-purpose algorithms to search for the optimal morphological analysis of the input corpus. This raises the possibility that the search procedures themselves are influencing the results of these systems, and makes it difficult to extend the underlying models or incorporate them into larger systems other than through a strict 1-best pipelined approach. Indeed, each of these systems extends the segmentation-only system of [Goldsmith, 2001] by first using that system to identify a segmentation, and then (in a second step) finding spelling rules to simplify the original analysis. In contrast, the model presented here uses standard sampling methods for inference, and provides a way to simultaneously learn both morphological analyses and spelling rules, allowing information from each component to flow to the other during learning. We show that the addition of spelling rules allows our model to outperform the earlier segmentation-only Bayesian model of [Goldwater et al., 2006], on which it is based.

In the remainder of this paper, we begin by reviewing the baseline model from [Goldwater et al., 2006]. We then describe our extensions to it and the sampler we use for inference. We present experiments demonstrating that the combined morphology-spelling model outperforms the baseline. Finally, we discuss remaining sources of error in the system and how we might address them in the future.

1 Human learners encounter an analogous problem with phonological rules that alter the observed forms of spoken words.
2 In what follows, we use '+' to indicate an underlying morpheme boundary, and '.' to indicate a surface segmentation.
2 Baseline Model

Figure 1: Example output from the baseline system. Stem-final e is analyzed as a suffix (or part of one), so that the morphosyntactic relationships between pairs such as (abandon, abate) and (abandons, abates) are lost.

We take as our baseline the simple model of morphology described in [Goldwater et al., 2006], which generates a word w in three steps:

1. Choose a morphological class c for w.
2. Choose a stem t conditioned on c.
3. Choose a (possibly empty) suffix f conditioned on c.

Since t and f are assumed to be conditionally independent given c, the probability of a word is obtained by summing over all stem-suffix combinations that can be concatenated to form w. This model is of course simplistic in its assumption that words may consist of only two morphs; however, for the test set of English verbs that was used by [Goldwater et al., 2006], two morphs is sufficient. A similar model that allows multiple morphs per word is described in [Goldsmith, 2001].

Goldwater et al. present the model above within a Bayesian framework in which the goal is to identify a high-probability sequence of classes, stems, and suffixes (c, t, f) given an observed sequence of words w. This is done using Bayes' rule:

P(c, t, f | w) ∝ P(w | c, t, f) P(c, t, f)

Note that the likelihood P(w | c, t, f) can take on only two possible values: 1 if the observed words are consistent with the hypothesized analyses, and 0 otherwise. The prior distribution over analyses P(c, t, f) is therefore crucial to inference. As in other model-based unsupervised morphology learning systems [Goldsmith, 2001; Creutz and Lagus, 2005], Goldwater et al. assume that sparse solutions – analyses containing fewer total stems and suffixes – should be preferred. This is done by placing symmetric Dirichlet priors over the multinomial distributions from which c, t, and f are drawn, where θc, θt|c, and θf|c are the multinomial parameters for classes, stems, and suffixes, and κ, τ, and φ are the respective Dirichlet hyperparameters. We discuss below the significance of the hyperparameters and how they can be used to favor sparse solutions. Under this model, the probability of a sequence of analyses is obtained by integrating out the multinomial parameters, giving a product of predictive distributions:

P(c, t, f) = Π_{i=1..N} P(ci | c−i) · P(ti | t−i, c−i, ci) · P(fi | f−i, c−i, ci)

where N is the total number of words and the notation x−i indicates x1 . . . xi−1. The probability of each factor is computed by integrating over the parameters associated with that distribution; for example,

P(ci = c | c−i) = (n_c^(−i) + κ) / (n^(−i) + κC)    (5)

where n_c^(−i) is the number of occurrences of c in c−i, n^(−i) is the length of c−i (= i − 1), and C is the total number of possible classes. The value of the integration is a standard result in Bayesian statistics [Gelman et al., 2004], and can be used (as Goldwater et al. do) to develop a Gibbs sampler for inference. We defer discussion of inference to Section 4.
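To make the predictive probabilities concrete, here is a minimal sketch (ours, not the authors' code) of the Dirichlet-multinomial predictive distribution of Equation 5 applied to toy stem and suffix counts; the counts, hyperparameter values, and vocabulary sizes are invented for illustration.

```python
# Illustrative sketch only (not from the paper): the Dirichlet-multinomial
# predictive probability used in the baseline model, where the probability of
# the next outcome depends on counts over previously generated outcomes.

from collections import Counter

def predictive_prob(outcome, counts, alpha, num_outcomes):
    """P(outcome | previous outcomes) under a symmetric Dirichlet(alpha) prior.

    counts: Counter over previously observed outcomes (the x_{-i}).
    num_outcomes: total number of possible outcomes (e.g., C, T, or F).
    """
    n = sum(counts.values())
    return (counts[outcome] + alpha) / (n + alpha * num_outcomes)

# Toy example: probability of analyzing the next word as stem 'walk' + suffix 'ing'
# given counts from previously analyzed words (class variable omitted, i.e. C = 1).
stem_counts = Counter({"walk": 3, "shut": 2})
suffix_counts = Counter({"ing": 2, "s": 2, "": 1})
tau, phi = 0.5, 0.5          # hypothetical hyperparameter values
T, F = 1000, 50              # hypothetical numbers of possible stems/suffixes

p = predictive_prob("walk", stem_counts, tau, T) * \
    predictive_prob("ing", suffix_counts, phi, F)
print(p)
```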
3 Spelling Rules

While the model described above is effective in segmenting verbs into their stems and inflectional suffixes, such a segmentation misses certain linguistic generalizations, as described in the introduction and illustrated in Figure 1. In order to identify these generalizations, it is necessary to go beyond simple segmentation of the words in the input. In this section, we describe an extension to the above generative model in which spelling rules apply after the stem and suffix are concatenated together, so that the stem and suffix of each word may not correspond exactly to a segmentation of its surface form.

To extend the baseline model, we introduce the notion of a spelling rule, inspired by the phonological rules of Chomsky and Halle [1968]. Each rule is characterized by a transformation and a context in which the transformation applies. We develop two models, one based on a two-character context formed with one left context character and one right context character, and the other based on a three-character context with an additional left context character. We assume that transformations only occur at morpheme boundaries, so the context consists of the final one or two characters of the (underlying) stem and the first character of the suffix. For example, shut+ing, take+ing, and sleep+s have (three-character) contexts ut_i, ke_i, and ep_s. Transformations can include insertions, deletions, or empty rules, and always apply to the position immediately preceding the morpheme boundary, i.e. deletions delete the stem-final character and insertions insert a character following the stem-final character.3 So, for example, the empty rule ε → ε / ep_s applied to sleep+s produces sleeps. Our new model extends the baseline generative process with two additional steps:

4. Choose the rule type y (insertion, deletion, empty) conditioned on x(f, t), the context defined by t and f.
5. Choose a transformation r conditioned on y and x(f, t).

This gives us the following joint probability for a single word analysis:

P(c) P(t|c) P(f|c) P(y|x(f, t)) P(r|y, x(f, t))    (6)

As above, we place Dirichlet priors over the multinomial distributions from which y and r are chosen. Our expectation is that most rules should be empty (i.e., observed forms are usually the same as underlying forms), so we use a non-symmetric Dirichlet prior over rule types, with η = (ηD, ηI, ηE) being the hyperparameters over deletion, insertion, and empty rules, where ηE is set to a much larger value than ηD and ηI (we discuss this in more detail below). In addition, at most one or two different transformations should occur in any given context. We encourage this by using a small value for ρ, the hyperparameter of the symmetric Dirichlet prior over transformations.

3 Permitting arbitrary substitution rules allows too much freedom to the model and yields poor results; in future work we hope to achieve better results by using priors to constrain substitutions.
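As a rough illustration of the rule mechanics described above (the representation and function names are our own, not the paper's), the following sketch extracts a context and applies an insertion, deletion, or empty rule at the morpheme boundary.

```python
# Minimal sketch (our own naming and representation) of a spelling rule with a
# two-left-character + one-right-character context, applied at the morpheme
# boundary of an underlying stem+suffix analysis.

def context(stem, suffix, left_chars=2):
    """Context x(f, t): last `left_chars` characters of the stem + first character of the suffix."""
    return stem[-left_chars:] + "_" + suffix[:1]

def apply_rule(stem, suffix, rule_type, char=""):
    """Produce the surface form from an underlying analysis.

    rule_type: 'empty' (concatenate), 'delete' (drop the stem-final character),
    or 'insert' (insert `char` after the stem-final character).
    """
    if rule_type == "empty":
        return stem + suffix
    if rule_type == "delete":
        return stem[:-1] + suffix
    if rule_type == "insert":
        return stem + char + suffix
    raise ValueError(rule_type)

print(context("take", "ing"))                      # ke_i
print(apply_rule("take", "ing", "delete"))         # taking
print(apply_rule("shut", "ing", "insert", "t"))    # shutting
print(apply_rule("sleep", "s", "empty"))           # sleeps
```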
4 Inference

We sample from the posterior distribution of our model P(c, t, f, y, r | w) using Gibbs sampling, a standard Markov chain Monte Carlo (MCMC) technique [Gilks et al., 1996]. Gibbs sampling involves repeatedly sampling the value of each variable in the model conditioned on the current values of all other variables. This process defines a Markov chain whose stationary distribution is the posterior distribution over model variables given the input data. Because the variables that define the analysis of a given word are highly dependent (only certain choices of t, f, y and r are consistent), we use blocked sampling to sample all variables for a single word at once. That is, we consider each word wi in the data in turn, consider all possible values of (c, t, f, y, r) comprising a consistent analysis A(wi) of wi, and compute the probability of each full analysis conditioned on the current analyses of all other words. We then sample an analysis for the current word according to this distribution and move on to the next word. After a suitable burn-in period, the sampler converges to sampling from the posterior distribution.

Computing the conditional probability of A(wi) is straightforward because the Dirichlet-multinomial distributions from which we have constructed our model are exchangeable: the probability of a set of outcomes does not depend on their ordering. We can therefore treat each analysis as though it is the last one in the data set, and apply the same integration over parameters that led to Equation 5. The full sampling equations are given in Figure 2:

P(A(wi) = (c, t, f, y, r) | A(w−i), κ, τ, φ, η, ρ)
  ∝ I(wi = r(t.f)) · P(c, t, f, y, r | A(w−i), κ, τ, φ, η, ρ)
  ∝ P(c | c−i, κ) · P(t | t−i, c, τ) · P(f | f−i, c, φ) · P(y | y−i, t, f, η) · P(r | r−i, t, f, y, ρ)

Figure 2: Equations used in sampling to compute the probability of the analysis A(wi) of wi, conditioned on A(w−i), the analyses of all other words in the data set. We use the notation x−i here to indicate x1 . . . xi−1, xi+1 . . . xN. I(.) is a function taking on the value 1 when its argument is true, and 0 otherwise. κ, τ, φ, η, ρ are the hyperparameters for the Dirichlet distributions associated with classes, stems, suffixes, rule types, and rules, respectively; and C, T, F, R specify the total number of possible values for classes, stems, suffixes, and rules. Note that for y = delete or empty, there is only one possible rule, so R = 1 and the final factor cancels out. For y = insert, R = 26.

Our model contains a number of hyperparameters. Rather than setting these by hand, we optimize them by maximizing the posterior probability of each hyperparameter given all other variables in the model. For example, to maximize τ we solve

τ = argmax P(τ | κ, φ, ρ, ηE, ηI, ηD, c, t, f, y, r)

which can be optimized iteratively.
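The blocked sampling step can be pictured roughly as follows; this is our own schematic simplification (a single symmetric hyperparameter per factor, class variable omitted since C = 1 in our experiments), not the authors' implementation.

```python
# Schematic sketch (our own simplification, not the authors' code) of one blocked
# Gibbs step: score every consistent analysis of a word using Dirichlet-multinomial
# predictive probabilities and sample one in proportion to its score.

import random
from collections import Counter

def predictive(count, total, alpha, K):
    # Dirichlet-multinomial predictive probability, as in Equation 5.
    return (count + alpha) / (total + alpha * K)

def sample_analysis(word, candidates, counts, hypers, sizes):
    """candidates: list of dicts with keys 'stem', 'suffix', 'rule_type', 'rule',
    all of which generate `word`.  counts / hypers / sizes: per-factor Counters,
    Dirichlet hyperparameters, and numbers of possible outcomes."""
    scores = []
    for a in candidates:
        p = 1.0
        for factor in ("stem", "suffix", "rule_type", "rule"):
            c = counts[factor]
            p *= predictive(c[a[factor]], sum(c.values()),
                            hypers[factor], sizes[factor])
        scores.append(p)
    total = sum(scores)
    return random.choices(candidates, weights=[s / total for s in scores])[0]
```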
5 Experiments

In this section, we describe the experiments used to test our morphological induction system. We begin by discussing our input data sets, then present two distinct evaluation methods, and finally describe the results of our experiments.

5.1 Data

For input data we use the same data set used by [Goldwater et al., 2006], the set of 7487 English verbs found in the Penn Wall Street Journal (WSJ) corpus [Marcus et al., 1993]. English verbs provide a good starting point for evaluating our system because they contain many regular patterns, but also a number of orthographic transformations. We do not include frequency information in our input corpus; this is standard in morphology induction and has both psychological [Pierrehumbert, 2003] and mathematical justifications [Goldwater et al., 2006].

5.2 Evaluation

Although evaluation measures based solely on a gold standard surface segmentation are sometimes used, it should be clear from our introduction that this kind of measure is not sufficient for our purposes. Instead, we use two different evaluation measures based on the underlying morphological structure of the data. Both of our evaluation methods use the English portion of the CELEX database [Baayen et al., 1995] to determine correctness. It contains morphological analyses of 160,594 different inflected wordforms based on 52,446 uninflected lemmata. Each morphological analysis includes both a surface segmentation and an abstract morphosyntactic analysis which provides the functional role of any inflectional suffixes. For example, the word walking is segmented as walk.ing, and is accompanied by a pe label to denote the suffix's role in marking it as a present tense (e) participle (p). See Table 1 for further examples.

Table 1: An example illustrating the resources used for evaluation and our two scoring methods. We suppose that Found is the analysis found by the system. CX string is the segmentation of the surface form given in CELEX. CX abstract is the abstract morpheme analysis given in CELEX (with each stem represented by a unique ID, and each suffix represented by a code such as pe for present participle), used to compute pairwise precision (PP) and pairwise recall (PR). UF string is the underlying string representation we derived based on the two CELEX representations (see text), used to compute UF accuracy (UFA). UF strings that do not match those found by the system are shown in bold. In this example, scores for stems are 10/13 (UFA), 8/10 (PP), and 8/15 (PR). Scores for suffixes are 11/13 (UFA), 9/12 (PP), and 9/16 (PR).

Our first evaluation method is based on the pairwise relational measure used in the recent PASCAL challenge on unsupervised morpheme analysis.4 Consider the proposed analysis walk+ing and its corresponding gold standard entry 50655+pe. Assuming that this analysis is correct, any other correct analysis that shares the stem walk should also share the same stem ID 50655, and likewise for the suffixes. By comparing the pairwise relationships in the system output and the gold standard, we can compute pairwise precision (PP) as the proportion of proposed pairs that are correct, and pairwise recall (PR) as the proportion of true pairs that are correctly identified. These are reported separately for stems and suffixes, along with the F-measure of each, calculated as the harmonic mean of PP and PR.

Our second evaluation method is designed to more directly test the correctness of underlying forms by using the analyses provided in CELEX to reconstruct an underlying form (UF) for each surface form. To identify the underlying stem for a word, we use the lemma ID number, which is the same for all inflected forms and specifies the canonical dictionary form, which is identical to the stem. To identify the underlying suffix, we map each of the suffix functional labels to a canonical string representation. Specifically, pe ⇒ ing, a1S ⇒ ed, e3S ⇒ s, and all other labels are mapped to the empty string ε. When the CELEX surface segmentation of an inflected form has an empty suffix, indicating an irregular form such as forgot.ε, we use the surface segmentation as the UF. We can then compute underlying form accuracy (UFA) for stems as the proportion of found stems that match those in the UFs, and likewise for suffixes.

5.3 Procedure

Our inference procedure alternates between sampling the variables in the model and updating the hyperparameters. For both the baseline and spelling-rule systems, we ran the algorithm for 5 epochs, with each epoch containing 10 iterations of sampling and 10 iterations of hyperparameter updates. Although it is possible to automatically learn values for all of the hyperparameters in the model, we chose to set the values of the hyperparameters over rule types by hand to reflect our intuition that empty rules should be far more prevalent than insertions or deletions. That is, the hyperparameter for empty rules ηE should be relatively high, while the hyperparameters determining insertion and deletion rules, ηI and ηD, should be low (and, for simplicity, we assume they are equal). Results reported here use ηE = 5, ηI = ηD = .001 (although other similar values yield similar results). All other hyperparameters were learned.

The remaining model parameters are either determined by the data or set by hand. For the WSJ verbs data set, the number of possible stems, T = 7,306,988, and the number of possible suffixes, F = 5,555, are calculated by enumerating all possible segmentations of the words in the data set and accounting for every possible rule transformation. We set the number of classes C = 1 and the minimum stem length to three characters. Enforcing a minimum stem length ensures that even in the case of the most minimal stem and the application of an insertion rule, the underlying stem will still have two characters to form the left context.

4 http://www.cis.hut.fi/morphochallenge2007/
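For concreteness, here is a small sketch of the pairwise measure described in Section 5.2, run on hypothetical toy data; only the stem ID 50655 for walk comes from the paper, and the rest (including the helper names) is invented for illustration.

```python
# Illustrative sketch (our own, not the evaluation code used in the paper) of the
# pairwise stem measure: a proposed pair of words is correct if the two words also
# share a stem ID in the gold standard.

from itertools import combinations

def pairs(assignment):
    """Set of unordered word pairs that share a label under `assignment` (word -> label)."""
    return {frozenset(p) for p in combinations(assignment, 2)
            if assignment[p[0]] == assignment[p[1]]}

def pairwise_scores(found_stems, gold_stem_ids):
    found, gold = pairs(found_stems), pairs(gold_stem_ids)
    pp = len(found & gold) / len(found) if found else 0.0   # pairwise precision
    pr = len(found & gold) / len(gold) if gold else 0.0     # pairwise recall
    pf = 2 * pp * pr / (pp + pr) if pp + pr else 0.0        # pairwise F-measure
    return pp, pr, pf

# Hypothetical toy input: proposed stems vs. gold CELEX stem IDs.
found = {"walking": "walk", "walks": "walk", "taking": "tak", "takes": "take"}
gold = {"walking": 50655, "walks": 50655, "taking": 48010, "takes": 48010}
print(pairwise_scores(found, gold))
```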
5.4 Results

Quantitative results for the two systems are shown in Table 2, with examples of full analyses shown in Figure 3 and the most commonly inferred spelling rules in Figure 4. Overall, the augmented models dramatically outperform the baseline on the UFA stem metric, which is not surprising considering that it is the introduction of rules that allows these models to correctly capture stems that may have been improperly segmented by the baseline (Figure 1).

Table 2: Performance of the baseline model and two augmented models, measured using pairwise precision (PP), pairwise recall (PR), pairwise F-measure (PF), and underlying form accuracy (UFA).

Figure 3: Induced analyses. Incorrect analyses are shaded.

Figure 4: Commonly induced rules, by frequency.

However, the baseline performs better on suffix UFA by a fair margin. There are at least two contributing factors causing this. First, the addition of spelling rules allows the model to explain some suffixes in alternate, undesirable ways. For instance, the -ed suffix is often analyzed as a -d suffix with an e-insertion rule, or, as in the case of symbolized, analyzed as a -d suffix with an s-deletion rule. The latter case is somewhat attributable to data sparsity: the base form, symbolize, is not found in the data. In these circumstances it can be preferable to analyze these as symbolizes+ε with an empty rule and symbolizes+d with the erroneous s-deletion rule (Figure 4), so that they share the same stem. These analyses would not be likely using a larger data set.

Second, the presence of derivational verbs in the data is a contributing factor, because they are not analyzed correctly in the inflectional verbs section of CELEX, which forms our gold standard. Consider that the baseline provides the most succinct analysis of suffixes, positing just four (-ε, -s, -ed, and -ing), whereas the three-character-context model induces five (the same four with the addition of -d). The two-character-context model, the worst-performing system on suffix UFA, learns an additional five suffixes (-e, -es, -n, -ize, and -ized). Not all of these additional forms are unreasonable; -ize and -n are both valid suffixes, and -ized is the remainder of a correct segmentation. However, because suffixes like -ize are derivational (they change the part-of-speech of the root they attach to), they are not considered part of the canonical dictionary forms of our gold standard. In this situation the UFA metric therefore provides an upper bound for the baseline, but a lower bound for the augmented models.

The pairwise metrics are also susceptible to this problem, but continue to support the conclusions reached previously on overall system performance. The baseline slightly outperforms the three-character-context model in stem PP, but compares quite poorly in stem PR and stem PF. It again performs better than the augmented models on the suffix measures. Worth noting is that the errors made according to this metric are a small set of very pervasive mistakes. For instance, improperly segmenting -ed suffixes as -d suffixes or segmenting a stem-final e as its own suffix together contribute to more than half of all erroneous suffixes proposed by this model.

In addition to improved performance on the morphology induction task, our system also produces a probabilistic representation of the phonology of a language in the spelling rules it learns. The most frequently learned rules (Figure 4) are largely correct, with just two spurious rules induced. While many of these rules are linguistically redundant because of the overspecification of their contexts, most refer to valid, desirable orthographic rules. Examples are e-deletion in various contexts (state+ing ⇒ stating), e-insertion (pass+s ⇒ passes), and consonant doubling when taking the -ing suffix (forget+ing ⇒ forgetting, spam+ing ⇒ spamming).
6 Conclusion

As we noted in the introduction, one of the difficulties of unsupervised morphology induction is that spelling rules often act to obscure the morphological analyses of the observed words. A few previous model-based systems have tried to deal with this, but only by first segmenting the corpus into morphs and then trying to identify spelling rules to simplify the analysis. To our knowledge, this is the first work to present a probabilistic model using a joint inference procedure to simultaneously induce both morphological analyses and spelling rules. Our results are promising: our model is able to identify morphological analyses that produce more accurate stems than the baseline, while also inducing a number of spelling rules that correctly characterize these transformations.

Of course, our model is still somewhat preliminary in several respects. For example, a single stem and suffix is insufficient to capture the morphological complexity of many languages (including English), and substitution rules should ideally be allowed along with deletions and insertions. Extending the model to allow for these possibilities would create many more potential analyses, making it more difficult to identify appropriate solutions. However, there are also many sensible constraints that could be placed on the system that we have yet to explore. In particular, aside from assuming that empty rules are more likely than others, we placed no particular expectations on the kinds of rules that should occur. However, assuming some rough knowledge of the pronunciation of different letters (or a phonological transcription), it would be possible to use our priors to encode the kinds of transformations that are more likely to occur (e.g., vowels to vowels, consonants to phonologically similar consonants). We hope to pursue this line of work in future research.

Acknowledgments

The authors would like to thank Hanna Wallach for useful discussions regarding hyperparameter inference.

References

[Baayen et al., 1995] R. Baayen, R. Piepenbrock, and L. Gulikers. The CELEX lexical database (release 2), 1995.

[Chomsky and Halle, 1968] N. Chomsky and M. Halle. The Sound Pattern of English. Longman Higher Education, 1968.

[Creutz and Lagus, 2005] M. Creutz and K. Lagus. Inducing the morphological lexicon of a natural language from unannotated text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR '05), 2005.

[Dasgupta and Ng, 2006] S. Dasgupta and V. Ng. Unsupervised morphological parsing of Bengali. Language Resources and Evaluation, 40(3-4), 2006.

[Dasgupta and Ng, 2007] S. Dasgupta and V. Ng. High-performance, language-independent morphological segmentation. In Proceedings of NAACL-HLT, 2007.

[Freitag, 2005] D. Freitag. Morphology induction from term clusters. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL '05), 2005.

[Gelman et al., 2004] A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian Data Analysis. Chapman & Hall/CRC, 2nd edition, 2004.

[Gilks et al., 1996] W. R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. Chapman and Hall, Suffolk, 1996.

[Goldsmith, 2001] J. Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153–198, 2001.

[Goldsmith, 2006] J. Goldsmith. An algorithm for the unsupervised learning of morphology. Journal of Natural Language Engineering, 12(3):1–19, 2006.

[Goldwater and Johnson, 2004] S. Goldwater and M. Johnson. Priors in Bayesian learning of phonological rules. In Proceedings of the Seventh Meeting of the ACL Special Interest Group in Computational Phonology (SIGPHON '04), 2004.

[Goldwater and McClosky, 2005] S. Goldwater and D. McClosky. Improving statistical MT through morphological analysis. In Proceedings of Empirical Methods in Natural Language Processing, Vancouver, 2005.

[Goldwater et al., 2006] S. Goldwater, T. Griffiths, and M. Johnson. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems 18, 2006.

[Larkey et al., 2002] L. Larkey, L. Ballesteros, and M. Connell. Improving stemming for Arabic information retrieval: Light stemming and co-occurrence analysis. In Proceedings of the 25th International Conference on Research and Development in Information Retrieval (SIGIR), 2002.

[Marcus et al., 1993] M. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2):313–330, 1993.

[Monson et al., 2004] C. Monson, A. Lavie, J. Carbonell, and L. Levin. Unsupervised induction of natural language morphology inflection classes. In Proceedings of the Seventh Meeting of the ACL Special Interest Group in Computational Phonology (SIGPHON '04), pages 52–61, 2004.

[Pierrehumbert, 2003] J. Pierrehumbert. Probabilistic phonology: Discrimination and robustness. In R. Bod, J. Hay, and S. Jannedy, editors, Probabilistic Linguistics. MIT Press, 2003.

Source: http://people.cs.umass.edu/~narad/_papers/ijcai09.pdf
