Improving Morphology Induction by Learning Spelling Rules
Unsupervised learning of morphology is an important task for human learners and in natural language processing systems. Previous systems focus on segmenting words into substrings (taking ⇒ tak.ing), but sometimes a segmentation-only analysis is insufficient (e.g., taking may be more appropriately analyzed as take+ing, with a spelling rule accounting for the deletion of the stem-final e). In this paper, we develop a Bayesian model for simultaneously inducing both morphology and spelling rules, and show that the addition of spelling rules improves performance over the baseline segmentation-only model.

In natural language, words are often constructed from multiple morphemes, or meaning-bearing units, such as stems and suffixes. Identifying the morphemes within words is an important task both for human learners and in natural language processing (NLP) systems, where it can improve performance on a variety of tasks by reducing data sparsity [Goldwater and McClosky, 2005; Larkey et al., 2002]. Unsupervised learning of morphology is particularly interesting, both from a cognitive standpoint (because developing unsupervised systems may shed light on how humans perform this task) and for NLP (because morphological annotation is scarce or nonexistent in many languages). Existing systems, such as [Goldsmith, 2001] and [Creutz and Lagus, 2005], are relatively successful in segmenting words into constituent morphs (essentially, substrings), e.g. reporters ⇒ report.er.s. However, strategies based purely on segmentation of observed forms make systematic errors in identifying morphological relationships because many of these relationships are obscured by spelling rules that alter the observed forms of words.1 For example, most English verbs take -ing as the present continuous tense ending (walking), but after stems ending in e, the e is deleted (taking), while for some verbs, the final stem consonant is doubled (shutting, digging). A purely segmenting system will be forced to segment shutting as either shut.ting or shutt.ing. In the first case, shutting will be correctly identified as sharing a stem with words such as shut and shuts, but will not share a suffix with words such as walking and running. In the second case, the opposite will be true. In this paper, we present a Bayesian model of morphology that identifies the latent underlying morphological analysis of each word (shut+ing)2 along with spelling rules that generate the observed surface forms.

1 Human learners encounter an analogous problem with phonological rules that alter the observed forms of spoken words.
2 In what follows, we use '+' to indicate an underlying morpheme boundary, and '.' to indicate a surface segmentation.

Most current systems for unsupervised morphological analysis in NLP are based on various heuristic methods and perform segmentation only [Monson et al., 2004; Freitag, 2005; Dasgupta and Ng, 2006]; [Dasgupta and Ng, 2007] also infers some spelling rules. Although these can be effective, our goal is to investigate methods which can eventually be built into larger joint inference systems for learning multiple aspects of language (such as morphology, phonology, and syntax) in order to examine the kinds of structures and biases that are needed for successful learning in such a system. For this reason, we focus on probabilistic models rather than heuristic methods.

Previously, [Goldsmith, 2006] and [Goldwater and Johnson, 2004] have described model-based morphology induction systems that can account for some variations in morphs caused by spelling rules. Both systems are based on the Minimum Description Length principle and share certain weaknesses that we address here. In particular, due to their complex MDL objective functions, these systems incorporate special-purpose algorithms to search for the optimal morphological analysis of the input corpus. This raises the possibility that the search procedures themselves are influencing the results of these systems, and makes it difficult to extend the underlying models or incorporate them into larger systems other than through a strict 1-best pipelined approach. Indeed, each of these systems extends the segmentation-only system of [Goldsmith, 2001] by first using that system to identify a segmentation, and then (in a second step) finding spelling rules to simplify the original analysis. In contrast, the model presented here uses standard sampling methods for inference, and provides a way to simultaneously learn both morphological analysis and spelling rules, allowing information from each component to flow to the other during learning. We show that the addition of spelling rules allows our model to outperform the earlier segmentation-only Bayesian model of [Goldwater et al., 2006], on which it is based.

In the remainder of this paper, we begin by reviewing the baseline model from [Goldwater et al., 2006]. We then describe our extensions to it and the sampler we use for inference. We present experiments demonstrating that the combined morphology-spelling model outperforms the baseline. Finally, we discuss remaining sources of error in the system and how we might address them in the future.
Figure 1: Example output from the baseline system. Stem-final e is analyzed as a suffix (or part of one), so that the morphosyntactic relationships between pairs such as (abandon, abate) and (abandons, abates) are lost.

We take as our baseline the simple model of morphology described in [Goldwater et al., 2006], which generates a word w in three steps:

1. Choose a morphological class c for w.
2. Choose a stem t conditioned on c.
3. Choose a (possibly empty) suffix f conditioned on c.
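To make the generative process concrete, here is a minimal Python sketch of the three steps. The class, stem, and suffix probability tables are invented toy values for illustration, not parameters estimated by the model.

```python
import random

# Toy probability tables (hypothetical values, not learned ones).
P_CLASS = {"verb": 1.0}
P_STEM = {"verb": {"walk": 0.5, "shut": 0.3, "take": 0.2}}
P_SUFFIX = {"verb": {"": 0.4, "s": 0.2, "ed": 0.2, "ing": 0.2}}

def sample(dist):
    """Draw one outcome from a dict mapping outcome -> probability."""
    r, total = random.random(), 0.0
    for outcome, p in dist.items():
        total += p
        if r <= total:
            return outcome
    return outcome  # guard against floating-point rounding

def generate_word():
    c = sample(P_CLASS)        # 1. choose a morphological class
    t = sample(P_STEM[c])      # 2. choose a stem conditioned on the class
    f = sample(P_SUFFIX[c])    # 3. choose a (possibly empty) suffix conditioned on the class
    return c, t, f, t + f      # in the baseline model the surface form is simply t.f

print(generate_word())         # e.g. ('verb', 'walk', 'ing', 'walking')
```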
Since t and f are assumed to be conditionally independent given the class, the probability of a word w is found by summing over all stem-suffix combinations that can be concatenated to form w. This model is of course simplistic in its assumption that words may only consist of two morphs; however, for the test set of English verbs that was used by [Goldwater et al., 2006], two morphs is sufficient. A similar model that allows multiple morphs per word is described in [Goldsmith, 2001].

Goldwater et al. present the model above within a Bayesian framework in which the goal is to identify a high-probability sequence of classes, stems, and suffixes (c, t, f) given an observed sequence of words w. This is done using Bayes' rule:

P(c, t, f | w) ∝ P(w | c, t, f) P(c, t, f)

Note that the likelihood P(w | c, t, f) can take on only two possible values: 1 if the observed words are consistent with (c, t, f), and 0 otherwise. Thus, the prior distribution over analyses P(c, t, f) is crucial to inference. As in other model-based unsupervised morphology learning systems [Goldsmith, 2001; Creutz and Lagus, 2005], Goldwater et al. assume that sparse solutions – analyses containing fewer total stems and suffixes – should be preferred. This is done by placing symmetric Dirichlet priors over the multinomial distributions from which c, t, and f are drawn, where θc, θt|c, and θf|c are the multinomial parameters for classes, stems, and suffixes, and κ, τ, and φ are the respective Dirichlet hyperparameters. We discuss below the significance of the hyperparameters and how they can be used to favor sparse solutions. Under this model, the probability of a sequence of analyses is

P(c, t, f) = ∏_{i=1}^{N} P(ci | c−i) · P(ti | t−i, c−i, ci) · P(fi | f−i, c−i, ci)

where N is the total number of words and the notation x−i refers to x1 . . . xi−1. The probability of each factor is computed by integrating over the parameters associated with that factor; for example,

P(ci | c−i) = (n_ci^(−i) + κ) / (n^(−i) + κC)    (5)

where n_ci^(−i) is the number of occurrences of ci in c−i, n^(−i) is the length of c−i (= i − 1), and C is the total number of possible classes. The value of the integration is a standard result in Bayesian statistics [Gelman et al., 2004], and can be used (as Goldwater et al. do) to develop a Gibbs sampler for inference. We defer discussion of inference to Section 4.

While the model described above is effective in segmenting verbs into their stems and inflectional suffixes, such a segmentation misses certain linguistic generalizations, as described in the introduction and illustrated in Figure 1. In order to identify these generalizations, it is necessary to go beyond simple segmentation of the words in the input. In the following section, we describe an extension to the above generative model in which spelling rules apply after the stem and suffix are concatenated together, so that the stem and suffix of each word may not correspond exactly to a segmentation of that word.
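The integration behind Equation 5 has a simple closed form; the sketch below computes it for a suffix distribution. The counts and hyperparameter values are hypothetical, and F is the suffix inventory size reported later for the WSJ verb data. Smaller hyperparameters concentrate probability on items that have already been used, which is how the symmetric Dirichlet priors favor sparse solutions.

```python
from collections import Counter

def predictive_prob(item, others, alpha, num_outcomes):
    """Dirichlet-multinomial predictive probability in the form of Equation 5:
    (count of item in others + alpha) / (len(others) + alpha * num_outcomes)."""
    counts = Counter(others)
    return (counts[item] + alpha) / (len(others) + alpha * num_outcomes)

# Hypothetical suffix choices for the first few words of a corpus.
previous_suffixes = ["ing", "ed", "ing", "", "s", "ing"]
F = 5555  # number of possible suffixes reported later for the WSJ verb data

for phi in (0.001, 1.0):
    # With a small phi, a previously used suffix like "ing" gets most of the mass;
    # with a large phi, unused suffixes such as "ize" retain substantial probability.
    print(phi, predictive_prob("ing", previous_suffixes, phi, F),
          predictive_prob("ize", previous_suffixes, phi, F))
```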
To extend the baseline model, we introduce the notion of a spelling rule, inspired by the phonological rules of Chomsky and Halle [1968]. Each rule is characterized by a transformation and a context in which the transformation applies. We develop two models, one based on a two-character context formed with one left context character and one right context character, and the other based on a three-character context with an additional left context character. We assume that transformations only occur at morpheme boundaries, so the context consists of the final one or two characters of the (underlying) stem, and the first character of the suffix. For example, shut+ing, take+ing, and sleep+s have contexts ut_i, ke_i, and ep_s. Transformations can include insertions, deletions, or empty rules, and always apply to the position immediately preceding the morpheme boundary, i.e. deletions delete the stem-final character and insertions insert a character following the stem-final character.3 So the empty rule ε → ε / ep_s produces sleeps. Our new model extends the baseline generative process with two additional steps:

4. Choose the rule type y (insertion, deletion, empty) conditioned on x(f, t), the context defined by t and f.
5. Choose a transformation r conditioned on y and x(f, t).

which gives us the following joint probability:

P(c) P(t|c) P(f|c) P(y|x(f, t)) P(r|y, x(f, t))    (6)

As above, we place Dirichlet priors over the multinomial distributions from which y and r are chosen. Our expectations are that most rules should be empty (i.e., observed forms are usually the same as underlying forms), so we use a non-symmetric Dirichlet prior over rule types, with η = (ηD, ηI, ηE) being the hyperparameters over deletion, insertion, and empty rules, where ηE is set to a much larger value than ηD and ηI (we discuss this in more detail below). In addition, at most one or two different transformations should occur in any given context. We encourage this by using a small value for ρ, the hyperparameter of the symmetric Dirichlet prior over transformations.

3 Permitting arbitrary substitution rules allows too much freedom to the model and yields poor results; in future work we hope to achieve better results by using priors to constrain substitutions in sensible ways.
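The following sketch shows one way the contexts and transformations described above might be represented. The function names, the underscore-based context encoding, and the width parameter are our own illustration under those assumptions, not the authors' implementation.

```python
def context(stem, suffix, width=3):
    """Spelling-rule context: the final width-1 characters of the underlying stem
    plus the first character of the suffix (width=2 gives the two-character variant)."""
    return f"{stem[-(width - 1):]}_{suffix[:1]}"

def apply_rule(stem, suffix, rule_type, char=None):
    """Apply a transformation at the position immediately preceding the boundary."""
    if rule_type == "empty":      # surface form equals the underlying form
        return stem + suffix
    if rule_type == "delete":     # delete the stem-final character
        return stem[:-1] + suffix
    if rule_type == "insert":     # insert a character after the stem-final character
        return stem + char + suffix
    raise ValueError(rule_type)

print(context("shut", "ing"))                    # ut_i (three-character context)
print(context("shut", "ing", width=2))           # t_i  (two-character context)
print(apply_rule("take", "ing", "delete"))       # taking   (e-deletion)
print(apply_rule("shut", "ing", "insert", "t"))  # shutting (consonant doubling)
print(apply_rule("sleep", "s", "empty"))         # sleeps   (empty rule)
```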
We sample from the posterior distribution of our model P(c, t, f, y, r | w) using Gibbs sampling, a standard Markov chain Monte Carlo (MCMC) technique [Gilks et al., 1996]. Gibbs sampling involves repeatedly sampling the value of each variable in the model conditioned on the current values of all other variables. This process defines a Markov chain whose stationary distribution is the posterior distribution over model variables given the input data. Because the variables that define the analysis of a given word are highly dependent (only certain choices of t, f, y and r are consistent), we use blocked sampling to sample all variables for a single word at once. That is, we consider each word wi in the data in turn, consider all possible values of (c, t, f, y, r) comprising a consistent analysis A(wi) of wi, and compute the probability of each full analysis conditioned on the current analyses of all other words. We then sample an analysis for the current word according to this distribution and move on to the next word. After a suitable burn-in period, the sampler converges to sampling from the posterior distribution.

Computing the conditional probability of A(wi) is straightforward because the Dirichlet-multinomial distributions we have constructed our model from are exchangeable: the probability of a set of outcomes does not depend on their ordering. We can therefore treat each analysis as though it is the last one in the data set, and apply the same integration over parameters that led to Equation 5. The full sampling equations for our model are given in Figure 2.

P(A(wi) = (c, t, f, y, r) | A(w−i), κ, τ, φ, η, ρ)
    ∝ I(wi = r(t.f)) · P(c, t, f, y, r | A(w−i), κ, τ, φ, η, ρ)
    ∝ P(c | c−i, κ) · P(t | t−i, c, τ) · P(f | f−i, c, φ) · P(y | y−i, t, f, η) · P(r | r−i, t, f, y, ρ)

Figure 2: Equations used in sampling to compute the probability of the analysis A(wi) of wi, conditioned on A(w−i), the analyses of all other words in the data set. We use the notation x−i here to indicate x1 . . . xi−1, xi+1 . . . xN. I(.) is a function taking on the value 1 when its argument is true, and 0 otherwise. κ, τ, φ, η, ρ are the hyperparameters for the Dirichlet distributions associated with classes, stems, suffixes, rule types, and rules, respectively; and C, T, F, R specify the total number of possible values for classes, stems, suffixes, and rules. Note that for y = delete or empty, there is only one possible rule, so R = 1 and the final factor cancels out. For y = insert, R = 26.
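A rough sketch of the blocked sampling step for one word is given below. The enumeration of consistent analyses and the scoring follow the factorization in Figure 2, but several details are simplified for illustration: a single class is assumed, rule types and rules are not conditioned on the spelling context, one shared count total is used in the denominators, and the hyperparameter values and inventory sizes are toy numbers rather than the settings used in the experiments.

```python
import random
import string
from collections import Counter

# Toy hyperparameters and inventory sizes (illustrative only).
TAU, PHI, RHO = 0.001, 0.001, 0.001
ETA = {"empty": 5.0, "delete": 0.001, "insert": 0.001}
T, F, R_INSERT = 10_000, 100, 26

def consistent_analyses(word, min_stem=3):
    """All (stem, suffix, rule_type, rule) tuples whose surface realization is `word`."""
    out = []
    for i in range(min_stem, len(word) + 1):
        stem, suffix = word[:i], word[i:]
        out.append((stem, suffix, "empty", ""))
        for ch in string.ascii_lowercase:   # the underlying stem ended in a deleted character
            out.append((stem + ch, suffix, "delete", ch))
        if i < len(word):                   # word[i] was inserted after the stem-final character
            out.append((stem, word[i + 1:], "insert", word[i]))
    return out

def dm(count, n, alpha, size):
    """Dirichlet-multinomial predictive probability, as in Equation 5."""
    return (count + alpha) / (n + alpha * size)

def score(analysis, state):
    stem, suffix, y, r = analysis
    p = dm(state["stems"][stem], state["n"], TAU, T)
    p *= dm(state["suffixes"][suffix], state["n"], PHI, F)
    p *= (state["rule_types"][y] + ETA[y]) / (state["n"] + sum(ETA.values()))
    if y == "insert":                       # only insertions have R = 26 possible rules
        p *= dm(state["rules"][(y, r)], state["rule_types"][y], RHO, R_INSERT)
    return p

def sample_analysis(word, state):
    options = consistent_analyses(word)
    weights = [score(a, state) for a in options]
    return random.choices(options, weights=weights, k=1)[0]

# Hypothetical counts summarizing the current analyses of all other words.
state = {"stems": Counter({"take": 3, "shutt": 1}),
         "suffixes": Counter({"ing": 4, "": 2, "s": 1}),
         "rule_types": Counter({"empty": 5, "delete": 2}),
         "rules": Counter(),
         "n": 7}
print(sample_analysis("taking", state))
```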
Our model contains a number of hyperparameters. Rather than setting these by hand, we optimize them by maximizing the posterior probability of each hyperparameter given all other variables in the model. For example, to maximize τ we compute

τ = argmax_τ P(τ | κ, φ, ρ, ηE, ηI, ηD, c, t, f, y, r),

which can be done iteratively.
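As a rough illustration of this kind of update, the sketch below chooses τ by maximizing the Dirichlet-multinomial marginal probability of the current stem assignments over a coarse grid of candidate values. The grid search, the implicit flat hyperprior, and the toy counts are all our assumptions; they stand in for whatever iterative optimizer the full system uses.

```python
import math
from collections import Counter

def log_dm_marginal(counts, alpha, size):
    """Log marginal probability of observed counts under a symmetric
    Dirichlet(alpha) prior over `size` possible outcomes."""
    n = sum(counts.values())
    ll = math.lgamma(alpha * size) - math.lgamma(alpha * size + n)
    for c in counts.values():
        ll += math.lgamma(alpha + c) - math.lgamma(alpha)
    return ll

def optimize_alpha(counts, size, grid):
    """Return the grid value that maximizes the marginal probability."""
    return max(grid, key=lambda a: log_dm_marginal(counts, a, size))

# Hypothetical current stem counts and candidate values for tau.
stem_counts = Counter({"take": 40, "walk": 35, "shut": 10, "report": 15})
T = 7306988  # number of possible stems for the WSJ verb data (see below)
print(optimize_alpha(stem_counts, T, [10 ** k for k in range(-6, 1)]))
```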
In this section, we describe the experiments used to test our morphological induction system. We begin by discussing our input data sets, then present two distinct evaluation methods, and finally describe the results of our experiments.

For input data we use the same data set used by [Goldwater et al., 2006], the set of 7487 English verbs found in the Penn Wall Street Journal (WSJ) corpus [Marcus et al., 1993]. English verbs provide a good starting point for evaluating our system because they contain many regular patterns, but also a number of orthographic transformations. We do not include frequency information in our input corpus; this is standard in morphology induction and has both psychological [Pierrehumbert, 2003] and mathematical justifications [Goldwater et al., 2006].

Although evaluation measures based solely on a gold standard surface segmentation are sometimes used, it should be clear from our introduction that this kind of measure is not sufficient for our purposes. Instead, we use two different evaluation measures based on the underlying morphological structure of the data. Both of our evaluation methods use the English portion of the CELEX database [Baayen et al., 1995] to determine correctness. It contains morphological analyses of 160,594 different inflected wordforms based on 52,446 uninflected lemmata. Each morphological analysis includes both a surface segmentation as well as an abstract morphosyntactic analysis which provides the functional role of any inflectional suffixes. For example, the word walking is segmented as walk.ing, and is accompanied by a pe label to denote the suffix's role in marking it as a present tense (e) participle (p). See Table 1 for further examples.

Our first evaluation method is based on the pairwise relational measure used in the recent PASCAL challenge on unsupervised morpheme analysis.4 Consider the proposed analysis walk+ing and its corresponding gold standard entry 50655+pe. Assuming that this analysis is correct, any other correct analysis that shares the stem walk should also share the same stem ID 50655, and likewise for the suffixes. By comparing the pairwise relationships in the system output and the gold standard, we can compute pairwise precision (PP) as the proportion of proposed pairs that are correct, and pairwise recall (PR) as the proportion of true pairs that are correctly identified. This is reported separately for stems and suffixes, along with the F-measures of each, calculated as the harmonic mean of precision and recall.

Our second evaluation method is designed to more directly test the correctness of underlying forms by using the analyses provided in CELEX to reconstruct an underlying form (UF) for each surface form. To identify the underlying stem for a word, we use the lemma ID number, which is the same for all inflected forms and specifies the canonical dictionary form, which is identical to the stem. To identify the underlying suffix, we map each of the suffix functional labels to a canonical string representation. Specifically, pe ⇒ ing, a1S ⇒ ed, e3S ⇒ s, and all other labels are mapped to the empty string ε. When the CELEX surface segmentation of an inflected form has an empty suffix, indicating an irregular form such as forgot.ε, we use the surface segmentation as the UF. We can then compute underlying form accuracy (UFA) for stems as the proportion of found stems that match those in the UFs, and likewise for suffixes.

4 http://www.cis.hut.fi/morphochallenge2007/

Table 1: An example illustrating the resources used for evaluation and our two scoring methods. We suppose that Found is the analysis found by the system. CX string is the segmentation of the surface form given in CELEX. CX abstract is the abstract morpheme analysis given in CELEX (with each stem represented by a unique ID, and each suffix represented by a code such as pe for present participle), used to compute pairwise precision (PP) and pairwise recall (PR). UF string is the underlying string representation we derived based on the two CELEX representations (see text), used to compute UF accuracy (UFA). UF strings that do not match those found by the system are shown in bold. In this example, scores for stems are 10/13 (UFA), 8/10 (PP), and 8/15 (PR). Scores for suffixes are 11/13 (UFA), 9/12 (PP), and 9/16 (PR).
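A minimal sketch of the two scoring methods as we read them from the description above; the toy analyses and the details of pair construction (unordered pairs of word tokens that share a stem or suffix label) are our assumptions.

```python
from itertools import combinations

def shared_pairs(labels):
    """Unordered pairs of word indices that share the same label (stem or suffix)."""
    return {frozenset(p) for p in combinations(range(len(labels)), 2)
            if labels[p[0]] == labels[p[1]]}

def pairwise_pr(found, gold):
    fp, gp = shared_pairs(found), shared_pairs(gold)
    precision = len(fp & gp) / len(fp) if fp else 0.0
    recall = len(fp & gp) / len(gp) if gp else 0.0
    return precision, recall

def ufa(found, gold):
    """Underlying form accuracy: proportion of found forms matching the gold UFs."""
    return sum(f == g for f, g in zip(found, gold)) / len(gold)

# Hypothetical analyses for four words: taking, takes, shutting, walked.
found_stems = ["take", "take", "shutt", "walk"]
gold_stems  = ["take", "take", "shut",  "walk"]
found_sufs  = ["ing", "s", "ing", "ed"]
gold_sufs   = ["ing", "s", "ing", "ed"]

print(pairwise_pr(found_stems, gold_stems))   # stem PP, PR
print(ufa(found_stems, gold_stems))           # stem UFA = 0.75
print(pairwise_pr(found_sufs, gold_sufs), ufa(found_sufs, gold_sufs))
```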
Our inference procedure alternates between sampling the variables in the model and updating the hyperparameters. For both the baseline and spelling-rule system, we ran the algorithm for 5 epochs, with each epoch containing 10 iterations of sampling and 10 iterations of hyperparameter updates. Although it is possible to automatically learn values for all of the hyperparameters in the model, we chose to set the values of the hyperparameters over rule types by hand to reflect our intuitions that empty rules should be far more prevalent than insertions or deletions. That is, the hyperparameter for empty rules ηE should be relatively high, while the hyperparameters determining insertion and deletion rules, ηI and ηD, should be low (and, for simplicity, we assume they are equal). Results reported here use ηE = 5, ηI = ηD = .001 (although other similar values yield similar results). All other hyperparameters were learned.
The remaining model parameters are either determined by the data or set by hand. For the WSJ verbs data set the number of possible stems, T = 7,306,988, and the number of possible suffixes, F = 5,555, are calculated by enumerating all possible segmentations of the words in the data set and accounting for every possible rule transformation. We set the number of classes C = 1 and the minimum stem length to three characters. Enforcing a minimum stem length ensures that even in the case of the most minimal stem and the application of an insertion rule, the underlying stem will still have two characters to form the left context.

Quantitative results for the two systems are shown in Table 2, with examples of full analyses shown in Figure 3 and the most commonly inferred spelling rules in Figure 4. Overall, the augmented models dramatically outperform the baseline on the UFA stem metric, which is not surprising considering that it is the introduction of rules that allows these models to correctly capture stems that may have been improperly segmented in the baseline (Figure 1).

However, the baseline performs better on suffix UFA by a fair margin. There are at least two contributing factors causing this. First, the addition of spelling rules allows the model to explain some suffixes in alternate undesirable ways. For instance, the -ed suffix is often analyzed as a -d suffix with an e-insertion rule, or, as in the case of symbolized, analyzed as a -d suffix with an s-deletion rule. The latter case is somewhat attributable to data sparsity, where the base form, symbolize, is not found in the data. In these circumstances it can be preferable to analyze symbolizes as symbolizes+ε with an empty rule, and symbolized as symbolizes+d with the erroneous s-deletion rule (Figure 4), so that they share the same stem. These analyses would not be likely using a larger data set.

Second, the presence of derivational verbs in the data is a contributing factor because they are not analyzed correctly in the inflectional verbs section of CELEX, which forms our gold standard. Consider that the baseline provides the most succinct analysis of suffixes, positing just four (-ε, -s, -ed, and -ing), whereas the three-character-context model induces five (the same four with the addition of -d). The two-character-context model, the worst-performing system on suffix UFA, learns an additional five suffixes (-e, -es, -n, -ize, and -ized). Not all of these additional forms are unreasonable; -ize and -n are both valid suffixes, and -ized is the remainder of a correct segmentation. However, because suffixes like -ize are derivational (they change the part-of-speech of the root they attach to), they are not considered as part of the canonical dictionary of our gold standard. In this situation the UFA metric therefore provides an upper bound for the baseline, but a lower bound for the augmented models.

The pairwise metrics are also susceptible to this problem, but continue to support the conclusions reached previously on overall system performance. The baseline slightly outperforms the three-character-context model in stem PP, but compares quite poorly in stem PR, and in stem PF. It again performs better than the augmented models on suffix tasks. Worth noting is that the errors made according to this metric are a small set of very pervasive mistakes. For instance, improperly segmenting -ed suffixes as -d suffixes or segmenting a stem-final e as its own suffix together contribute to more than half of all erroneous suffixes proposed by this model.

In addition to improved performance on the morphology induction task, our system also produces a probabilistic representation of the phonology of a language in the spelling rules it learns. The most frequently learned rules (Figure 4) are largely correct, with just two spurious rules induced. While many of these are linguistically redundant because of the overspecification of their contexts, most refer to valid, desirable orthographic rules. Examples of these are e-deletion in various contexts (state+ing ⇒ stating), e-insertions (pass+s ⇒ passes), and consonant doubling when taking the -ing suffix (forget+ing ⇒ forgetting, spam+ing ⇒ spamming).

Figure 3: Induced Analyses. Incorrect analyses are shaded.

Figure 4: Commonly Induced Rules by Frequency.

Table 2: Performance of the baseline model and two augmented models, measured using pairwise precision (PP), pairwise recall (PR), pairwise F-measure (PF), and underlying form accuracy (UFA).

As we noted in the introduction, one of the difficulties of unsupervised morphology induction is that spelling rules often act to obscure the morphological analyses of the observed words. A few previous model-based systems have tried to deal with this, but only by first segmenting the corpus into morphs, and then trying to identify spelling rules to simplify the analysis. To our knowledge, this is the first work to present a probabilistic model using a joint inference procedure to simultaneously induce both morphological analyses and spelling rules. Our results are promising: our model is able to identify morphological analyses that produce more accurate stems than the baseline while also inducing a number of spelling rules that correctly characterize the transformations involved.
Of course, our model is still somewhat preliminary in several respects. For example, a single stem and suffix is insufficient to capture the morphological complexity of many languages (including English), and substitution rules should ideally be allowed along with deletions and insertions. Extending the model to allow for these possibilities would create many more potential analyses, making it more difficult to identify appropriate solutions. However, there are also many sensible constraints that could be placed on the system that we have yet to explore. In particular, aside from assuming that empty rules are more likely than others, we placed no particular expectations on the kinds of rules that should occur. However, assuming some rough knowledge of the pronunciation of different letters (or a phonological transcription), it would be possible to use our priors to encode the kinds of transformations that are more likely to occur (e.g., vowels to vowels, consonants to phonologically similar consonants). We hope to pursue this line of work in future research.

The authors would like to thank Hanna Wallach for useful discussions regarding hyperparameter inference.

[Baayen et al., 1995] R. Baayen, R. Piepenbrock, and L. Gulikers. The CELEX lexical database (release 2), 1995.

[Chomsky and Halle, 1968] N. Chomsky and M. Halle. The Sound Pattern of English. Longman Higher Education, 1968.

[Creutz and Lagus, 2005] M. Creutz and K. Lagus. Inducing the morphological lexicon of a natural language from unannotated text. In Proceedings of the International and Interdisciplinary Conference on Adaptive Knowledge Representation and Reasoning (AKRR '05), 2005.

[Dasgupta and Ng, 2006] S. Dasgupta and V. Ng. Unsupervised morphological parsing of Bengali. Language Resources and Evaluation, 40(3-4), 2006.

[Dasgupta and Ng, 2007] S. Dasgupta and V. Ng. High-performance, language-independent morphological segmentation. In Proceedings of NAACL-HLT, 2007.

[Freitag, 2005] D. Freitag. Morphology induction from term clusters. In Proceedings of the Ninth Conference on Computational Natural Language Learning (CoNLL '05), 2005.

[Gelman et al., 2004] A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian Data Analysis. Chapman and Hall/CRC, 2004.

[Gilks et al., 1996] W.R. Gilks, S. Richardson, and D. J. Spiegelhalter, editors. Markov Chain Monte Carlo in Practice. Chapman and Hall, Suffolk, 1996.

[Goldsmith, 2001] J. Goldsmith. Unsupervised learning of the morphology of a natural language. Computational Linguistics, 27(2):153–198, 2001.

[Goldsmith, 2006] J. Goldsmith. An algorithm for the unsupervised learning of morphology. Journal of Natural Language Engineering, 12(3):1–19, 2006.

[Goldwater and Johnson, 2004] S. Goldwater and M. Johnson. Priors in Bayesian learning of phonological rules. In Proceedings of the Seventh Meeting of the ACL Special Interest Group in Computational Phonology (SIGPHON '04), 2004.

[Goldwater and McClosky, 2005] S. Goldwater and D. McClosky. Improving statistical MT through morphological analysis. In Proceedings of Empirical Methods in Natural Language Processing, Vancouver, 2005.

[Goldwater et al., 2006] S. Goldwater, T. Griffiths, and M. Johnson. Interpolating between types and tokens by estimating power-law generators. In Advances in Neural Information Processing Systems 18, 2006.

[Larkey et al., 2002] L. Larkey, L. Ballesteros, and M. Connell. Improving stemming for Arabic information retrieval: Light stemming and co-occurrence analysis. In Proceedings of the 25th International Conference on Research and Development in Information Retrieval (SIGIR), 2002.

[Marcus et al., 1993] M. Marcus, B. Santorini, and M. A. Marcinkiewicz. Building a large annotated corpus of English: the Penn Treebank. Computational Linguistics, 19(2), 1993.

[Monson et al., 2004] C. Monson, A. Lavie, J. Carbonell, and L. Levin. Unsupervised induction of natural language morphology inflection classes. In Proceedings of the Seventh Meeting of the ACL Special Interest Group in Computational Phonology (SIGPHON '04), pages 52–61, 2004.

[Pierrehumbert, 2003] J. Pierrehumbert. Probabilistic phonology: Discrimination and robustness. In R. Bod, J. Hay, and S. Jannedy, editors, Probabilistic Linguistics. MIT Press, 2003.