# (1) Calculate the longest common subsequence for n strings # by MLCS = Longest(Subsequence(s1) & ... & Subsequence(sn)); # (2) Create a sequence of random bracketings of this string, #-separated, e.g. # [k][t][b]#[k][t][b]#...#[k][t][b] or # [kt][b]#[kt][b]#...#[kt][b] or # [ktb]#[ktb]#...#[ktb] ... # we do this by MarkRoot(MLCS) # (3) Now insert arbitrary material outside brackets in this set of strings, yielding e.g. # X[kt]X[b]X#X[kt]X[b]X#...#X[kt]X[b]X , where X = ?* # call this "RepeatedPatterns". # Optimization: # (a) note that we can force the bracketings to be the same in each word, like: # X[kt]X[b]X#...#X[kt]X[b]X (we call this RepeatedPatternsEQ) # (b) or we can simply not care where things like # X[k][t]X[b]X#...#X[kt]X[b]X (we call this RepeatedPatterns) # would be allowed. # Using (b), we can calculate everything much faster and only filter with # RepeatedPatternsEQ at the very end. However, this does _not_ guarantee a result. # However, if we get a result, it's guaranteed to be correct. So we make two calls # perl-side: using (b) first, and if that doesn't work out, using (a). # (4) Separately, take the paradigm (with each entry #-separated), and randomly bracket it: # [ka]ta[b]a#sta[kt]ab[at]#...#[kuu]tib # call this BracketedWordSeq, calculated by RandomBracketing(WordSeq) # (5) Intersect RepeatedPatterns and BracketedWordSeq yielding bracketed # sequences of the MLCS in the paradigm, e.g. # [k]a[t]a[b]a#sta[k][t]a[b]at#[k]uu[t]i[b] # (6) This set, while being almost what we want, can be ambiguous # and we'd like to choose from all valid strings, the string that contains # the longest continuous bracketed sequences, i.e. prefer [kt] over [k][t], etc. # Of course, this is only possible if [kt] always appears consecutively # in *every* word of the paradigm. We solve this by worsening, i.e. # from the set X, calculate X - Worsen(X), where worsen shortens at least one # continuous bracketed sequence, i.e. does mappings like ...[ktb]... -> ...[kt][b]... # We call the worsener Filter1. # (7) While the worsening in (6) catches almost all suboptimal bracketings, there is one type # of bracketing that it fails to catch, consider: # opt. [compr]ar#...#[compr]a case(a) # sub. [comp]ra[r]#...#[comp][r]a case(b) # Consider worsening (a): there is no way to worsen (a) to yield (b) by shortening # bracketed material, i.e. to produce [comp]ra[r] from [compr]ar. However, this is only # true globally for the string. Locally, for some suboptimal substring in (b), worsening # will work. For example, the second substring [comp][r]a (b) can be produced from [compr]a (a) # To take advantage of this, we need a more advanced worsener, one that first randomly moves brackets # in all subwords except one, and then worsens this one entry, allowing us to produce string (b) # from string (a). Since this worsener [Filter2] is more complex and general, we run it after # running Filter1, since this produces gains in speed. That is, Filter1 described in (6) is really # unnecessary, but can be seen as a crude filter that makes the calculations for Filter2 faster. # (8) We also implement (perl-side), a filter for breaking further ties, such as: # #..[dreg]e[l]..# vs. #..[drege]l..# by totaling the number of infixed characters used in the paradigm. # In the above, we have 1 for the former and 0 for the latter word form, for example. # (9) Some residual ties are left, which may need further refinement for achieving a total order on the # grouping of the LCS. For example, patterns like: # #...[kat][o]sta...# vs. #...[ka][to]sta...# might need further disambiguation... define RedupN(X,Y) `[`[_eq([LEFT X RIGHT [Y LEFT X RIGHT]*], LEFT, RIGHT), LEFT,0],RIGHT,0]; define Subsequence(X) [X .o. [?|?:0]*].l; define Longest(X) X - [[X .o. ?:a* ?:0+].l .o. a:?*].l; define NOBR ? - %[ - %] - %#; define Worsen(X) [[X .o. [?* \[%[|%]] 0:%] 0:%[ \[%]|%[]+ %] ?*]*] .o. X].l; define Filter1(X) X .o. ~[Worsen(X)]; define Worsen2(X) [[X .o. 0:%# ?* 0:%# .o. ?* %# [NOBR | %[:%< | %]:%>]* %# ?* .o. %[|%] -> 0 .o. [?* 0:%> 0:%< \[%<|%>|%[|%] ]+ %> ?*]* .o. %#:0 ?* %#:0 .o. 0 -> %[|%] , %< -> %[ , %> -> %]] .o. X].l; define Filter2(X) X .o. ~[Worsen2(X)]; define Markup(X) [X .o. [..] -> %" || [.#.|%#] _ NOBR , NOBR _ %[ , %] _ NOBR , NOBR _ [.#.|%#] .o. %[|%] -> %+ .o. %+ -> 0 || [.#.|%#] _ , _ [.#.|%#] .o. %+ %++ -> %+].l; define MarkRoot(X) [X .o. ?+ -> %[ ... %] .o. ~$[ %| \%] ] ].l; define RandomBracketing(X) [X .o. [? | 0:%[ NOBR* 0:%]]*].l; define AddExtra(X) [X .o. [0:NOBR* | %[ \%]* %] | %#]* ].l;