Fix the dictionary proliferation problem
Summary
Consolidate the separate per-application spell-checking dictionaries in the OS into a single shared set used through hunspell.
Owners
Current status
- Targeted release: Fedora 9
- Last modified: 2008-04-07
- Percentage of completion: 100%
- This is complete: all major applications and the default GNOME/KDE spell checking now go through hunspell. All that remains is to package dictionaries for the less-used languages that do not yet have a sufficiently active Fedora-using language community to have taken up packaging a dictionary for their language.
Usage cases/rationale
We have separate dictionaries for each language for OpenOffice.org, Firefox, Thunderbird, and aspell (which GNOME and KDE use). This is dumb.
Benefit to Fedora
We get code reuse, a smaller distribution, and a decreased memory footprint.
Scope
Requires changing the OpenOffice.org, Thunderbird, Firefox, and dictionary packages.
Test Plan
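Test spell checking in every affected application (OpenOffice.org, Firefox, Thunderbird, and the GNOME and KDE applications listed under Details) and confirm that each one uses the shared hunspell dictionaries.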
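One part of this can be checked without opening any application: whether the shared dictionaries are installed as complete pairs. The following is a minimal Python sketch, not part of the feature itself. It relies on the fact that a hunspell dictionary consists of a `.dic` word list and a matching `.aff` affix file. The directory `/usr/share/myspell` is an assumption about where the system dictionaries live, and the function name `find_hunspell_dictionaries` is made up for illustration.

```python
from pathlib import Path


def find_hunspell_dictionaries(dict_dir):
    """Return the sorted names that have both a .dic and an .aff file.

    A name with only one of the two files is treated as incomplete
    and left out of the result.
    """
    root = Path(dict_dir)
    if not root.is_dir():
        # Nothing installed (or the assumed path is wrong on this system).
        return []
    dic_names = {p.stem for p in root.glob("*.dic")}
    aff_names = {p.stem for p in root.glob("*.aff")}
    return sorted(dic_names & aff_names)


if __name__ == "__main__":
    # Assumed location of the system dictionaries; adjust if it differs.
    for name in find_hunspell_dictionaries("/usr/share/myspell"):
        print(name)
```

An empty result for a language that should be supported suggests the corresponding hunspell dictionary package is missing, which is worth ruling out before testing the applications themselves.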
Dependencies
None.
Details
- Split out hunspell from OpenOffice.org - rhbz#214764 complete
- Make OpenOffice.org use it - rhbz#214764 complete
- Split out the dictionaries into separate packages - rhbz#218769 (english) complete
- Make OpenOffice.org use system dictionaries - complete
- Make gedit/xchat use it, i.e. enchant. By default enchant already generally prefers hunspell over aspell; it just needs to be told where the dictionaries are - complete
- Make evolution use it, i.e. gnome-spell. gnome-spell can be patched to use enchant to achieve this - rhbz#426347 complete
- Make tomboy/pidgin use it, i.e. gtkspell. Same story as gnome-spell - rhbz#245888 complete
- Make Firefox (and other gecko apps) use it - rhbz#218762 complete, upstream state is now resolved
- Make KDE use enchant and/or hunspell - complete - KDE 4 already defaults to enchant in Sonnet. (For K3Spell, see "legacy KSpell" below.) The aspell backend was dropped entirely in Rawhide. For kdelibs3:
- The legacy KSpell uses command-line spellcheckers. Kevin Kofler wrote a patch to support hunspell, and kde-settings in Rawhide was changed to make it the default.
- The newer KSpell2 API is plugin-based and uses libraries. It is what KDE 4's Sonnet is based on. Kevin Kofler backported Sonnet's enchant backend. The aspell and ispell backends were dropped in Rawhide.
- See the fedora-devel-list message.
- Remove copy of hunspell from enchant - rhbz#426402 complete
- Remove copy of hunspell from xulrunner complete
- Split enchant to have a separate enchant-aspell rpm to enable optionally removing the aspell support - rhbz#426402 complete
- Prefer hunspell over aspell as the default for install in comps. See table below for mismatches in language support. rhbz#439037 complete
- Repackage/replace the aspell dictionaries with hunspell dictionaries - 80% complete, see table below for language support
Optional
- Write an aspell compatibility layer so aspell apps can use the same dictionaries - no volunteer, so deferred. Is this necessary at all? All major desktop apps now work out of the box.
- Make vim use hunspell - rhbz#219777, patch available. Not necessary as long as vim continues not to use any spell checking, but preferred over introducing the built-in vim spell checker, which has yet another format that hunspell dictionaries have to be converted to for use.
Dictionaries
1. Language Support Matrix (glibc upwards)
Language Code | Language | aspell | hunspell | notes |
aa | Afar | afarfriends.org hosted ALSEC report. | ||
af | Afrikaans | aspell-af | hunspell-af | |
am | Amharic | available | And one for non-commercial use | |
an | Aragonese | www.iea.es, see Spain: Lexicography In Iberian Languages | ||
ar | Arabic | aspell-ar | hunspell-ar | |
as | Assamese | www.xobdo.net | ||
ast | Asturian | www.academiadelallingua.com, see Spain: Lexicography In Iberian Languages | ||
az | Azeri | available | ||
be | Belarusian | hunspell-be | ||
ber | Amazigh (Tifinagh) | hunspell-ber | ||
ber | Amazigh (Latin) | |||
bg | Bulgarian | aspell-bg | hunspell-bg | |
bn | Bengali | aspell-bn | hunspell-bn | |
bo | Tibetan | bo.openoffice.org. Latest language support update. | ||
br | Breton | aspell-br | hunspell-br | |
bs | Bosnian | From a pure spelling-dictionary point of view, would there be differences from hunspell-hr ? | ||
byn | Blin | Blin Orthography: A History and an Assessment | ||
ca | Catalan | aspell-ca | hunspell-ca | |
crh | Crimean Tatar | corpus | ||
cs | Czech | aspell-cs | hunspell-cs | |
csb | Kashubian | available | ||
cy | Welsh | aspell-cy | hunspell-cy | |
da | Danish | aspell-da | hunspell-da | |
de | German | aspell-de | hunspell-de | |
dz | Dzongkha | crubadan corpus building | ||
el | Greek | aspell-el | hunspell-el | |
en | English | aspell-en | hunspell-en | |
es | Spanish | aspell-es | hunspell-es | |
et | Estonian | hunspell-ee | ||
eu | Basque | hunspell-eu | ||
fa | Farsi | available | ||
fi | Finnish | The Finnish community has a parallel Voikko solution, with an enchant backend, an OpenOffice.org extension, and a Firefox extension. | ||
fil | Filipino | hunspell-tl | Filipino is effectively an official Tagalog-based language | |
fo | Faeroese | aspell-fo | hunspell-fo | |
fr | French | aspell-fr | hunspell-fr | |
fur | Friulian | hunspell-fur | ||
fy | Frisian | hunspell-fy | ||
ga | Irish | aspell-ga | hunspell-ga | |
gd | Scots Gaelic | aspell-gd | hunspell-gd | |
gez | Ge'ez | Ge'ez Frontier Foundation | ||
gl | Galician | aspell-gl | hunspell-gl | |
gu | Gujarati | aspell-gu | hunspell-gu | |
gv | Manx | crubadan | ||
ha | Hausa | crubadan possible wordlist, www.dictionary.kasahorow.com | ||
he | Hebrew | aspell-he | hunspell-he | |
hi | Hindi | aspell-hi | hunspell-hi | |
hr | Croatian | aspell-hr | hunspell-hr | |
hsb | Upper Sorbian | hunspell-hsb | ||
hu | Hungarian | hunspell-hu | ||
hy | Armenian | hunspell-hy | ||
id | Indonesian | aspell-id | hunspell-id | |
ig | Igbo | crubadan, www.dictionary.kasahorow.com | ||
ik | Inupiaq | Iñupiaq parser project. Broken download link to MSWord dictionary | ||
is | Icelandic | aspell-is | hunspell-is | |
it | Italian | aspell-it | hunspell-it | |
iu | Inuktitut | www.livingdictionary.com | ||
ja | Japanese | |||
ka | Georgian | ka.openoffice.org | ||
kk | Kazakh | available | ||
kl | Kalaallisut | Greenlandic parser project | ||
km | Khmer | hunspell-km | ||
kn | Kannada | BharateeyaOO.o | ||
ko | Korean | |||
ku | Kurdish (Latin) | hunspell-ku | ||
ku | Kurdish (Arabic) | |||
kw | Cornish | crubadan corpus building | ||
ky | Kyrgyz | OOo localization beginnings | ||
lg | Luganda | A general translation effort. | ||
li | Limburgish | crubadan corpus building | ||
lo | Lao | Lao OOo localization | ||
lt | Lithuanian | hunspell-lt | ||
lv | Latvian | hunspell-lv | ||
mai | Maithili | maithiliacademy.org | ||
mg | Malagasy | hunspell-mg | ||
mi | Maori | hunspell-mi | ||
mk | Macedonian | hunspell-mk | ||
ml | Malayalam | aspell-ml | hunspell-ml | |
mn | Mongolian | available | ||
mr | Marathi | aspell-mr | hunspell-mr | |
ms | Malay | hunspell-ms | ||
mt | Maltese | rhbz#467183 | ||
nb | Bokmaal | aspell-no | hunspell-nb | |
nds | Lowlands Saxon | hunspell-nds | ||
ne | Nepali | hunspell-ne | ||
nl | Dutch | aspell-nl | hunspell-nl | |
nn | Nynorsk | aspell-no | hunspell-nn | |
nr | Ndebele (Southern) | hunspell-nr | ||
nso | Sotho (Northern) | hunspell-nso | ||
oc | Occitan | hunspell-oc | ||
om | Oromo | crubadan corpus building. Oromo wiki entry | ||
or | Oriya | aspell-or | hunspell-or | |
pa | Punjabi | aspell-pa | hunspell-pa | |
pap | Papiamento | crubadan corpus building | ||
pl | Polish | aspell-pl | hunspell-pl | |
pt | Portuguese | aspell-pt | hunspell-pt | |
ro | Romanian | hunspell-ro | ||
ru | Russian | aspell-ru | hunspell-ru | |
rw | Kinyarwanda | hunspell-rw | ||
sa | Sanskrit | An apparent effort to create a Sanskrit hunspell dictionary | ||
sc | Sardinian | rhbz#467182 | ||
se | Sami, Northern | available | A colossal 50Megs | |
shs | Secwepemctsin | www.native-languages.org | ||
si | Sinhala | A very small wordlist | ||
sid | Sidamo | Some info | ||
sk | Slovak | aspell-sk | hunspell-sk | |
sl | Slovenian | aspell-sl | hunspell-sl | |
so | Somali | An apparent effort to create a Somali hunspell dictionary | ||
sq | Albanian | hunspell-sq | ||
sr | Serbian | aspell-sr | hunspell-sr | |
ss | Swati | hunspell-ss | ||
st | Sotho (Southern) | hunspell-st | ||
sv | Swedish | aspell-sv | hunspell-sv | |
ta | Tamil | aspell-ta | hunspell-ta | |
te | Telugu | aspell-te | hunspell-te | |
tg | Tajik | An apparent effort to create a Tajik hunspell dictionary | ||
th | Thai | hunspell-th | ||
ti | Tigrigna | non-commercial use | ||
tig | Tigre | crubadan corpus building | ||
tk | Turkmen | hunspell-tk | ||
tl | Tagalog | hunspell-tl | ||
tn | Tswana | hunspell-tn | ||
tr | Turkish | available | But, like Finnish with Voikko, the typical solution for Turkish has been the Zemberek library, with an enchant backend, an OpenOffice.org extension, and a Firefox extension. | |
ts | Tsonga | hunspell-ts | ||
tt | Tatar | available | Hard to see where this came from originally, and what exactly its license is (GPLv2+?). Perhaps it is an original work of ALT Linux, and that is actually the canonical upstream? | |
ug | Uyghur | www.uyghurdictionary.org | ||
uk | Ukrainian | hunspell-uk | ||
ur | Urdu | hunspell-ur | ||
uz | Uzbek | hunspell-uz | ||
ve | Venda | hunspell-ve | ||
vi | Vietnamese | hunspell-vi | ||
wa | Walloon | hunspell-wa | ||
wo | Wolof | www.alfanet.anafa.org make Wolof localizations of Firefox and Abiword. www.dictionary.kasahorow.com | ||
xh | Xhosa | hunspell-xh | ||
yi | Yiddish | The uspell spell-checker | ||
yo | Yoruba | An apparent effort to create a Yoruba hunspell dictionary. www.dictionary.kasahorow.com | ||
zh | Chinese | |||
zu | Zulu | hunspell-zu |
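The aspell/hunspell mismatch that the Details section refers to can be read off the matrix mechanically. The following is a minimal Python sketch, assuming rows are pipe-delimited as above and that packaged dictionaries are recognizable by the `aspell-` and `hunspell-` prefixes. The function names and the `xx` row are hypothetical and exist only for illustration. Cells such as "available" or a bug number count as not yet packaged.

```python
def classify_row(row):
    """Classify one pipe-delimited matrix row by the packages it names."""
    cells = [cell.strip() for cell in row.split("|")]
    code = cells[0]
    # Look at every cell after the language name, since empty cells may
    # have been collapsed and column positions cannot be relied on.
    has_aspell = any(c.startswith("aspell-") for c in cells[2:])
    has_hunspell = any(c.startswith("hunspell-") for c in cells[2:])
    if has_aspell and has_hunspell:
        status = "both"
    elif has_hunspell:
        status = "hunspell-only"
    elif has_aspell:
        status = "aspell-only"
    else:
        status = "unpackaged"
    return code, status


def aspell_only(rows):
    """Return the codes of languages that would lose coverage without aspell."""
    return [code for code, status in map(classify_row, rows)
            if status == "aspell-only"]


SAMPLE_ROWS = [
    "af | Afrikaans | aspell-af | hunspell-af | |",
    "be | Belarusian | hunspell-be | ||",
    "ja | Japanese | |||",
    "xx | Example | aspell-xx | ||",  # hypothetical row, not in the matrix
]

if __name__ == "__main__":
    for row in SAMPLE_ROWS:
        print(classify_row(row))
```

Languages reported by `aspell_only` are the ones that would need a hunspell dictionary before aspell could be dropped from the default install.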
2. Language Support Matrix (extra OOo recognized not in glibc)
Language Code | Language | hunspell | notes | |
ak | Akan | available | ||
bm | Bambara | Online Dictionary | ||
brx | Bodo | Online Dictionary | ||
cop | Coptic | available | ||
cv | Chuvash | From this forum looks like there was a cv_RU-1.00.zip but download site is gone/down. | ||
dgo | Dogri | |||
dv | Dhivehi | |||
ee | Ewe | |||
eo | Esperanto | available | ||
fj | Fijian | available | ||
gsc | Gascon | available | ||
gug | Guarani | |||
ha | Hausa | |||
hil | Hiligaynon | |||
ia | Interlingua | available | ||
ks | Kashmiri | |||
kok | Konkani | |||
la | Latin | available | ||
lb | Luxembourgish | available | ||
lg | Ganda | |||
ln | Lingala | |||
mos | Moore | |||
mni | Manipuri | |||
my | Burmese | |||
ny | Nyanja | |||
qu | Quechua | available | ||
rm | Raeto-Romance | |||
sat | Santali | |||
sd | Sindhi | |||
sg | Sango | |||
sjd | Sami, Kildin | |||
sma | Sami, Southern | |||
smj | Sami, Lule | |||
smn | Sami, Inari | |||
sms | Sami, Skolt | |||
sw | Swahili | available | ||
tet | Tetun | available |
User experience
Should not affect user experience.
Contingency plan
Continue to ship older dictionaries.
Documentation
Release Notes
There is a new default spell checking back-end, hunspell, for both the GNOME and KDE desktops, as well as applications such as OpenOffice.org, Firefox, and other XULRunner-based applications. This common back-end includes a set of shared, multi-lingual dictionaries for use with hunspell. This feature uses a single set of common dictionaries regardless of the application, which gives consistent suggestions for misspelled words and uses less disk space by eliminating duplicate dictionaries.
Comments
Note that JDS is going down this route as well
The OpenOffice.org hunspell dictionary list of working dictionaries
The mozilla hunspell dictionary list of tri-licensed dictionaries
The firefox extension list of available language extensions
How to build a dictionary
How to convert an ispell affix to hunspell .aff
A somewhat related issue.
Will help on adding Indic hunspell dictionaries in Fedora - paragn.
php5 and bluefish still link to aspell at least - kmaraas. (It's not practical for me to port everything, just the core default installed components and the default spell-checking solutions for the main desktop environments and applications - caolanm)
Ubuntu is now following the Fedora practice as well.