Algorithmic World Lang, SPEL IL, International Auxiliary Language

Hi language enthusiasts,

As you may know, I'm working on SPEL (Speakable Programming for Every
Language). It has an Intermediary Language (IL) which is used by the VM;
while easily parseable, it is also speakable. Anyways, recently I
realized that my phonotactics can be improved a little, so I decided I
should redefine the whole vocabulary base. Since doing it manually last
time took me nearly half a year for a mere 1000 words, I'd like to
automate the process this time.

I would like us all to put our heads together and come up with a
root-word-generating, concept-naming algorithm a super majority of us
can agree on. "Naming and caching are the two biggest problems in
programming." An average human can hold about 4 things in short-term
memory (cache), so having root words that are up to 4 glyphs long is
optimal. Naming is a big problem in programming, with different APIs
choosing different words and different orderings of words. Hopefully
with SPEL the naming process can become a standardized algorithmic
process. Note that while the words discussed are in the intermediary
language, all human languages will benefit, as they can be translated
to and from it.

There are a few aspects to concept-name root-word generation:
* Phonology
* Phonotactics
* Source Languages
* Word to concept matching


*Phonology*

I'm fairly sure just about everyone agrees on the phonology I use, as
no one has complained much about it. Here is my definition of the
24-glyph alphabet, with glyph order the same as in PHOIBLE.
Note y = /j/, c = /ʃ/, j = /ʒ/, a = /ä/; otherwise they are all IPA.
var Glyph24Alphabet =



*Phonotactics*

I've talked about phonotactics before, here and on the conlang
mailing list, and now Victor Chan has brought it up again.

With his system, assuming a 16-glyph alphabet, there are 663 possible
syllables. If we assume affricates are okay, that brings it up to
852, or 1101 if /l/ can be second or final.

With the 24-glyph alphabet (which is about average for worldlangs),
there are 2825 syllables with only glides as seconds, 4440 syllables
if including affricates, or 5495 if including /l/ as second or last.
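Whatever the exact phonotactic template ends up being, counts like those above come from multiplying the sizes of each syllable slot. A minimal sketch, assuming a (C)(G)V(C) template where every slot except the vowel is optional; the class sizes in the example are hypothetical placeholders, not the actual SPEL inventory:

```python
def count_syllables(onsets, seconds, vowels, finals):
    """Count syllables of shape (C)(G)V(C): each optional slot can
    also be empty, so add 1 to its class size before multiplying."""
    return (onsets + 1) * (seconds + 1) * vowels * (finals + 1)

# Hypothetical inventory: 10 onsets, 2 glides, 5 vowels, 8 finals.
print(count_syllables(10, 2, 5, 8))  # 11 * 3 * 5 * 9 = 1485
```

Changing which phoneme classes are allowed in the second or final slot (affricates, /l/, and so on) is then just a change to the class sizes.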

A vocabulary of 1000 words gives about 80% fluency, 3000 gives 90%,
and around 8000 is average fluency, with good writers having around
15,000 words and great writers as much as 30,000.

However, while an IAL would certainly need room for as many words as
English (a million plus), the specialty words can be more complex,
either compounds or using a greater range of phonemes.

So I guess 852 syllables for the core and 4440 for fluent vocabulary
should be enough. I know my wife has trouble pronouncing /tla/ or any
/tl/ onset, so I'm guessing she's not the only one. Chan mentioned it
for Chinese speakers, who number more than a billion, so making the
language easy for them to learn is important.

*Source Languages*

For the purposes of my algorithm I can only use languages which are
included in Google Translate, though not all of them, as that would
unfairly bias it towards the languages that happen to be represented.
Thus I've decided to go with several languages, each of which
represents a major language family.

According to Wikipedia, half of Indo-European is actually the
Indo-Aryan family, with about 1.5 billion people. That is more than
Chinese!

So I'll be including Hindi, weighted with the 1,500 million people of
the Indo-Aryan family. That certainly makes it the largest, followed
closely by Mandarin Chinese, with some 1.3 billion.

Though English seems to have only around 840 million; I'm wondering if
anyone has updated statistics for it, since everyone loves to rave
about how international it is.

The rest of the sources:

* Niger-Congo, the fourth biggest group with some 600 million people, represented by Swahili.
* Arabic, representing Afro-Asiatic with 490 million.
* Indonesian, representing Austronesian with 486 million.
* Turkish, representing Turkic, Korean, and Japanese with 377 million.
* Russian, representing Slavic with 315 million.
* Tamil, representing Dravidian with 210 million.
* Farsi, representing the Iranian languages with 200 million.
* Finnish, representing Uralic with 28 million.
* Swedish, representing North Germanic with 21 million.
* Georgian, representing Kartvelian with 5 million.
* Welsh, representing Celtic with 2 million.

And Spanish (490 million), French (220 million), German (145 million),
Portuguese (200 million), Italian (64 million), and Greek (13 million)
will be representing themselves, since they have espeak support.

I can also decompose Slavic into its main branch constituents:
Lithuanian (Baltic, 5 million), Russian (Eastern Slavic, 325 million),
Polish (Western Slavic, 57 million), Serbian (South Slavic, 32 million).
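The distribution above boils down to a weighted table of source languages. A sketch of how I'd represent it, using the population figures from this post (in millions) keyed by ISO 639-1 codes; the normalization helper is my own illustration, not an existing SPEL function:

```python
# Source languages and speaker populations (millions), from the
# figures listed above (the self-representing Romance/Greek group
# and the Slavic decomposition are omitted for brevity).
SOURCE_WEIGHTS = {
    "hi": 1500,  # Hindi, Indo-Aryan
    "zh": 1300,  # Mandarin Chinese, Sino-Tibetan
    "en": 840,   # English
    "sw": 600,   # Swahili, Niger-Congo
    "ar": 490,   # Arabic, Afro-Asiatic
    "id": 486,   # Indonesian, Austronesian
    "tr": 377,   # Turkish, Turkic
    "ru": 315,   # Russian, Slavic
    "ta": 210,   # Tamil, Dravidian
    "fa": 200,   # Farsi, Iranian
    "fi": 28,    # Finnish, Uralic
    "sv": 21,    # Swedish, North Germanic
    "ka": 5,     # Georgian, Kartvelian
    "cy": 2,     # Welsh, Celtic
}

def normalized_weights(weights):
    """Scale populations so all the languages' votes sum to 1."""
    total = sum(weights.values())
    return {lang: pop / total for lang, pop in weights.items()}
```

With the weights normalized, each language's influence on a root word is proportional to its speaker population.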

Anyways, I'm wondering what you think about this worldlang source
distribution, and whether I've missed anything available on both
Google Translate and espeak (for phonemic transcription).

Interestingly, I'm guessing this worldlang is mostly going to sound
like Hindi with a mix of Mandarin, which is going to be quite
different from most of the "eurolang" auxlangs put out thus far.

If you know any other major language families which are available on
Google Translate but are not listed here, have alternative
translation engines, or have qualms with the ones I have listed, then
please comment.

*Word To Concept Matching*

Now this is actually one of the most complicated parts.
Basically I take the proto-language approach, which says that if a
bunch of languages have a word or phoneme in common, then it is a good
one to use for the proto-language, or in this case for the auxlang.
Thus words get the phonemes most commonly represented across the
world's languages.
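A minimal sketch of that "phoneme voting" idea, assuming each translation has already been reduced to a plain phoneme string and each language carries a weight (the transcriptions in the example are made-up pseudo-transcriptions, and the helper name is mine, not SPEL's):

```python
from collections import Counter

def phoneme_votes(translations, weights):
    """Tally phonemes across the translations of one concept,
    weighting each language's vote by its weight (e.g. normalized
    speaker population)."""
    votes = Counter()
    for lang, phonemes in translations.items():
        for ph in set(phonemes):  # each phoneme votes once per language
            votes[ph] += weights.get(lang, 0)
    return votes

# Made-up transcriptions of one concept in three languages,
# all weighted equally for simplicity.
votes = phoneme_votes({"en": "wata", "hi": "pani", "ru": "vada"},
                      {"en": 1.0, "hi": 1.0, "ru": 1.0})
# /a/ appears in all three, so it gets the strongest vote.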

However, due to the limited syllable space, and because some phonemes
are simply more common than others, not all words can have their ideal
phoneme set. To distinguish which words "deserve" to get closer to the
ideal, we can use usage lists, or word-frequency lists: if a word is
used more often, it deserves to be closer to its ideal phonemes.

Additionally, part of the algorithm can identify the optimal position
for the phoneme representations: for instance, if the phoneme is
within the first few letters of the source word, then it is most
likely to become the first or second letter of the IAL root;
similarly, if it is near the end, then it is more likely to become the
final phoneme.
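A sketch of that positional heuristic, again treating each weighted source word as a plain phoneme string; the bucket boundaries here (first two positions count as "initial", the last as "final") are illustrative choices, not part of SPEL:

```python
from collections import Counter

def positional_counts(words):
    """Tally which phonemes appear near the start, middle, and end
    of the weighted source words, so the generator can favor putting
    each phoneme where the source languages put it."""
    buckets = {"initial": Counter(), "medial": Counter(), "final": Counter()}
    for word, weight in words:
        for i, ph in enumerate(word):
            if i == len(word) - 1:
                bucket = "final"
            elif i < 2:
                bucket = "initial"
            else:
                bucket = "medial"
            buckets[bucket][ph] += weight
    return buckets
```

A phoneme that mostly lands in the "initial" bucket is a strong candidate for the first or second glyph of the root.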

If there aren't really any good options, and only a few relatively
rare phonemes are available, then the goal would be to approximate a
word from one of the existing languages.

The algorithm would accept four inputs: the list of words to define,
the word-frequency list, the list of currently defined and possible
words, and the source translations with phonemic transcriptions. It
would order the words to define by frequency, with the most frequent
being defined first.

It would output how many of each phoneme are found in the source
translations, which initial phonemes are popular, and which final
phonemes are. Then it would look at the list of defined and possible
words, and see which of the possible words match the beginning
phoneme, the ending phoneme, the central vowel, and if necessary the
secondary phoneme. Based on what it finds, it would output the top 4
possibilities for that word, or, if one is a good enough match, it may
simply define the word by itself.
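The matching step above can be sketched as a simple scoring pass over the possible words. Everything here is a simplification: candidates are plain phoneme strings, the score weights are made up, and none of this is the actual SPEL implementation:

```python
def score_candidate(candidate, initial, final, vowel):
    """Score a candidate root by how many of the desired
    initial/final/central-vowel phonemes it contains."""
    score = 0
    if candidate and candidate[0] == initial:
        score += 2  # assumption: matching the onset matters most
    if candidate and candidate[-1] == final:
        score += 1
    if vowel in candidate:
        score += 1
    return score

def propose_roots(possible, initial, final, vowel, top_n=4):
    """Return the top-N candidate roots, best match first."""
    ranked = sorted(possible,
                    key=lambda c: -score_candidate(c, initial, final, vowel))
    return ranked[:top_n]

# Desired phonemes: onset /k/, final /t/, vowel /a/.
print(propose_roots(["kat", "tik", "kit", "pam"], "k", "t", "a"))
# → ['kat', 'kit', 'pam', 'tik']
```

If the top candidate's score clears some threshold, the word can be defined automatically; otherwise the 4 candidates go to a human for review.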

Having an algorithm that can do this is quite important in the real
world of IALs, since making vocabulary sucks up so much time. It took
me months to make the 1000 or so current words of Mwak/Lank, which I
think is really ridiculous, but in doing so I think I've analyzed the
algorithm I used to make them, which is the above.

If you have any more suggestions for it, from your own experience with
worldlang creation, please share; we can all benefit :-).

