Native Language Population Weights for SPEL vocabulary generation #SPELdev #linguistics

map of world through language areas

map of the world based on native langauge areas, with different colours illustrating different language families of the world.

Hi All,

So I've decided to go with the native language weights after-all,
since it seems that including L2 makes things too variable over time.

I've used a combination of Ethnologue language families, and the
native language lists to come up with my weights.  It includes the top
100 languages by number of speakers.

Here is the language and family overview:

    /*
      Indo-European 45.72% {
        Indo-Aryan 18.76% {
          Central {
            Hindi   (hi)  4.46%,
            Urdu    (ur)  0.99%,
            Haryanvi  (bgc) 0.21%,
            Awadhi    (awa) 0.33%,
            Chhattisgarhi (hne) 0.19%,
            Dakhini   (dcc) 0.17%
          }
          Eastern {
            Bengali (bn)  3.05%,
            Odia    (or)  0.50%,
            Maithili  (mai) 0.45%,
            Bhojpuri (bho) 0.43%,
            Chittagonian (ctg) 0.24%,
            Assamese    (as)  0.23%,
            Magadhi   (mag) 0.21%,
            Sylheti   (syl) 0.16%,
          }
          North Western {
            Punjabi (pa)  1.44%,
            Saraiki   (skr) 0.26%,
            Sindhi    (sd)  0.39%,
          }
          Western {
            Gujarati (gu)  0.74%,
            Marwari   (mwr) 0.21%,
            Dhundari  (dhd) 0.15%,
          }
          Southern {
            Marathi (mr)  1.1%,
            Sinhalese (si)  0.25%,
            Konkani (kok) 0.11%,
          }
          Northern {
            Nepali    (ne)  0.25%,
          }
          Indo-Aryan Remainder (IAR) 2.24
        }
        Germanic {
          English (en)  5.52%,
          German  (de)  1.39%,
          Dutch   (nl)  0.32%,
          Swedish (sv)  0.13%
        }
        Italic {
          Spanish   (es)  5.85%
          Portuguese (pt) 3.08%
          French    (fr)  1.12%
            Haitian Creole (ht) 0.15%
          Italian   (it)  0.9%
          Romanian  (ro)  0.37%
        }
        Iranian {
          Persian (fa)  0.68%,
          Pashto  (ps)  0.58%,
          Kurdish (ku)  0.31%,
          Balochi (bal) 0.11%,

        }
        Slavic {
          Russian   (ru)  2.42%
          Polish    (pl)  0.61%
          Ukranian  (uk) 0.46%
          Serbo-Croation (hbs)  0.28%
          Czech     (cs)  0.15%
          Belarussian (be) 0.11%
        }
        Hellenic {
          Greek   (el)  0.18%
        }
        European Remainder  2.24
      }
      Sino-Tibetan 21.12% {
        Mandarin  (zh)  14.1%,
        Wu        (wu)  1.2%,
        Cantonese (zhy) 0.89%,
        Jin       (cjy) 0.72%,
        South Min (nan) 0.71%,
        Xiang     (hsn) 0.58%,
        Myanmar   (my)  0.50%,
        Hakka     (hak) 0.46%,
        Gan       (gan) 0.33%,
        North Min (mnp) 0.16%,
        East Min  (cdo) 0.14%,
        Sino-Tiberan Remainder (STR)  1.33
      }
      Hmong-Mien {
        Hmong   (hmx) 0.13%
      }
      Niger-Congo 6.93% {
        Swahili   (sw)
        Yoruba    (yo)  0.42%,
        Fula      (ff)  0.37%,
        Igbo      (ig)  0.36%,
        Chewa     (ny)  0.17%,
        Akan      (ak)  0.17%,
        Zulu      (zu)  0.16%,
        Kinyarwanda (rw)  0.15%,
        Kirundi   (rn)  0.13%,
        Shona     (sn)  0.13%,
        Mossi     (mos) 0.11%,
        Xhosa     (xh)  0.11%,
        Niger-Congo Remainder (NCR) 4.65
      }
      Afro-asiatic 6.33% {
        Arabic    (ar)  4.23%,
        Hausa     (ha)  0.52%,
        Amharic   (am)  0.37%,
        Oromo     (om)  0.36%,
        Somali    (so)  0.22%,
        Afro-Asiatic Remainder (AR) 0.63
      }
      Austronesian 4.99% {
        Indonesian  (id)  1.16%,
        Javanese    (jv)  1.25%,
        Sundanese   (su)  0.57%,
        Tagalog     (tl)  0.42%,
        Cebuano     (ceb) 0.32%,
        Malagasy    (mg)  0.28%,
        Madurese    (mad) 0.23%,
        Ilocano     (ilo) 0.14%,
        Hiligaynon  (hil) 0.12%,
        Austronesian Remainder (ANR) 0.5
      }
      Dravidian   3.5% {
        Telugu    (tl) 1.15%,
        Tamil     (ta) 1.06%,
        Kannada   (kn) 0.58%,
        Malaylam  (ml) 0.57%
        Dravidian Remainder (DR) 0.14
      }
      Turkic 2.64% {
        Turkish   (tr)  0.95%,
        Uzbek     (uz)  0.39%,
        Azerbajani (az) 0.34%,
        Turkmen   (tk)  0.24%,
        Kazakh    (kk)  0.17%,
        Uyghur    (ug)  0.12%,
        Turkic Remainder (TR) 0.43
      }
      Japonic 1.99% {
        Japanese  (jp)  1.92%
        Japonic Remainder (JR) 0.07
      }
      Austro-Asiatic 1.58% {
        Vietnamese  (vi)  1.14%,
        Khmer       (km)  0.24%
        Austro Asiatic Remainder (AAR) 0.2
      }
      Tai-Kadai 1.24% {
        Thai  (th) 0.86%
        Zhuang (za) 0.24%
        Tai-Kadai Remainder (TKR) 0.14
      }
      Koreanic 1.19% {
        Korean  1.14%
        Koreanic Remainer (KR)  0.05
      }
      Uralic 0.32% {
        Hungarian   (hu)  0.19
        Uralic Remainder (UR)   0.13
      }
      Kartvelian 0.08% {
        Georgian
      }
      total 97.63%

      Native Speakers

    */


And here are the weights as I managed to get them. Due to limitations
of both google translate and espeak-ng not all languages have their
words or phonemes available, however they were grouped with the
closest langauge family.

The most unfortunate is likely a complete abscense of the Tai-Kadai
family representation as it has no espeak-ng support, however it has
been grouped with the Austronesian languages which some believe it is
related to.
    NativeWeights = {
      "zh": 14.1 + /*wu*/ 1.2% + /*cjy*/ 0.72 + /*hsn*/ 0.58 + /*hak*/
0.46 +
            /*gan*/ 0.33 + /*mnp*/ 0.16 +/*cdo*/ 0.14 + /*hmx*/ 0.13 +
            /*STR*/ 1.33,
      "es": 5.85,
      "en": 5.52 + /*ER*/ 2.24,
      "hi": 4.46  + /*awa*/ 0.33 + /*bgc*/ 0.21 + /*hne*/ 0.19 +
            /*dcc*/  0.17 + /*IAR*/ 2.24,
      "ar": 4.23 + /*ha*/ 0.52 + /*AR*/ 0.63, "pt": 3.08,
      "bn": 3.05 + /*ctg*/ 0.24 + /*as*/ 0.23 + /*bho*/ 0.43 + /*mai*/
0.45 +
           /*or*/ 0.5  + /*mag*/ 0.21,
      "ru": 2.42 + /*uk*/ 0.46 + /*be*/ 0.11 + /*hbs*/ 0.28,
      "pa": 1.44 + /*skr*/ 0.26 + /*sd*/ 0.39 ,
      "de": 1.39,
      "id": 1.16 + /*jv*/ 1.25 + /*th*/ 0.86 + /*su*/ 0.57 + /*tl*/ 0.42
 +
            /*ceb*/ 0.32 + /*mg*/ 0.28 + /*mad*/ 0.23 + /*ilo*/ 0.14 +
            /*hil*/ 0.12 + /*ANR*/ 0.5 + /*TKR*/ 0.14,
      "te": 1.15,
      "ta": 1.06 + /*DR*/ 0.14,
      "vi": 1.14 + /*km*/ 0.24 + /*AAR*/ 0.2,
      "ko": 1.14 + /*jp*/ 1.92 + /*JR*/ 0.07 + /*KR*/ 0.05,
      "fr": 1.12 + /*ht*/ 0.15, "mr": 1.1 + /*kok*/ 0.11,
      "ur": 0.99,
      "tr": 0.95 + /*uz*/ 0.39 + /*tk*/ 0.24 + /*ug*/ 0.12 +/*TR*/ 0.43,
      "it": 0.9,
      "zhy": 0.89 + /*nan*/ 0.71,
      "gu": 0.74 + /*mwr*/ 0.21 + /*dhd*/ 0.15,
      "fa": 0.68 + /*ps*/ 0.58 + /*bal*/ 0.11,
      "pl": 0.61, "kn": 0.58,
      "ml": 0.57, "my": 0.50,
      "sw": /*yo*/ 0.42 + /*ff*/ 0.37 + /*ny*/ 0.17 + /*ak*/ 0.17 +
            /*rw*/ 0.15 + /*rn*/ 0.13 + /*sn*/ 0.13 + /*mos*/ 0.11 +
            /*xh*/ 0.11 + /*NCR*/ 4.65,
      "am": 0.37 + /*om*/ 0.36 + /*so*/ 0.22, "ro": 0.37,
      "az": 0.34, "nl": 0.32, "ku": 0.31,
      "ne": 0.25, "si": 0.25,
      "hu": 0.19 + /*UR*/ 0.13, "el": 0.18,
      "cs": 0.15, "sv": 0.13, "ka": 0.08
    },



So now future versions of the SPEL vocabulary will include the
percentage of world language speakers which may find something
reminiscent in the words.
— Logan Streondj,