Hi All,

I've decided to go with the native-language weights after all, since including L2 (second-language) speakers seems to make things too variable over time. I combined the Ethnologue language families with the native-language speaker lists to come up with my weights, which cover the top 100 languages by number of speakers.

Here is the language and family overview:

/*
Indo-European 45.72% {
  Indo-Aryan 18.76% {
    Central {
      Hindi (hi) 4.46%, Urdu (ur) 0.99%, Haryanvi (bgc) 0.21%,
      Awadhi (awa) 0.33%, Chhattisgarhi (hne) 0.19%, Dakhini (dcc) 0.17%
    }
    Eastern {
      Bengali (bn) 3.05%, Odia (or) 0.50%, Maithili (mai) 0.45%,
      Bhojpuri (bho) 0.43%, Chittagonian (ctg) 0.24%, Assamese (as) 0.23%,
      Magadhi (mag) 0.21%, Sylheti (syl) 0.16%
    }
    North Western {
      Punjabi (pa) 1.44%, Saraiki (skr) 0.26%, Sindhi (sd) 0.39%
    }
    Western {
      Gujarati (gu) 0.74%, Marwari (mwr) 0.21%, Dhundari (dhd) 0.15%
    }
    Southern {
      Marathi (mr) 1.1%, Sinhalese (si) 0.25%, Konkani (kok) 0.11%
    }
    Northern {
      Nepali (ne) 0.25%
    }
    Indo-Aryan Remainder (IAR) 2.24%
  }
  Germanic {
    English (en) 5.52%, German (de) 1.39%, Dutch (nl) 0.32%, Swedish (sv) 0.13%
  }
  Italic {
    Spanish (es) 5.85%, Portuguese (pt) 3.08%, French (fr) 1.12%,
    Haitian Creole (ht) 0.15%, Italian (it) 0.9%, Romanian (ro) 0.37%
  }
  Iranian {
    Persian (fa) 0.68%, Pashto (ps) 0.58%, Kurdish (ku) 0.31%, Balochi (bal) 0.11%
  }
  Slavic {
    Russian (ru) 2.42%, Polish (pl) 0.61%, Ukrainian (uk) 0.46%,
    Serbo-Croatian (hbs) 0.28%, Czech (cs) 0.15%, Belarusian (be) 0.11%
  }
  Hellenic {
    Greek (el) 0.18%
  }
  European Remainder 2.24%
}
Sino-Tibetan 21.12% {
  Mandarin (zh) 14.1%, Wu (wu) 1.2%, Cantonese (zhy) 0.89%, Jin (cjy) 0.72%,
  South Min (nan) 0.71%, Xiang (hsn) 0.58%, Myanmar (my) 0.50%, Hakka (hak) 0.46%,
  Gan (gan) 0.33%, North Min (mnp) 0.16%, East Min (cdo) 0.14%,
  Sino-Tibetan Remainder (STR) 1.33%
}
Hmong-Mien {
  Hmong (hmx) 0.13%
}
Niger-Congo 6.93% {
  Swahili (sw), Yoruba (yo) 0.42%, Fula (ff) 0.37%, Igbo (ig) 0.36%,
  Chewa (ny) 0.17%, Akan (ak) 0.17%, Zulu (zu) 0.16%, Kinyarwanda (rw) 0.15%,
  Kirundi (rn) 0.13%, Shona (sn) 0.13%, Mossi (mos) 0.11%, Xhosa (xh) 0.11%,
  Niger-Congo Remainder (NCR) 4.65%
}
Afro-Asiatic 6.33% {
  Arabic (ar) 4.23%, Hausa (ha) 0.52%, Amharic (am) 0.37%, Oromo (om) 0.36%,
  Somali (so) 0.22%,
  Afro-Asiatic Remainder (AR) 0.63%
}
Austronesian 4.99% {
  Indonesian (id) 1.16%, Javanese (jv) 1.25%, Sundanese (su) 0.57%,
  Tagalog (tl) 0.42%, Cebuano (ceb) 0.32%, Malagasy (mg) 0.28%,
  Madurese (mad) 0.23%, Ilocano (ilo) 0.14%, Hiligaynon (hil) 0.12%,
  Austronesian Remainder (ANR) 0.5%
}
Dravidian 3.5% {
  Telugu (te) 1.15%, Tamil (ta) 1.06%, Kannada (kn) 0.58%, Malayalam (ml) 0.57%,
  Dravidian Remainder (DR) 0.14%
}
Turkic 2.64% {
  Turkish (tr) 0.95%, Uzbek (uz) 0.39%, Azerbaijani (az) 0.34%,
  Turkmen (tk) 0.24%, Kazakh (kk) 0.17%, Uyghur (ug) 0.12%,
  Turkic Remainder (TR) 0.43%
}
Japonic 1.99% {
  Japanese (ja) 1.92%,
  Japonic Remainder (JR) 0.07%
}
Austro-Asiatic 1.58% {
  Vietnamese (vi) 1.14%, Khmer (km) 0.24%,
  Austro-Asiatic Remainder (AAR) 0.2%
}
Tai-Kadai 1.24% {
  Thai (th) 0.86%, Zhuang (za) 0.24%,
  Tai-Kadai Remainder (TKR) 0.14%
}
Koreanic 1.19% {
  Korean 1.14%,
  Koreanic Remainder (KR) 0.05%
}
Uralic 0.32% {
  Hungarian (hu) 0.19%,
  Uralic Remainder (UR) 0.13%
}
Kartvelian 0.08% {
  Georgian
}
Total: 97.63% of native speakers
*/

And here are the weights as I managed to get them. Due to limitations of both Google Translate and espeak-ng, not all languages have their words or phonemes available; the weight of each missing language was folded into the closest language that is available. The most unfortunate gap is probably the complete absence of the Tai-Kadai family, as it has no espeak-ng support; it has been grouped with the Austronesian languages, which some believe it is related to.

NativeWeights = {
  "zh": 14.1 + /*wu*/ 1.2 + /*cjy*/ 0.72 + /*hsn*/ 0.58 + /*hak*/ 0.46
        + /*gan*/ 0.33 + /*mnp*/ 0.16 + /*cdo*/ 0.14 + /*hmx*/ 0.13 + /*STR*/ 1.33,
  "es": 5.85,
  "en": 5.52 + /*ER*/ 2.24,
  "hi": 4.46 + /*awa*/ 0.33 + /*bgc*/ 0.21 + /*hne*/ 0.19 + /*dcc*/ 0.17
        + /*IAR*/ 2.24,
  "ar": 4.23 + /*ha*/ 0.52 + /*AR*/ 0.63,
  "pt": 3.08,
  "bn": 3.05 + /*ctg*/ 0.24 + /*as*/ 0.23 + /*bho*/ 0.43 + /*mai*/ 0.45
        + /*or*/ 0.5 + /*mag*/ 0.21,
  "ru": 2.42 + /*uk*/ 0.46 + /*be*/ 0.11 + /*hbs*/ 0.28,
  "pa": 1.44 + /*skr*/ 0.26 + /*sd*/ 0.39,
  "de": 1.39,
  "id": 1.16 + /*jv*/ 1.25 + /*th*/ 0.86 + /*su*/ 0.57 + /*tl*/ 0.42
        + /*ceb*/ 0.32 + /*mg*/ 0.28 + /*mad*/ 0.23 + /*ilo*/ 0.14
        + /*hil*/ 0.12 + /*ANR*/ 0.5 + /*TKR*/ 0.14,
  "te": 1.15,
  "ta": 1.06 + /*DR*/ 0.14,
  "vi": 1.14 + /*km*/ 0.24 + /*AAR*/ 0.2,
  "ko": 1.14 + /*ja*/ 1.92 + /*JR*/ 0.07 + /*KR*/ 0.05,
  "fr": 1.12 + /*ht*/ 0.15,
  "mr": 1.1 + /*kok*/ 0.11,
  "ur": 0.99,
  "tr": 0.95 + /*uz*/ 0.39 + /*tk*/ 0.24 + /*ug*/ 0.12 + /*TR*/ 0.43,
  "it": 0.9,
  "zhy": 0.89 + /*nan*/ 0.71,
  "gu": 0.74 + /*mwr*/ 0.21 + /*dhd*/ 0.15,
  "fa": 0.68 + /*ps*/ 0.58 + /*bal*/ 0.11,
  "pl": 0.61,
  "kn": 0.58,
  "ml": 0.57,
  "my": 0.50,
  "sw": /*yo*/ 0.42 + /*ff*/ 0.37 + /*ny*/ 0.17 + /*ak*/ 0.17 + /*rw*/ 0.15
        + /*rn*/ 0.13 + /*sn*/ 0.13 + /*mos*/ 0.11 + /*xh*/ 0.11 + /*NCR*/ 4.65,
  "am": 0.37 + /*om*/ 0.36 + /*so*/ 0.22,
  "ro": 0.37,
  "az": 0.34,
  "nl": 0.32,
  "ku": 0.31,
  "ne": 0.25,
  "si": 0.25,
  "hu": 0.19 + /*UR*/ 0.13,
  "el": 0.18,
  "cs": 0.15,
  "sv": 0.13,
  "ka": 0.08
},

So future versions of the SPEL vocabulary will include, for each word, the percentage of the world's native speakers who may find something reminiscent in it.
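
As a rough illustration of how such a percentage could be computed from the weights, here is a minimal sketch in JavaScript. The names `sampleWeights`, `speakerShare` and `normalize` are made up for this example and are not part of the SPEL code; the table is only a handful of entries copied from the weights above, and the idea that each word carries a list of language codes is an assumption.

```javascript
// Hypothetical subset of the native-speaker weights (percent of native speakers).
// Grouped entries mirror the sums used in the full table.
const sampleWeights = {
  es: 5.85,
  en: 5.52 + /*ER*/ 2.24,
  pt: 3.08,
  de: 1.39,
  fr: 1.12 + /*ht*/ 0.15,
};

// Sum the weights of the languages in which a word is assumed to sound familiar.
// Unknown codes contribute nothing, and duplicate codes are counted once.
function speakerShare(weights, languageCodes) {
  let total = 0;
  for (const code of new Set(languageCodes)) {
    if (Object.prototype.hasOwnProperty.call(weights, code)) {
      total += weights[code];
    }
  }
  // Round to two decimals to hide floating-point noise.
  return Math.round(total * 100) / 100;
}

// Rescale a weight table so that its entries sum to 100.
function normalize(weights) {
  const sum = Object.values(weights).reduce((a, b) => a + b, 0);
  const result = {};
  for (const [code, weight] of Object.entries(weights)) {
    result[code] = (weight / sum) * 100;
  }
  return result;
}

console.log(speakerShare(sampleWeights, ["es", "pt", "fr"])); // 10.2
```

Rescaling with `normalize` is optional; it would only matter if the shares should be expressed relative to the languages covered rather than to all native speakers.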
— Logan Streondj,