J. Carlos Roldán / The longest word chain sequence

Oct 03, 2016 / code

The longest word chain sequence

You're likely familiar with this game. I recently wondered about the longest possible chain of words that could be formed. This post aims to find it.

There are many variations of this game, so let's set some specific rules:

The first player says a word.
The following players must say a word beginning with the last syllable of the previous word, accents ignored.
A word cannot be repeated.
Only words from the 23rd edition of the RAE dictionary are valid. Proper nouns, acronyms, and monosyllables are excluded.
Accents are ignored when chaining words.

Data

First, we need all the words from the Spanish dictionary. Although the RAE doesn't provide a complete list, Giusseppe Domínguez offers one.

Next, we need syllabic decomposition of all the words. This is complex due to many rules and exceptions, but thankfully, Pyphen handles it well.

We end up with a list of 86,993 syllabically separated valid words.

Words by number of syllables in Spanish

Longest Sequence

To calculate the longest possible sequence, we construct a directed graph with words as nodes and edges connecting chainable words.

Word chain graph sample

The challenge then becomes finding the longest directed path in this graph. Unlike finding the shortest path, finding the longest in a graph is NP-complete. With a graph of 86,993 nodes and 43,053,416 edges, it would take 1.82 × 10^10¹² years to find the perfect solution, so we must try approximate solutions.

Random Path

Choosing paths randomly, the longest found is only 92 words:

sopa -> pantalla -> llaneza -> […] -> ramuja -> jagüel -> güeldrés

This is due to easily reaching a sink—a word that can be reached but not exited, like güeldrés.

Other Approaches

Other methods to tackle this problem include:

Performing a depth-first search of the graph and keeping the longest path.
Always choosing words with more common endings.
Eliminating sink nodes before random searches.

Can you think of other ways to solve this? What's the longest chain you've managed to create? Feel free to comment with your results.

Comentarios

Giusseppe Domínguez

Oct 31, 2020

Me alegra saber en qué se está empleando el diccionario (listado de palabras del mismo) que tecleé hace tiempo. Como digo, está tecleado a mano, así que incluye un error que he estimado en torno al 1,5%. Estoy, durante estos días, realizando una corrección del mismo, así como una recopilación de las definiciones en texto plano de todas palabras del diccionario.

Si te interesa, no dudes en contactarme.
Un cordial saludo,
Giusseppe Domínguez

Juan C. Roldán

Nov 03, 2020

Replying to Giusseppe Domínguez:

Muchas gracias Giusseppe. En una investigación que iniciamos hace poco, necesitábamos un corpus de este tipo que tuviera la mayor cantidad posible de palabras, y decidimos usar un repositorio de subtítulos de 234,000 series y películas para extraerlas: http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/xml/es.zip

Pero sin duda una corrección con las definiciones nos será útil si me la puedes facilitar :) espero tu respuesta

Nov 04, 2020

En cuanto termine de analizar las (aprox) 1000 palabras que no se han encontrado en la RAE vía web (con un programa que automáticamente las consulta, hecho en Bash/Linux, que si quieres también te puedo proporcionar), será un placer ponerlas a disposición pública.

Data

Longest Sequence

Random Path

Other Approaches

Comentarios

Leave a comment