The longest word chain sequence

You're likely familiar with this game. I recently wondered about the longest possible chain of words that could be formed. This post aims to find it.

There are many variations of this game, so let's set some specific rules:

  1. The first player says a word.
  2. The following players must say a word beginning with the last syllable of the previous word, accents ignored.
  3. A word cannot be repeated.
  4. Only words from the 23rd edition of the RAE dictionary are valid. Proper nouns, acronyms, and monosyllables are excluded.
  5. Accents are ignored when chaining words.

Data

First, we need all the words from the Spanish dictionary. Although the RAE doesn't provide a complete list, Giusseppe Domínguez offers one.

Next, we need syllabic decomposition of all the words. This is complex due to many rules and exceptions, but thankfully, Pyphen handles it well.

We end up with a list of 86,993 syllabically separated valid words.

Words by number of syllables in Spanish

Longest Sequence

To calculate the longest possible sequence, we construct a directed graph with words as nodes and edges connecting chainable words.

Word chain graph sample

The challenge then becomes finding the longest directed path in this graph. Unlike finding the shortest path, finding the longest in a graph is NP-complete. With a graph of 86,993 nodes and 43,053,416 edges, it would take 1.82 × 101012 years to find the perfect solution, so we must try approximate solutions.

Random Path

Choosing paths randomly, the longest found is only 92 words:

sopa -> pantalla -> llaneza -> […] -> ramuja -> jagüel -> güeldrés

This is due to easily reaching a sink—a word that can be reached but not exited, like güeldrés.

Other Approaches

Other methods to tackle this problem include:

Can you think of other ways to solve this? What's the longest chain you've managed to create? Feel free to comment with your results.

Comentarios

Giusseppe Domínguez Reply
Me alegra saber en qué se está empleando el diccionario (listado de palabras del mismo) que tecleé hace tiempo. Como digo, está tecleado a mano, así que incluye un error que he estimado en torno al 1,5%. Estoy, durante estos días, realizando una corrección del mismo, así como una recopilación de las definiciones en texto plano de todas palabras del diccionario.

Si te interesa, no dudes en contactarme.
Un cordial saludo,
Giusseppe Domínguez
Juan C. Roldán Reply

Replying to Giusseppe Domínguez:

Muchas gracias Giusseppe. En una investigación que iniciamos hace poco, necesitábamos un corpus de este tipo que tuviera la mayor cantidad posible de palabras, y decidimos usar un repositorio de subtítulos de 234,000 series y películas para extraerlas: http://opus.nlpl.eu/download.php?f=OpenSubtitles/v2018/xml/es.zip

Pero sin duda una corrección con las definiciones nos será útil si me la puedes facilitar :) espero tu respuesta
Giusseppe Domínguez Reply
En cuanto termine de analizar las (aprox) 1000 palabras que no se han encontrado en la RAE vía web (con un programa que automáticamente las consulta, hecho en Bash/Linux, que si quieres también te puedo proporcionar), será un placer ponerlas a disposición pública.

Leave a comment

Get a mail
4d8cd43bbbfbbd2b7aed08d9a2b0ef251cebfd3e2603b74b710a2d38b7f8ec39