Wordle Valid Solutions Letter Frequency

Wordle has become very popular recently and has even inspired several variations, such as Quordle (a four-word version), Mathler (a math equation version), and Handle (a Chinese character version). This week, I found a dataset on Kaggle with all the valid guesses and solutions in Wordle. As a non-native speaker, I always struggle to come up with a word that matches all the clues, so it’s very important for me to strategize my guesses. Hopefully, this dataset can shed some light on the best starter guesses.

My Visualization

I calculated the frequency of each letter at each position in the valid solutions list and plotted the results as a heatmap with a color palette similar to Wordle’s :)
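
For readers who want to build the table behind such a heatmap themselves, here is a minimal sketch using `pd.crosstab`. The five-word list is a toy stand-in for the real solutions file, and the helper name `position_letter_matrix` is my own, so treat it as an illustration rather than the exact pipeline behind the dashboard.

```python
import pandas as pd

# Toy stand-in for the solutions list (assumption: the real data has one 5-letter word per row)
words = pd.Series(["crane", "slate", "shine", "point", "abide"], name="word")

def position_letter_matrix(words: pd.Series) -> pd.DataFrame:
    """Return a table of counts with letters as rows and positions 1..5 as columns."""
    # Build one row per (letter, position) pair across all words
    long = pd.DataFrame({
        "letter": [w[i] for w in words for i in range(5)],
        "position": [i + 1 for w in words for i in range(5)],
    })
    # crosstab counts how often each letter occurs at each position (0 if never)
    return pd.crosstab(long["letter"], long["position"])

matrix = position_letter_matrix(words)
print(matrix)
```

Each cell of `matrix` is the number of words with that letter at that position, which is the quantity a heatmap would color.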

Please note that all the visualizations are designed for desktop view, so I recommend viewing them on a desktop device.

Dashboard link

Insights

  • Generally speaking, vowels have higher frequencies than consonants;
  • But each vowel has its own preferred positions: for example, the letter ‘E’ is more frequent at positions 4 and 5, while the letter ‘A’ shows up more at positions 2 and 3;
  • Some of the most common consonants are R, T, L and S.
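
One way to check the vowel pattern is to count, for each vowel, how often it lands at each position. The sketch below does this on a toy five-word list, so its numbers will not match the real dataset. The function name `vowel_position_profile` is hypothetical.

```python
# Toy stand-in for the solutions list; swap in the real words to reproduce the insight
words = ["crane", "slate", "shine", "point", "abide"]
VOWELS = "aeiou"

def vowel_position_profile(words):
    """Map each vowel to a list of 5 counts (index 0 corresponds to position 1)."""
    profile = {v: [0] * 5 for v in VOWELS}
    for w in words:
        for i, ch in enumerate(w):
            if ch in profile:
                profile[ch][i] += 1
    return profile

profile = vowel_position_profile(words)

# Report the most frequent position for every vowel that occurs at least once
peak_positions = {
    v: counts.index(max(counts)) + 1
    for v, counts in profile.items()
    if sum(counts) > 0
}
print(profile)
print(peak_positions)
```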

Moreover, to find the best starting word, I removed the words with repeated letters (so that each guess yields more clues) and, for each remaining word, summed the frequencies of its letters at their respective positions. The top words are saine (1542), soare (1528), saice (1512), slane (1480), and soily (1437), all starting with ‘S’ because S has a much higher frequency at position 1 than any other letter. The highest-scoring word that does not start with ‘S’ is crane (1378).
Below is the Python code I wrote:

import pandas as pd
import numpy as np
guesses = pd.read_csv("valid_guesses.csv")
solutions = pd.read_csv("valid_solutions.csv")

## parse out the five letters of each valid solution
for i in range(5):
    # .str[i] picks the i-th character of every word (0-based)
    solutions['letter_' + str(i + 1)] = solutions['word'].str[i]

## calculate frequency for each letter at each position
def freq_table(loc):
    # loc is the 0-based position within the word
    letters = solutions['word'].str[loc]
    # value_counts() sorts by count in descending order by default
    freq = letters.value_counts().rename_axis('letter').reset_index(name='freq' + str(loc))
    return freq
    
letter_1_freq = freq_table(0)
letter_2_freq = freq_table(1)
letter_3_freq = freq_table(2)
letter_4_freq = freq_table(3)
letter_5_freq = freq_table(4)

## remove words with duplicate letters
all_words = pd.concat([guesses, solutions], ignore_index=True)
all_words['distinct_letters'] = all_words['word'].apply(lambda w: len(w) == len(set(w)))
# .copy() so that adding columns below does not trigger SettingWithCopyWarning
best_guesses = all_words[all_words['distinct_letters']].copy()

## calculate sum of the frequencies for each valid guess word
freq_tables = [letter_1_freq, letter_2_freq, letter_3_freq, letter_4_freq, letter_5_freq]
for loc, freq in enumerate(freq_tables):
    col = 'letter_' + str(loc + 1)
    best_guesses[col] = best_guesses['word'].str[loc]
    # drop the helper 'letter' key after each merge; otherwise repeated merges
    # collide on it (letter_x / letter_y) and newer pandas versions raise a MergeError
    best_guesses = best_guesses.merge(freq, left_on=col, right_on='letter', how='left').drop(columns='letter')

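# a letter that never occurs at a position in the solutions has no match in the merge (NaN);
# sum(axis=1) skips NaN, so such a letter contributes 0 to the score
best_guesses['total_freq'] = best_guesses[['freq0', 'freq1', 'freq2', 'freq3', 'freq4']].sum(axis=1)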
best_guesses = best_guesses[['word', 'total_freq']].sort_values(by='total_freq', ascending = False)
best_guesses
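
As a cross-check on the scoring idea, the same ranking can be sketched in plain Python with `collections.Counter`. The word lists here are toy stand-ins and the names `score`, `ranked`, and `best_non_s` are my own, so this is an illustration under those assumptions and not a replacement for the pandas code above. It also shows one way to pick the top word that does not start with ‘S’.

```python
from collections import Counter

# Toy stand-ins for the solution and guess lists (assumption: all words have 5 letters)
solutions = ["crane", "slate", "shine", "point", "abide"]
guesses = ["saine", "crane", "sassy"]

# position_counts[i][letter] = number of solutions with `letter` at position i (0-based)
position_counts = [Counter(w[i] for w in solutions) for i in range(5)]

def score(word):
    """Sum the positional frequencies of the word's letters; unseen letters count as 0."""
    return sum(position_counts[i][ch] for i, ch in enumerate(word))

# Keep only words whose five letters are all different, then rank by score
candidates = [w for w in guesses if len(set(w)) == len(w)]
ranked = sorted(candidates, key=score, reverse=True)

# Highest-scoring candidate that does not start with 's' (None if there is no such word)
best_non_s = next((w for w in ranked if not w.startswith("s")), None)

for w in ranked:
    print(w, score(w))
print("best non-s word:", best_non_s)
```

A `Counter` returns 0 for a letter it has never seen, so a guess containing a letter that never occurs at that position in the solutions still gets a numeric score.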

Follow this link to find more weekly vizzes :)