In this tutorial I show how to use fastText and a few lines of Python to detect alternate spellings and alternative words for any term in a large text corpus (such as Twitter or Reddit posts). For example, if you want to find all posts in a corpus where people mention a light-duty truck (e.g. a Ford F150 or Toyota Hilux), people may use many different terms: bakkie (a term used in South Africa), ute (used in Australia), pickup (U.S.A.), or even brand or model names, e.g. pik-up (Mahindra Pik-Up).
The method I describe detects alternate spellings and misspellings of a term and gives each candidate a score reflecting how similar it is likely to be to that term. The script I use in the video instructions may be found here.
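To give a feel for the approach before the video, here is a minimal sketch of the general idea. It is not the script linked above. It assumes the `fasttext` Python package is installed and that your corpus is a plain-text file with one post per line; the file name `corpus.txt`, the helper names `train_model` and `find_alternates`, and the `min_score` threshold are all hypothetical choices made for illustration.

```python
def train_model(corpus_path):
    """Train an unsupervised skip-gram fastText model on a text file.

    Assumes the `fasttext` package is installed; it is imported here so
    the rest of this sketch can be read and tested without it.
    """
    import fasttext
    return fasttext.train_unsupervised(corpus_path, model="skipgram")


def find_alternates(model, term, k=20, min_score=0.6):
    """Return (word, score) pairs for the nearest neighbours of `term`.

    `model.get_nearest_neighbors` returns (score, word) tuples, highest
    score first. Neighbours scoring below `min_score` are dropped; the
    0.6 default is an arbitrary illustrative cut-off, not a recommendation.
    """
    neighbours = model.get_nearest_neighbors(term, k=k)
    return [(word, score) for score, word in neighbours if score >= min_score]


# Example usage (hypothetical file name, uncomment to run on your own data):
# model = train_model("corpus.txt")
# for word, score in find_alternates(model, "bakkie"):
#     print(f"{score:.3f}  {word}")
```

Because fastText builds word vectors from character n-grams, misspellings and spelling variants of a term tend to end up close to it, which is why a nearest-neighbour lookup can surface them. The threshold will need tuning for your own corpus.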
Need a Python Tutorial?
If you haven’t used Python before, the short video tutorial below will help.