FastText: A Comprehensive Introduction

FastText is an open-source, efficient, and scalable text representation and classification library developed by Facebook’s Artificial Intelligence Research (FAIR) team. It is designed to handle large-scale datasets, offering fast training and inference for text-based AI tasks. Unlike traditional natural language processing (NLP) methods that treat words as atomic entities, FastText leverages subword information, making it robust to out-of-vocabulary words and spelling variations.

FastText is most commonly used for text classification and for learning text representations (word embeddings), which can in turn feed downstream systems such as language identification or machine translation pipelines. It supports both supervised learning (classifiers) and unsupervised learning (word vectors), so one library covers a range of NLP problems.

With its intuitive interface and support for pre-trained models, FastText is a popular choice for many NLP applications. Additionally, it provides APIs for Python, simplifying its integration into existing workflows.

FastText APIs with Explanations and Code Snippets

The 20 short recipes below cover the most useful parts of the FastText Python API. The snippets assume the official fasttext Python package; the methods available can differ between versions.

1. Training a Supervised Text Classification Model

FastText can train models for supervised learning tasks like text classification.

  import fasttext

  # Training a supervised classification model
  model = fasttext.train_supervised(input="data.txt", lr=0.1, epoch=25, wordNgrams=2)

  # Save the model
  model.save_model("model.bin")

train_supervised: Trains a supervised model with parameters like learning rate (lr), number of epochs (epoch), and n-grams (wordNgrams).

2. Loading a Pre-Trained Model

  import fasttext

  # Load a previously trained model
  model = fasttext.load_model("model.bin")

  # Use the loaded model
  print(model.predict("Sample text to classify"))

load_model: Loads a saved FastText model from disk for inference.

3. Predicting a Label for a Text

After training a model, you can use it to predict a label for input text.

  # Predict with the model
  labels, probabilities = model.predict("Hello, I need assistance.", k=2)

  print("Labels:", labels)
  print("Probabilities:", probabilities)

predict: Returns the top k labels and their probabilities for a given text.
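
The labels returned by predict keep the __label__ prefix used in the training file. The helper below is a minimal sketch that strips the prefix and pairs each label with its score. The name pair_predictions is made up for this example, and the input is a hand-written stand-in for a real prediction, so the snippet runs without a trained model.

```python
def pair_predictions(labels, probabilities, prefix="__label__"):
    """Strip the label prefix and pair each label with its probability."""
    cleaned = [
        label[len(prefix):] if label.startswith(prefix) else label
        for label in labels
    ]
    return [(label, float(prob)) for label, prob in zip(cleaned, probabilities)]


# Hand-written stand-in for the output of model.predict(text, k=2)
example_labels = ("__label__support", "__label__billing")
example_probs = [0.91, 0.07]

for label, prob in pair_predictions(example_labels, example_probs):
    print(f"{label}: {prob:.2f}")
```

With a real model, the two return values of model.predict can be passed to the helper directly.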

4. Getting Word Vector for a Word

  # Get word vector for a specific word
  vector = model.get_word_vector("hello")
  print("Word Vector for 'hello':", vector)

get_word_vector: Retrieves the vector representation for a given word.

5. Getting Sentence Vector (Text Representation)

Generates a vector representation for an entire sentence or text.

  sentence_vector = model.get_sentence_vector("This is an example sentence.")
  print("Sentence Vector:", sentence_vector)

get_sentence_vector: Computes a vector representation for a piece of text.
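
For models trained with train_unsupervised, the fastText documentation describes the sentence vector as an average of the word vectors, each divided by its L2 norm, with zero-norm vectors left out. The pure-Python sketch below illustrates that averaging on toy vectors. The function name average_normalized is an assumption for this example, and supervised models average their inputs differently, so treat this as an illustration rather than the library's implementation.

```python
import math


def average_normalized(vectors):
    """Average the L2-normalized vectors, skipping all-zero ones."""
    if not vectors:
        return []
    dim = len(vectors[0])
    total = [0.0] * dim
    count = 0
    for vec in vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            for i, x in enumerate(vec):
                total[i] += x / norm
            count += 1
    # Divide by the number of vectors that actually contributed
    return [t / count for t in total] if count else total


# Toy two-dimensional "word vectors"; the zero vector is ignored
toy_word_vectors = [[3.0, 4.0], [0.0, 0.0], [1.0, 0.0]]
print(average_normalized(toy_word_vectors))
```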

6. Retrieve Vocabulary from the Model

  # Get words in the vocabulary
  words = model.get_words()
  print("Words in Vocabulary:", words)

get_words: Outputs the vocabulary used in the model.

7. Quantizing a Model to Reduce Size

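Quantization compresses the weights of a supervised model, which shrinks the file size and memory footprint considerably, usually at a small cost in accuracy.

  # Quantize a supervised (classification) model; input is the training file
  model.quantize(input="data.txt")
  model.save_model("quantized_model.ftz")

quantize: Compresses a supervised model to reduce storage and memory use. Quantized models are conventionally saved with the .ftz extension.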

8. Training Word Vectors

  # Unsupervised word vector training
  word_vector_model = fasttext.train_unsupervised(input="text.txt", model="skipgram", dim=100)

  word_vector_model.save_model("word_vectors.bin")

train_unsupervised: Trains unsupervised word embeddings using methods like skipgram or cbow.

9. Retrieving Nearest Neighbors

  # Find nearest neighbors for a word
  neighbors = word_vector_model.get_nearest_neighbors("king", k=5)
  print("Nearest Neighbors:", neighbors)

get_nearest_neighbors: Finds words with vector representations most similar to a given word.
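
The result is a list of (similarity, word) pairs ordered from most to least similar. Conceptually the ranking is by cosine similarity between word vectors, which the brute-force sketch below reproduces on a made-up three-word vocabulary. The function names and toy vectors are assumptions for illustration only.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbors(query, vectors, k=5):
    """Return up to k (similarity, word) pairs, most similar first."""
    scored = [
        (cosine(vectors[query], vec), word)
        for word, vec in vectors.items()
        if word != query
    ]
    return sorted(scored, reverse=True)[:k]


# Made-up vectors; real ones would come from model.get_word_vector
toy_vectors = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.82, 0.15],
    "apple": [0.1, 0.2, 0.9],
}
print(nearest_neighbors("king", toy_vectors, k=2))
```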

10. Getting N-grams from the Model

  # get_subwords returns two aligned sequences: the subword strings and their indices
  subwords, indices = model.get_subwords("example")
  print("Generated n-grams:", subwords)

get_subwords: Retrieves subwords (e.g., n-grams) for a given word.
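
To make the idea of subwords concrete, the sketch below generates character n-grams the way the fastText paper describes them: the word is wrapped in the boundary markers < and >, and every substring with a length between minn and maxn is collected. This is an illustration only. The real library also hashes n-grams into buckets and may order them differently, and the function name char_ngrams is an assumption for this example.

```python
def char_ngrams(word, minn=3, maxn=6):
    """List character n-grams of '<word>' with lengths minn..maxn."""
    wrapped = f"<{word}>"
    ngrams = []
    for n in range(minn, maxn + 1):
        for start in range(len(wrapped) - n + 1):
            ngrams.append(wrapped[start:start + n])
    return ngrams


print(char_ngrams("where", minn=3, maxn=3))
# → ['<wh', 'whe', 'her', 'ere', 're>']
```

Because an unseen word still shares many of these n-grams with known words, a model trained with subwords can build a vector for it.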

11. Checking Dimensions of Word Vectors

  dims = model.get_dimension()
  print("Dimension of Word Vectors:", dims)

get_dimension: Returns the number of dimensions used in the word embeddings.

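12. Setting the Learning Rate for Training

  # The learning rate is chosen when training starts; fastText decays it as training progresses
  model = fasttext.train_supervised(input="data.txt", lr=0.05)

lr: Sets the initial learning rate. The official Python bindings have no method for changing it on an existing model, so retrain with a different lr value to adjust it.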

13. Get the Labels from the Model

  # Retrieve labels from a classification model
  labels = model.get_labels()
  print("Labels:", labels)

get_labels: Lists all possible labels in a classification task.

14. Saving and Exporting Word Vectors

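  # save_model always writes fastText's binary format, whatever the file extension
  word_vector_model.save_model("word_vectors.bin")

save_model: Saves the full model (word vectors or classifier) in binary form. For a plain-text export of the vectors, see item 18.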

15. Checking the Learning Rate of the Model

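  # Training arguments are exposed as attributes on a model returned by train_supervised or train_unsupervised
  lr = getattr(model, "lr", None)
  print("Learning Rate:", lr)

lr: The official bindings have no get_learning_rate method. In version 0.9.2, training arguments such as lr, epoch, and dim are set as attributes on a freshly trained model; they may be missing on a model loaded from disk, which is why the snippet falls back to None.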

16. Evaluate Word Similarity

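  import numpy as np

  # Compute cosine similarity between two word vectors
  a, b = model.get_word_vector("king"), model.get_word_vector("queen")
  similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
  print("Word similarity:", similarity)

Cosine similarity: The official Python API has no get_word_similarity method, so the similarity is computed directly from the two word vectors.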

17. Create and Train CBOW Model

  cbow_model = fasttext.train_unsupervised(input="text.txt", model="cbow")

train_unsupervised (cbow): Trains a model using the CBOW method for word embeddings.

18. Export Text Representations

  # Export only word vectors to a text file, one word per line
  with open('word_vectors.txt', 'w', encoding='utf-8') as f:
      for word in model.get_words():
          vector = model.get_word_vector(word)
          f.write(f"{word} {' '.join(map(str, vector))}\n")

Extracts word vectors and exports them to a file for external use.
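
A file in this "word v1 v2 ..." layout is easy to read back without FastText. The parser below is a minimal sketch that assumes one word per line followed by space-separated floats and no header line. The function name parse_vector_lines is made up for this example.

```python
def parse_vector_lines(lines):
    """Parse 'word v1 v2 ...' lines into a {word: [floats]} dictionary."""
    vectors = {}
    for line in lines:
        parts = line.rstrip("\n").split(" ")
        if len(parts) < 2:
            continue  # skip blank or malformed lines
        vectors[parts[0]] = [float(value) for value in parts[1:]]
    return vectors


# Hand-written sample lines in the exported layout
sample = ["hello 0.1 -0.2 0.3\n", "world 0.4 0.5 -0.6\n"]
print(parse_vector_lines(sample))
```

An open file object can be passed in place of the sample list, since iterating over a file yields its lines.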

19. Parallel Model Training

FastText supports multithreading for faster training.

  model = fasttext.train_supervised(input="data.txt", thread=4)

thread: Specifies the number of threads to be used for training.

20. Checking N-gram Range

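Check the maximum word n-gram length the model was trained with.

  # wordNgrams is the maximum word n-gram length (1 means unigrams only)
  ngrams_max = getattr(model, "wordNgrams", None)
  print("Max word n-gram length:", ngrams_max)

wordNgrams: The official bindings have no get_args method. In version 0.9.2, hyperparameters such as wordNgrams, minn, and maxn are available as attributes on a freshly trained model and may be missing on a model loaded from disk.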

Generic Application: Document Classification Pipeline

Here’s an example application using FastText to classify news articles into predefined categories.

Dataset Format

Prepare a dataset (data.txt) in the following format:

  __label__sports A thrilling football match took place last night.
  __label__politics The newly elected president addressed the nation.
  ...
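
If the raw data is a list of (label, text) pairs, lines in this format can be produced with a small helper. The sketch below assumes one example per line and collapses embedded newlines, since each training example has to fit on a single line. The function name to_fasttext_line is made up for this example.

```python
def to_fasttext_line(label, text, prefix="__label__"):
    """Build one training line: prefixed label, a space, single-line text."""
    clean_label = label.strip().replace(" ", "_")
    clean_text = " ".join(text.split())  # collapse newlines and repeated spaces
    return f"{prefix}{clean_label} {clean_text}"


examples = [
    ("sports", "A thrilling football match\ntook place last night."),
    ("politics", "The newly elected president addressed the nation."),
]

for label, text in examples:
    print(to_fasttext_line(label, text))
```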

Application Code

  import fasttext

  # 1. Train the model
  model = fasttext.train_supervised(input="data.txt", lr=0.5, epoch=30, wordNgrams=2)

  # 2. Save the trained model
  model.save_model("news_classifier.bin")

  # 3. Load the model for inference
  loaded_model = fasttext.load_model("news_classifier.bin")

  # 4. Predict category for new text input
  text = "The basketball team won the championship game last evening."
  labels, probabilities = loaded_model.predict(text, k=1)

  print("Predicted Categories:", labels)
  print("Confidence Scores:", probabilities)

  # 5. Check model's labels
  print("Categories available in the model:", loaded_model.get_labels())

Output Example (illustrative values)

  Predicted Categories: ('__label__sports',)
  Confidence Scores: [0.9876]
  Categories available in the model: ['__label__sports', '__label__politics', '__label__technology']
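
The pipeline trains on the whole of data.txt, so it gives no estimate of how well the classifier generalizes. A common remedy is to hold out part of the labeled lines before training. The helper below is a minimal sketch of such a split; the function name split_lines, the 20% fraction, and the seed are assumptions for this example.

```python
import random


def split_lines(lines, valid_fraction=0.2, seed=42):
    """Shuffle labeled lines and split them into (train, valid) lists."""
    shuffled = list(lines)
    random.Random(seed).shuffle(shuffled)  # seeded for a reproducible split
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]


# Placeholder lines standing in for the contents of data.txt
lines = [f"__label__sports example {i}" for i in range(10)]
train, valid = split_lines(lines)
print(len(train), len(valid))
# → 8 2
```

Each list can be written to its own file. The training file goes to train_supervised, and the held-out file can be passed to the model's test method, which reports the number of examples along with precision and recall.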

This example shows the full workflow for multi-class document classification: train, save, reload, and predict. Actual accuracy depends on the size and quality of the labeled data.

With a small Python API and fast training, FastText is a practical choice for a wide range of NLP tasks.