FastText: A Comprehensive Introduction
FastText is an open-source, efficient, and scalable text representation and classification library developed by Facebook’s Artificial Intelligence Research (FAIR) team. It is designed to handle large-scale datasets, offering fast training and inference for text-based AI tasks. Unlike traditional natural language processing (NLP) methods that treat words as atomic entities, FastText leverages subword information, making it robust to out-of-vocabulary words and spelling variations.
FastText is mainly used for text classification and for learning word representations (word embeddings); the resulting vectors often serve as input features for other NLP systems. It supports both supervised training (classification) and unsupervised training (embeddings), so one library covers a range of NLP problems.
With its intuitive interface and support for pre-trained models, FastText is a popular choice for many NLP applications. Additionally, it provides APIs for Python, simplifying its integration into existing workflows.
FastText APIs with Explanations and Code Snippets
The sections below walk through 20 commonly used FastText functions and options, each with a short code example.
1. Training a Supervised Text Classification Model
FastText can train models for supervised learning tasks like text classification.
import fasttext

# Train a supervised classification model
model = fasttext.train_supervised(input="data.txt", lr=0.1, epoch=25, wordNgrams=2)

# Save the model
model.save_model("model.bin")
train_supervised: Trains a supervised model with parameters like learning rate (lr), number of epochs (epoch), and word n-grams (wordNgrams).
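To make the wordNgrams setting more concrete, here is a minimal sketch of what word n-gram features look like for a sentence. The helper name word_ngrams is hypothetical and not part of FastText; the library builds and hashes these features internally rather than storing them as strings, so treat this only as an illustration of the idea.

```python
def word_ngrams(tokens, max_n=2):
    """Return all contiguous word n-grams of length 1..max_n, shortest first.

    Hypothetical helper for illustration; FastText hashes such features
    internally instead of keeping them as strings.
    """
    grams = []
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams


# With max_n=2 the classifier sees single words plus adjacent word pairs.
features = word_ngrams("the match was thrilling".split(), max_n=2)
print(features)
```

With wordNgrams=2, pairs such as "was thrilling" become features alongside the individual words, which lets the classifier pick up some local word order.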
2. Loading a Pre-Trained Model
import fasttext

# Load a previously trained model
model = fasttext.load_model("model.bin")

# Use the loaded model
print(model.predict("Sample text to classify"))
load_model: Loads a saved FastText model for inference. The official Python bindings do not offer a way to continue training a loaded model.
3. Predicting a Label for a Text
After training a model, you can use it to predict a label for input text.
# Predict with the model
labels, probabilities = model.predict("Hello, I need assistance.", k=2)
print("Labels:", labels)
print("Probabilities:", probabilities)
predict: Returns the top k labels and their probabilities for a given text.
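Predicted labels come back with the __label__ prefix attached, so a small post-processing step is often handy. The sketch below assumes predict returns a sequence of label strings and a matching sequence of probabilities; format_predictions is a hypothetical helper, and the example values are made up so the snippet runs without a trained model.

```python
def format_predictions(labels, probabilities, prefix="__label__"):
    """Pair each label (prefix removed) with its probability as a float."""
    results = []
    for label, prob in zip(labels, probabilities):
        name = label[len(prefix):] if label.startswith(prefix) else label
        results.append((name, float(prob)))
    return results


# Stand-in values shaped like the two return values of model.predict(text, k=2).
example_labels = ("__label__support", "__label__billing")
example_probs = [0.75, 0.25]
print(format_predictions(example_labels, example_probs))
```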
4. Getting Word Vector for a Word
# Get word vector for a specific word
vector = model.get_word_vector("hello")
print("Word Vector for 'hello':", vector)
get_word_vector: Retrieves the vector representation for a given word.
5. Getting Sentence Vector (Text Representation)
Generates a vector representation for an entire sentence or text.
sentence_vector = model.get_sentence_vector("This is an example sentence.")
print("Sentence Vector:", sentence_vector)
get_sentence_vector: Computes a vector representation for a piece of text.
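As a rough intuition for what a sentence vector is, the sketch below averages length-normalized word vectors. This is a simplified illustration rather than the library's exact procedure, which depends on the model type; average_vector is a hypothetical helper, and the toy vectors are made up.

```python
import math


def average_vector(word_vectors):
    """Average L2-normalized vectors, skipping all-zero vectors."""
    if not word_vectors:
        return []
    dim = len(word_vectors[0])
    total = [0.0] * dim
    count = 0
    for vec in word_vectors:
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            for i, x in enumerate(vec):
                total[i] += x / norm
            count += 1
    # If every vector was zero, return the zero vector unchanged.
    return [t / count for t in total] if count else total


# Two toy 2-dimensional "word vectors" standing in for real embeddings.
sentence = average_vector([[3.0, 4.0], [0.0, 2.0]])
print(sentence)
```

Normalizing before averaging keeps a single word with a very large vector from dominating the result.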
6. Retrieve Vocabulary from the Model
# Get words in the vocabulary
words = model.get_words()
print("Words in Vocabulary:", words)
get_words: Outputs the vocabulary used in the model.
7. Quantizing a Model to Reduce Size
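Quantization compresses a trained supervised model, which sharply reduces its file size and memory use, usually at a small cost in accuracy.
# The input file is only consulted if the model is retrained during quantization
model.quantize(input="data.txt")
model.save_model("quantized_model.ftz")
quantize: Compresses a supervised model to reduce storage and memory footprint.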
8. Training Word Vectors
# Unsupervised word vector training
word_vector_model = fasttext.train_unsupervised(input="text.txt", model="skipgram", dim=100)
word_vector_model.save_model("word_vectors.bin")
train_unsupervised: Trains unsupervised word embeddings using methods like skipgram or cbow.
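To illustrate the difference in spirit between the two methods, the sketch below generates the (center, context) pairs that a skipgram-style objective learns from: each word is used to predict its neighbors inside a window. CBOW goes the other way and predicts the center word from its surrounding context. The function skipgram_pairs is hypothetical and simplified; the real training loop also varies the window size and subsamples frequent words.

```python
def skipgram_pairs(tokens, window=2):
    """Return (center, context) pairs for every word within the window."""
    pairs = []
    for i, center in enumerate(tokens):
        start = max(0, i - window)
        stop = min(len(tokens), i + window + 1)
        for j in range(start, stop):
            if j != i:
                pairs.append((center, tokens[j]))
    return pairs


pairs = skipgram_pairs(["the", "cat", "sat"], window=1)
print(pairs)
```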
9. Retrieving Nearest Neighbors
# Find nearest neighbors for a word; each result is a (similarity, word) pair
neighbors = word_vector_model.get_nearest_neighbors("king", k=5)
print("Nearest Neighbors:", neighbors)
get_nearest_neighbors: Finds words with vector representations most similar to a given word.
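Conceptually, a nearest-neighbor lookup ranks every other word by cosine similarity to the query. The sketch below shows that idea on a tiny hand-made vocabulary; cosine and nearest_neighbors are hypothetical helpers, the vectors are invented for the example, and the library's own implementation is more optimized than this brute-force loop.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def nearest_neighbors(query, vectors, k=5):
    """Return up to k (similarity, word) pairs, most similar first, excluding the query."""
    scored = [
        (cosine(vectors[query], vec), word)
        for word, vec in vectors.items()
        if word != query
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]


# Invented 2-dimensional vectors, only to make the ranking visible.
toy_vectors = {"king": [1.0, 0.9], "queen": [0.9, 1.0], "apple": [-1.0, 0.2]}
print(nearest_neighbors("king", toy_vectors, k=2))
```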
10. Getting N-grams from the Model
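# get_subwords returns two values: the subword strings and their row indices
subwords, indices = model.get_subwords("example")
print("Subwords:", subwords)
get_subwords: Returns the subwords of a given word (its character n-grams, plus the word itself if it is in the vocabulary) together with their indices.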
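To see where subwords come from, the sketch below extracts character n-grams the way FastText is usually described as doing it: the word is wrapped in < and > boundary markers, and every substring of length minn to maxn is collected. The function char_ngrams is a hypothetical helper; the ordering of the output and some edge cases may differ from what the library returns.

```python
def char_ngrams(word, minn=3, maxn=6):
    """Return character n-grams of length minn..maxn from '<word>'."""
    wrapped = "<" + word + ">"
    grams = []
    for n in range(minn, maxn + 1):
        for i in range(len(wrapped) - n + 1):
            grams.append(wrapped[i:i + n])
    return grams


# Only 3-grams, to keep the output short.
print(char_ngrams("where", minn=3, maxn=3))
```

Because an unseen or misspelled word still shares many of these n-grams with known words, the model can build a reasonable vector for it.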
11. Checking Dimensions of Word Vectors
dims = model.get_dimension()
print("Dimension of Word Vectors:", dims)
get_dimension: Returns the number of dimensions used in the word embeddings.
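12. Controlling the Learning Rate During Training
# The learning rate is set when training starts; FastText then decays it as training proceeds.
# lrUpdateRate controls how often the rate is updated.
model = fasttext.train_supervised(input="data.txt", lr=0.05, lrUpdateRate=100)
lr, lrUpdateRate: The official Python bindings do not provide a set_learning_rate method. Pass the initial rate as lr, and use lrUpdateRate to change how often it is updated during training.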
13. Get the Labels from the Model
# Retrieve labels from a classification model
labels = model.get_labels()
print("Labels:", labels)
get_labels: Lists all possible labels in a classification task.
14. Saving and Exporting Word Vectors
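# Save the full model; save_model always writes FastText's binary format, whatever the file extension
word_vector_model.save_model("word_vectors.bin")
save_model: Saves the model (word vectors or classifier) in binary form. It does not produce a plain-text .vec file; section 18 shows a text export.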
15. Checking the Learning Rate of the Model
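# A model returned by a training function exposes its training arguments as attributes
lr = model.lr
print("Learning Rate:", lr)
lr attribute: Holds the learning rate the model was trained with. The official bindings have no get_learning_rate method, and a model restored with load_model may not carry this attribute.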
16. Evaluate Word Similarity
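import numpy as np
# The bindings have no built-in word-similarity call, so compute cosine similarity directly
a = model.get_word_vector("king")
b = model.get_word_vector("queen")
similarity = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))
print("Word similarity:", similarity)
Cosine similarity: Computed here with NumPy from two get_word_vector results, since the official bindings do not include a get_word_similarity method.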
17. Create and Train CBOW Model
cbow_model = fasttext.train_unsupervised(input="text.txt", model="cbow")
train_unsupervised (cbow): Trains a model using the CBOW method for word embeddings.
18. Export Text Representations
# Export word vectors to a text file, one word per line
with open("word_vectors.txt", "w", encoding="utf-8") as f:
    for word in model.get_words():
        vector = model.get_word_vector(word)
        f.write(f"{word} {' '.join(map(str, vector))}\n")
Writes each vocabulary word and its vector to a text file for external use.
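Many tools that read word embeddings expect the textual .vec layout: a header line with the word count and the vector dimension, followed by one word per line. The sketch below writes that layout from a plain dictionary so it runs without a trained model; write_vec is a hypothetical helper, and the four-decimal formatting is an arbitrary choice for the example.

```python
import io


def write_vec(vectors, handle):
    """Write {word: vector} as a 'count dim' header followed by one word per line."""
    words = list(vectors)
    dim = len(vectors[words[0]]) if words else 0
    handle.write(f"{len(words)} {dim}\n")
    for word in words:
        values = " ".join(f"{x:.4f}" for x in vectors[word])
        handle.write(f"{word} {values}\n")


# An in-memory buffer stands in for a real file opened with open(path, "w").
buffer = io.StringIO()
write_vec({"hello": [0.1, -0.2], "world": [0.3, 0.4]}, buffer)
print(buffer.getvalue())
```

With a real model, the dictionary could be built from get_words and get_word_vector before calling the helper.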
19. Parallel Model Training
FastText supports multithreading for faster training.
model = fasttext.train_supervised(input="data.txt", thread=4)
thread: Specifies the number of threads to be used for training.
20. Checking N-gram Range
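Read the n-gram settings the model was trained with.
# wordNgrams is the maximum word n-gram length; minn and maxn bound the character n-grams
print("Max word n-gram length:", model.wordNgrams)
print("Character n-gram range:", model.minn, "to", model.maxn)
wordNgrams, minn, maxn: Training hyperparameters exposed as attributes on a model returned by a training function. The official bindings have no get_args method.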
Generic Application: Document Classification Pipeline
Here’s an example application using FastText to classify news articles into predefined categories.
Dataset Format
Prepare a dataset (data.txt) in the following format:
__label__sports A thrilling football match took place last night.
__label__politics The newly elected president addressed the nation.
...
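If your raw data lives in Python objects rather than a ready-made file, a small helper can produce lines in this format. The sketch below is one way to do it; to_fasttext_line is a hypothetical name, and collapsing whitespace is an assumption made here so that each example stays on a single line.

```python
def to_fasttext_line(labels, text, prefix="__label__"):
    """Build one training line: prefixed labels followed by the text on a single line."""
    cleaned = " ".join(text.split())  # collapse newlines, tabs, and repeated spaces
    tags = " ".join(prefix + label for label in labels)
    return f"{tags} {cleaned}"


examples = [
    (["sports"], "A thrilling football match\ntook place last night."),
    (["politics"], "The newly elected president addressed the nation."),
]
lines = [to_fasttext_line(labels, text) for labels, text in examples]
print("\n".join(lines))
```

Writing these lines to data.txt, one per example, gives a file that train_supervised can read.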
Application Code
import fasttext

# 1. Train the model
model = fasttext.train_supervised(input="data.txt", lr=0.5, epoch=30, wordNgrams=2)

# 2. Save the trained model
model.save_model("news_classifier.bin")

# 3. Load the model for inference
loaded_model = fasttext.load_model("news_classifier.bin")

# 4. Predict the single most likely category for new text (raise k for more candidates)
text = "The basketball team won the championship game last evening."
labels, probabilities = loaded_model.predict(text, k=1)
print("Predicted Categories:", labels)
print("Confidence Scores:", probabilities)

# 5. Check model's labels
print("Categories available in the model:", loaded_model.get_labels())
Output Example (illustrative; actual scores depend on the training data)
Predicted Categories: ('__label__sports',)
Confidence Scores: [0.9876]
Categories available in the model: ['__label__sports', '__label__politics', '__label__technology']
This example shows how a handful of FastText calls cover training, saving, loading, and prediction for multi-class document classification.
With its small Python API and fast training, FastText is a practical tool for many text classification and word-representation tasks.