Scikit-learn: The Go-to Library for Machine Learning in Python
When it comes to building machine learning models in Python, Scikit-learn is one of the most widely used and versatile libraries in the ecosystem. Built on top of NumPy, SciPy, and Matplotlib, Scikit-learn provides simple and efficient tools for data preprocessing, model training, and evaluation. Whether you're a beginner or an expert, its consistent API, rich documentation, and broad feature set make it an excellent choice for developing machine learning workflows.
Some of the standout features of Scikit-learn include:
- Preprocessing: Tools for data preprocessing and feature engineering.
- Model Selection: Utilities for optimizing and evaluating models.
- Supervised Learning: Algorithms for classification and regression problems.
- Unsupervised Learning: Algorithms for clustering and dimensionality reduction.
- Pipeline: Integration of preprocessing and modeling steps for streamlined workflows.
In this post, we'll walk through key Scikit-learn APIs with examples that demonstrate their use, and finish with a complete example that combines several of them into a single workflow.
Key Scikit-learn APIs with Code Snippets
Here, we walk through 15 essential Scikit-learn APIs, each with a short, runnable snippet.
1. train_test_split
Splits the dataset into training and testing subsets.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load dataset
X, y = load_iris(return_X_y=True)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Train size:", X_train.shape, " | Test size:", X_test.shape)
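For classification problems it is often worth preserving the class proportions in both subsets; a small follow-up sketch using the stratify parameter (same X and y as above):

# Stratified split: train and test keep the original class balance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)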
2. StandardScaler
Standardizes features by scaling to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler

# Example dataset
data = [[1, 2], [2, 3], [3, 4]]

# Standardize the dataset
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
print("Scaled Data:\n", data_scaled)
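The fitted statistics live on the scaler object, which is what lets you apply the identical transform to new data later:

# Per-feature mean and standard deviation learned during fit
print("Means:", scaler.mean_, "| Scales:", scaler.scale_)
# Reuse the fitted transform on unseen data (illustrative values)
print("New sample scaled:", scaler.transform([[4, 5]]))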
3. MinMaxScaler
Scales features to a specific range, typically [0, 1].
from sklearn.preprocessing import MinMaxScaler

# Example data
data = [[1, 2], [2, 3], [3, 4]]

# Scale data to [0, 1] range
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)
print("Scaled Data:\n", data_scaled)
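The target range is configurable via feature_range; for instance, scaling to [-1, 1] instead:

# Scale to [-1, 1] rather than the default [0, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
print(scaler.fit_transform(data))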
4. LabelEncoder
Encodes categorical labels as integers.
from sklearn.preprocessing import LabelEncoder

labels = ['cat', 'dog', 'cat', 'mouse']

encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)
print("Encoded Labels:", encoded_labels)
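The fitted encoder stores the label mapping, so the integer codes can be decoded back:

# Classes are stored in sorted order; inverse_transform recovers the strings
print("Classes:", encoder.classes_)
print("Decoded:", encoder.inverse_transform(encoded_labels))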
5. OneHotEncoder
Encodes categorical features as one-hot numeric arrays.
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Example categorical data
data = np.array(['cat', 'dog', 'mouse']).reshape(-1, 1)

encoder = OneHotEncoder()
one_hot_encoded = encoder.fit_transform(data).toarray()
print("One-Hot Encoded:\n", one_hot_encoded)
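By default, transforming a category never seen during fit raises an error; handle_unknown='ignore' encodes it as an all-zero row instead, which is often safer inside pipelines:

# Unseen categories become all-zero rows instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore').fit(data)
print(encoder.transform(np.array([['bird']])).toarray())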
6. PCA (Principal Component Analysis)
Reduces dimensionality of data while preserving variance.
from sklearn.decomposition import PCA
import numpy as np

# Sample data: 100 samples, 5 features
data = np.random.rand(100, 5)

# Reduce to 2 components
pca = PCA(n_components=2)
data_reduced = pca.fit_transform(data)
print("Reduced Data Shape:", data_reduced.shape)
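To judge how much information the projection retains, inspect the variance captured by each component:

# Fraction of the total variance explained by each of the 2 components
print("Explained variance ratio:", pca.explained_variance_ratio_)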
7. KMeans
Performs K-Means clustering.
from sklearn.cluster import KMeans
import numpy as np

# Example dataset: 50 samples, 2 features
data = np.random.rand(50, 2)

# Perform clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(data)

# Cluster centers
print("Cluster Centers:\n", kmeans.cluster_centers_)
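The fitted model exposes the per-sample assignments and can place new points into the nearest cluster:

# Cluster index assigned to each training sample
print("Labels:", kmeans.labels_[:10])
# Assign a new point (illustrative coordinates) to its nearest centroid
print("New point cluster:", kmeans.predict([[0.5, 0.5]]))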
8. LogisticRegression
Fits a logistic regression model for classification.
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load data
X, y = load_iris(return_X_y=True)

# Fit logistic regression (raise max_iter so the solver converges on unscaled features)
model = LogisticRegression(max_iter=200)
model.fit(X, y)

# Predict
predictions = model.predict(X)
print("Predictions:", predictions)
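When you need probabilities rather than hard class labels, use predict_proba:

# Per-class probability estimates for the first sample
print("Probabilities:", model.predict_proba(X[:1]))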
9. LinearRegression
Fits a simple linear regression model.
from sklearn.linear_model import LinearRegression

# Example data
X = [[1], [2], [3]]
y = [1, 2, 3]

model = LinearRegression()
model.fit(X, y)
print("Coefficient:", model.coef_)
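The intercept is available alongside the coefficients, and the fitted line can be applied to new inputs:

print("Intercept:", model.intercept_)
# Predict for a new input (illustrative value)
print("Prediction for x=4:", model.predict([[4]]))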
10. RandomForestClassifier
Fits a Random Forest classifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load data
X, y = load_iris(return_X_y=True)

# Train Random Forest
clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit(X, y)
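A quick sanity check of the fit; note that accuracy on the training data is optimistic, so prefer the cross-validation shown in example 12 for a realistic estimate:

# Accuracy on the training data (optimistic; see cross_val_score below)
print("Training accuracy:", clf.score(X, y))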
11. GridSearchCV
Performs hyperparameter tuning via grid search.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Data, parameter grid, and model
X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [10, 50, 100], "max_depth": [3, 5, None]}
model = RandomForestClassifier()

# Grid search
grid = GridSearchCV(model, param_grid=param_grid, cv=3)
grid.fit(X, y)
print("Best Parameters:", grid.best_params_)
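Beyond the best parameters, the cross-validated score and the refit best model are available directly on the fitted search object:

print("Best CV Score:", grid.best_score_)
best_model = grid.best_estimator_  # refit on the full data with the best parameters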
12. cross_val_score
Performs cross-validation and returns scores.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load data and evaluate with 5-fold cross-validation
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier()
scores = cross_val_score(model, X, y, cv=5)
print("Cross-Validation Scores:", scores)
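Individual fold scores vary, so it's common to report their mean and spread:

print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))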
13. ConfusionMatrixDisplay
Displays a confusion matrix.
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Fit a model and predict
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier().fit(X, y)
y_pred = clf.predict(X)

# Plot the confusion matrix from the predictions
ConfusionMatrixDisplay.from_predictions(y, y_pred)
plt.show()
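Alternatively, from_estimator computes the predictions internally from a fitted estimator:

# Same plot, computed directly from the fitted model
ConfusionMatrixDisplay.from_estimator(clf, X, y)
plt.show()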
14. Pipeline
Chains preprocessing steps with model training.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load and split data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Chain scaling and classification into one estimator
pipeline = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
pipeline.fit(X_train, y_train)
print("Test accuracy:", pipeline.score(X_test, y_test))
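A pipeline also combines naturally with GridSearchCV from example 11: a step's parameters are addressed as '<step name>__<parameter>'. A minimal sketch tuning the regularization strength C of the LogisticRegression step (the candidate values are illustrative):

from sklearn.model_selection import GridSearchCV

# 'model__C' targets the C parameter of the 'model' step inside the pipeline
param_grid = {'model__C': [0.1, 1.0, 10.0]}
grid = GridSearchCV(pipeline, param_grid=param_grid, cv=3)
grid.fit(X_train, y_train)
print("Best Parameters:", grid.best_params_)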
15. Feature Importance (RandomForestClassifier)
Accesses the feature importances learned by tree-based models.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Fit a Random Forest and inspect each feature's contribution
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)
print("Feature Importances:", clf.feature_importances_)