Scikit-learn: The Go-to Library for Machine Learning in Python
When it comes to building machine learning models in Python, Scikit-learn is one of the most widely used and versatile libraries in the ecosystem. Built on top of NumPy, SciPy, and Matplotlib, Scikit-learn provides simple and efficient tools for data preprocessing, model training, and evaluation. Whether you're a beginner or an expert, its consistent API, rich documentation, and broad feature set make it an excellent choice for developing machine learning workflows.
Some of the standout features of Scikit-learn include:
- Preprocessing: Tools for data preprocessing and feature engineering.
- Model Selection: Utilities for optimizing and evaluating models.
- Supervised Learning: Algorithms for classification and regression problems.
- Unsupervised Learning: Algorithms for clustering and dimensionality reduction.
- Pipeline: Integration of preprocessing and modeling steps for streamlined workflows.
In this post, we'll walk through key Scikit-learn APIs with examples that demonstrate their use, and finish with a complete example that combines several of them into a single workflow.
Key Scikit-learn APIs with Code Snippets
Here, we walk through 15 essential Scikit-learn APIs, each with a short, runnable snippet.
1. train_test_split
Splits the dataset into training and testing subsets.
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load dataset
X, y = load_iris(return_X_y=True)

# Split dataset
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
print("Train size:", X_train.shape, " | Test size:", X_test.shape)
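For classification problems it is often worth preserving the class proportions in both subsets; a small follow-up sketch using the stratify parameter (same X and y as above):

# Stratified split: train and test keep the original class balance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42, stratify=y)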
2. StandardScaler
Standardizes features by scaling to zero mean and unit variance.
from sklearn.preprocessing import StandardScaler

# Example dataset
data = [[1, 2], [2, 3], [3, 4]]

# Standardize the dataset
scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)
print("Scaled Data:\n", data_scaled)
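The fitted statistics live on the scaler object, which is what lets you apply the identical transform to new data later:

# Per-feature mean and standard deviation learned during fit
print("Means:", scaler.mean_, "| Scales:", scaler.scale_)
# Reuse the fitted transform on unseen data (illustrative values)
print("New sample scaled:", scaler.transform([[4, 5]]))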
3. MinMaxScaler
Scales features to a specific range, typically [0, 1].
from sklearn.preprocessing import MinMaxScaler

# Example data
data = [[1, 2], [2, 3], [3, 4]]

# Scale data to [0, 1] range
scaler = MinMaxScaler()
data_scaled = scaler.fit_transform(data)
print("Scaled Data:\n", data_scaled)
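The target range is configurable via feature_range; for instance, scaling to [-1, 1] instead:

# Scale to [-1, 1] rather than the default [0, 1]
scaler = MinMaxScaler(feature_range=(-1, 1))
print(scaler.fit_transform(data))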
4. LabelEncoder
Encodes categorical labels as integers.
from sklearn.preprocessing import LabelEncoder

labels = ['cat', 'dog', 'cat', 'mouse']

encoder = LabelEncoder()
encoded_labels = encoder.fit_transform(labels)
print("Encoded Labels:", encoded_labels)
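The fitted encoder stores the label mapping, so the integer codes can be decoded back:

# Classes are stored in sorted order; inverse_transform recovers the strings
print("Classes:", encoder.classes_)
print("Decoded:", encoder.inverse_transform(encoded_labels))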
5. OneHotEncoder
Encodes categorical features as one-hot numeric arrays.
from sklearn.preprocessing import OneHotEncoder
import numpy as np

# Example categorical data
data = np.array(['cat', 'dog', 'mouse']).reshape(-1, 1)

encoder = OneHotEncoder()
one_hot_encoded = encoder.fit_transform(data).toarray()
print("One-Hot Encoded:\n", one_hot_encoded)
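By default, transforming a category never seen during fit raises an error; handle_unknown='ignore' encodes it as an all-zero row instead, which is often safer inside pipelines:

# Unseen categories become all-zero rows instead of raising an error
encoder = OneHotEncoder(handle_unknown='ignore').fit(data)
print(encoder.transform(np.array([['bird']])).toarray())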
6. PCA (Principal Component Analysis)
Reduces dimensionality of data while preserving variance.
from sklearn.decomposition import PCA
import numpy as np

# Sample data: 100 samples, 5 features
data = np.random.rand(100, 5)

# Reduce to 2 components
pca = PCA(n_components=2)
data_reduced = pca.fit_transform(data)
print("Reduced Data Shape:", data_reduced.shape)
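To judge how much information the projection retains, inspect the variance captured by each component:

# Fraction of the total variance explained by each of the 2 components
print("Explained variance ratio:", pca.explained_variance_ratio_)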
7. KMeans
Performs K-Means clustering.
from sklearn.cluster import KMeans
import numpy as np

# Example dataset: 50 samples, 2 features
data = np.random.rand(50, 2)

# Perform clustering
kmeans = KMeans(n_clusters=3, random_state=42)
kmeans.fit(data)

# Cluster centers
print("Cluster Centers:\n", kmeans.cluster_centers_)
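The fitted model exposes the per-sample assignments and can place new points into the nearest cluster:

# Cluster index assigned to each training sample
print("Labels:", kmeans.labels_[:10])
# Assign a new point (illustrative coordinates) to its nearest centroid
print("New point cluster:", kmeans.predict([[0.5, 0.5]]))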
8. LogisticRegression
Fits a logistic regression model for classification.
from sklearn.linear_model import LogisticRegression
from sklearn.datasets import load_iris

# Load data
X, y = load_iris(return_X_y=True)

# Fit logistic regression (raise max_iter so the solver converges on unscaled features)
model = LogisticRegression(max_iter=200)
model.fit(X, y)

# Predict
predictions = model.predict(X)
print("Predictions:", predictions)
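When you need probabilities rather than hard class labels, use predict_proba:

# Per-class probability estimates for the first sample
print("Probabilities:", model.predict_proba(X[:1]))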
9. LinearRegression
Fits a simple linear regression model.
from sklearn.linear_model import LinearRegression

# Example data
X = [[1], [2], [3]]
y = [1, 2, 3]

model = LinearRegression()
model.fit(X, y)
print("Coefficient:", model.coef_)
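The intercept is available alongside the coefficients, and the fitted line can be applied to new inputs:

print("Intercept:", model.intercept_)
# Predict for a new input (illustrative value)
print("Prediction for x=4:", model.predict([[4]]))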
10. RandomForestClassifier
Fits a Random Forest classifier.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load data
X, y = load_iris(return_X_y=True)

# Train Random Forest
clf = RandomForestClassifier(n_estimators=10, random_state=42)
clf.fit(X, y)
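A quick sanity check of the fit; note that accuracy on the training data is optimistic, so prefer the cross-validation shown in example 12 for a realistic estimate:

# Accuracy on the training data (optimistic; see cross_val_score below)
print("Training accuracy:", clf.score(X, y))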
11. GridSearchCV
Performs hyperparameter tuning via grid search.
from sklearn.model_selection import GridSearchCV
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Data, parameter grid, and model
X, y = load_iris(return_X_y=True)
param_grid = {"n_estimators": [10, 50, 100], "max_depth": [3, 5, None]}
model = RandomForestClassifier()

# Grid search
grid = GridSearchCV(model, param_grid=param_grid, cv=3)
grid.fit(X, y)
print("Best Parameters:", grid.best_params_)
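Beyond the best parameters, the cross-validated score and the refit best model are available directly on the fitted search object:

print("Best CV Score:", grid.best_score_)
best_model = grid.best_estimator_  # refit on the full data with the best parameters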
12. cross_val_score
Performs cross-validation and returns scores.
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Load data and evaluate with 5-fold cross-validation
X, y = load_iris(return_X_y=True)
model = RandomForestClassifier()
scores = cross_val_score(model, X, y, cv=5)
print("Cross-Validation Scores:", scores)
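Individual fold scores vary, so it's common to report their mean and spread:

print("Mean accuracy: %.3f (+/- %.3f)" % (scores.mean(), scores.std()))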
13. ConfusionMatrixDisplay
Displays a confusion matrix.
from sklearn.metrics import ConfusionMatrixDisplay
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt

# Fit a model and predict
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier().fit(X, y)
y_pred = clf.predict(X)

# Plot the confusion matrix from the predictions
ConfusionMatrixDisplay.from_predictions(y, y_pred)
plt.show()
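Alternatively, from_estimator computes the predictions internally from a fitted estimator:

# Same plot, computed directly from the fitted model
ConfusionMatrixDisplay.from_estimator(clf, X, y)
plt.show()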
14. Pipeline
Chains preprocessing steps with model training.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.datasets import load_iris

# Load and split data
X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# Chain scaling and classification into one estimator
pipeline = Pipeline([('scaler', StandardScaler()), ('model', LogisticRegression())])
pipeline.fit(X_train, y_train)
print("Test accuracy:", pipeline.score(X_test, y_test))
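A pipeline also combines naturally with GridSearchCV from example 11: a step's parameters are addressed as '<step name>__<parameter>'. A minimal sketch tuning the regularization strength C of the LogisticRegression step (the candidate values are illustrative):

from sklearn.model_selection import GridSearchCV

# 'model__C' targets the C parameter of the 'model' step inside the pipeline
param_grid = {'model__C': [0.1, 1.0, 10.0]}
grid = GridSearchCV(pipeline, param_grid=param_grid, cv=3)
grid.fit(X_train, y_train)
print("Best Parameters:", grid.best_params_)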
15. Feature Importance (RandomForestClassifier)
Accesses the feature importances learned by tree-based models.
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import load_iris

# Fit a Random Forest and inspect each feature's contribution
X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=10, random_state=42).fit(X, y)
print("Feature Importances:", clf.feature_importances_)