Scikit-learn, often abbreviated as "sklearn," is a popular open-source machine learning library for Python. It provides a wide range of tools and algorithms for various machine learning tasks, including classification, regression, clustering, dimensionality reduction, model selection, and more. Scikit-learn is built on top of other scientific libraries like NumPy, SciPy, and Matplotlib, making it a powerful and versatile choice for machine learning in Python.
Here are some commonly used scikit-learn functions and algorithms, along with short code examples:
Loading and Preprocessing Data:
Scikit-learn provides functions to load and preprocess datasets. In this example, we'll use the Iris dataset:
from sklearn import datasets
# Load the Iris dataset
iris = datasets.load_iris()
X = iris.data # Features
y = iris.target # Labels
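As a quick sanity check (not part of the original example), you can inspect the shapes and the names that load_iris exposes:
# The Iris dataset has 150 samples, 4 features, and 3 classes
print(X.shape)              # (150, 4)
print(iris.feature_names)   # names of the four measurements
print(iris.target_names)    # names of the three species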
Splitting Data:
You can split your data into training and testing sets using train_test_split:
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
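With test_size=0.2 on the 150-sample Iris dataset, this leaves 120 samples for training and 30 for testing, which you can confirm directly:
print(X_train.shape, X_test.shape)  # (120, 4) (30, 4)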
Classification with Support Vector Machines (SVM):
Scikit-learn provides various classification algorithms. Here's an example using Support Vector Machines (SVM):
from sklearn.svm import SVC
# Create an SVM classifier
clf = SVC()
# Fit the classifier on the training data
clf.fit(X_train, y_train)
# Make predictions
y_pred = clf.predict(X_test)
Model Evaluation:
You can evaluate the model using metrics like accuracy, precision, recall, and F1-score:
from sklearn.metrics import accuracy_score, classification_report
accuracy = accuracy_score(y_test, y_pred)
report = classification_report(y_test, y_pred)
print(f"Accuracy: {accuracy}")
print("Classification Report:\n", report)
Dimensionality Reduction with Principal Component Analysis (PCA):
Scikit-learn supports dimensionality reduction techniques like PCA:
from sklearn.decomposition import PCA
# Create a PCA object and fit it to the data
pca = PCA(n_components=2)
X_reduced = pca.fit_transform(X)
# X_reduced contains the reduced-dimension data
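A common follow-up, sketched here as an optional step, is to check how much of the original variance the retained components capture using explained_variance_ratio_:
# Fraction of variance captured by each of the two components
print(pca.explained_variance_ratio_)
print(pca.explained_variance_ratio_.sum())  # total variance retained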
Clustering with K-Means:
You can perform clustering using algorithms like K-Means:
from sklearn.cluster import KMeans
# Create a K-Means clustering model
kmeans = KMeans(n_clusters=3, n_init=10, random_state=42)
# Fit the model to the data
kmeans.fit(X)
# Cluster labels for each data point
labels = kmeans.labels_
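You can also inspect the learned centroids or assign clusters to other points. In the sketch below, X_test from the earlier split is reused purely as a stand-in for new observations:
# Coordinates of the three cluster centers
centers = kmeans.cluster_centers_
# Assign each point to its nearest cluster center
new_labels = kmeans.predict(X_test)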
Cross-Validation:
Scikit-learn supports cross-validation for model evaluation:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(clf, X, y, cv=5) # 5-fold cross-validation
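The returned scores form a NumPy array with one entry per fold; a typical way to summarize them is by their mean and standard deviation:
print(scores)                       # one accuracy score per fold
print(scores.mean(), scores.std())  # average performance and its spread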
Hyperparameter Tuning with Grid Search:
Grid search helps you find the best hyperparameters for your model:
from sklearn.model_selection import GridSearchCV
param_grid = {'C': [1, 10, 100], 'kernel': ['linear', 'rbf']}
grid_search = GridSearchCV(SVC(), param_grid, cv=5)
grid_search.fit(X_train, y_train)
best_params = grid_search.best_params_
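Besides best_params_, GridSearchCV also exposes the best cross-validated score and a refit estimator that you can use directly; the lines below are a small illustrative continuation of the example:
print(best_params)
print(grid_search.best_score_)  # best mean cross-validation score
# The best model, refit on the full training set, can predict directly
y_pred_best = grid_search.best_estimator_.predict(X_test)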
Feature Scaling:
Scikit-learn provides utilities for feature scaling, such as MinMaxScaler, which rescales each feature to a fixed range (by default [0, 1]), and StandardScaler, which standardizes each feature to zero mean and unit variance.
from sklearn.preprocessing import MinMaxScaler
scaler = MinMaxScaler()
X_scaled = scaler.fit_transform(X)
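StandardScaler, mentioned above, works the same way. As a rough guideline (shown here as a sketch rather than part of the original example), the scaler is fit on the training split only so the test split does not leak into the scaling statistics:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
# Learn the mean and standard deviation from the training data only
X_train_scaled = scaler.fit_transform(X_train)
# Apply the same transformation to the test data
X_test_scaled = scaler.transform(X_test)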
Model Persistence:
You can save trained models for future use:
import joblib
joblib.dump(clf, 'svm_model.pkl')
loaded_model = joblib.load('svm_model.pkl')
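The loaded model behaves like the original fitted estimator, so you can, for example, reuse it for predictions:
# Predictions from the reloaded model match those of the original classifier
y_pred_loaded = loaded_model.predict(X_test)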
Scikit-learn is a powerful library that simplifies many aspects of machine learning in Python. It offers consistent APIs and a wide range of algorithms, making it a valuable tool for both beginners and experienced machine learning practitioners. The examples provided here cover just a subset of scikit-learn's capabilities, and you can explore more advanced techniques and models as needed for your specific machine learning tasks.