Learn to Publish Your Python Package in Simple Ways

Learn How to Turn Your ML Project into a Reusable Python Library

Cornellius Yudha Wijaya
Mar 12, 2025

As a data scientist working in Python, you have almost certainly installed machine learning packages for your data work. But have you ever thought about creating and publishing your own package, especially if you have a reusable pipeline you want to share with the community?

Building and publishing your package has many advantages, including consistency across coding projects, easy collaboration with the community, and improved credibility within the field.

With those benefits in mind, this article will teach you how to build and publish your own package in a few easy steps.

Curious about it? Let’s get into it.


Build and Publish Your Python ML Packages

In this article, we will build a reusable machine learning classification pipeline that can be reused simply by changing its parameters, instead of rebuilding the pipeline from scratch every time.

The pipeline will be distributed as a package called selectml. You can also find the complete code in the selectml repository.

This article assumes you are already comfortable with Python and know how to build a simple machine learning model. We will not explore machine learning itself; we will focus on building and publishing the ML package.

Let’s start our project by preparing the structure.

Step 1: Prepare Your Project Structure

The first step is to prepare the project structure. For the selectml package, we will use the layout below. Also, don’t forget to create a virtual environment to isolate the project from your main environment.

selectml/
├── selectml/
│   ├── __init__.py
│   ├── preprocessing.py
│   ├── models.py
│   └── pipeline.py
├── tests/
│   └── test_pipeline.py
├── README.md
├── setup.py
├── LICENSE
└── requirements.txt

Create all the folders and files as structured above, and let’s move on to the next step.
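If you prefer not to create the skeleton by hand, a short Python sketch like the one below can scaffold it for you. This is purely optional and simply mirrors the layout above; run it from the directory where you want the project root to live.

from pathlib import Path

# Files that make up the selectml project, mirroring the layout above
files = [
    "selectml/selectml/__init__.py",
    "selectml/selectml/preprocessing.py",
    "selectml/selectml/models.py",
    "selectml/selectml/pipeline.py",
    "selectml/tests/test_pipeline.py",
    "selectml/README.md",
    "selectml/setup.py",
    "selectml/LICENSE",
    "selectml/requirements.txt",
]

for file in files:
    path = Path(file)
    path.parent.mkdir(parents=True, exist_ok=True)  # create parent folders as needed
    path.touch(exist_ok=True)                       # create the empty file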

Step 2: Develop The Library Code

In this step, we will develop all the machine learning pipeline code required for the selectml library. We will divide it into three parts: data preprocessing, the models, and the pipeline.

First, let’s start with preprocessing. We will use the following code inside the preprocessing.py file.

import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer

class DataPreprocessor:
    def __init__(self):
        self.preprocessor = None

    def fit_transform(self, df, numerical_features, categorical_features):
        """Fits transformers on numerical and categorical features and transforms the data."""
        numeric_transformer = StandardScaler()
        categorical_transformer = OneHotEncoder(handle_unknown='ignore')
       
        self.preprocessor = ColumnTransformer(
            transformers=[
                ('num', numeric_transformer, numerical_features),
                ('cat', categorical_transformer, categorical_features)
            ]
        )
        return self.preprocessor.fit_transform(df)

    def transform(self, df):
        """ Transforms new data using the previously fitted transformer."""
        if self.preprocessor is None:
            raise ValueError("The preprocessor has not been fitted. Call fit_transform() first.")
        return self.preprocessor.transform(df)

The code above defines a class called DataPreprocessor with two methods that transform the numerical and categorical features into a form the machine learning model can accept.
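To see how the class behaves, here is a minimal, hypothetical usage sketch with a toy DataFrame; the column names are made up for illustration, and it assumes the package is importable.

import pandas as pd
from selectml.preprocessing import DataPreprocessor

# Toy data purely for illustration
df = pd.DataFrame({
    'age': [25, 32, 47],
    'city': ['New York', 'Chicago', 'Houston']
})

prep = DataPreprocessor()

# Fit the transformers on the training data and transform it in one step
X_train = prep.fit_transform(df, numerical_features=['age'], categorical_features=['city'])

# Reuse the fitted transformers on new data with the same columns
new_df = pd.DataFrame({'age': [30], 'city': ['Chicago']})
X_new = prep.transform(new_df)

print(X_train.shape, X_new.shape)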

Next, we will prepare the model selector class, in which we only need to pass the string parameter to switch models easily.

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

class ModelSelector:
    def __init__(self, model_type='logistic'):
        self.model_type = model_type
        if model_type == 'logistic':
            self.model = LogisticRegression()
        elif model_type == 'random_forest':
            self.model = RandomForestClassifier(n_estimators=100)
        elif model_type == 'svm':
            self.model = SVC(probability=True)
        else:
            raise ValueError("Unsupported model type. Choose 'logistic', 'random_forest', or 'svm'.")

    def train(self, X, y):
        """ Fits the selected model on the training data."""
        self.model.fit(X, y)
        return self

    def predict(self, X):
        """ Generates predictions from the fitted model."""
        return self.model.predict(X)

    def predict_proba(self, X):
        """ Provides probability estimates if available."""
        if hasattr(self.model, "predict_proba"):
            return self.model.predict_proba(X)
        else:
            raise AttributeError("This model does not support probability estimates.")

The ModelSelector class also comes with three methods, similar to the Scikit-Learn API, which let us train the model and make predictions on the data we pass in.
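As a quick, hypothetical sketch of how the class might be used (the toy arrays below are made up for illustration):

import numpy as np
from selectml.models import ModelSelector

# Toy data purely for illustration
X = np.array([[0.1, 1.0], [0.9, 0.2], [0.4, 0.6], [0.8, 0.1]])
y = np.array([0, 1, 0, 1])

# Swap algorithms by changing the string to 'logistic', 'random_forest', or 'svm'
selector = ModelSelector(model_type='random_forest')
selector.train(X, y)

print(selector.predict(X))
print(selector.predict_proba(X)[:, 1])  # probability of the positive class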

Lastly, we will tie together both modules we have developed previously to create a smooth machine-learning pipeline.

from .preprocessing import DataPreprocessor
from .models import ModelSelector
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report

class ModelSelectionPipeline:
    def __init__(self, model_type='logistic'):
        self.preprocessor = DataPreprocessor()
        self.model_selector = ModelSelector(model_type=model_type)

    def run_pipeline(self, df, target, numerical_features, categorical_features, test_size=0.2, random_state=42):
        """
        Executes the entire pipeline:
         - Preprocesses the data
         - Splits into training and test sets
         - Trains the selected model
         - Evaluates model performance
        """
        X = df.drop(columns=[target])
        y = df[target]
        X_processed = self.preprocessor.fit_transform(X, numerical_features, categorical_features)
       
        X_train, X_test, y_train, y_test = train_test_split(X_processed, y, test_size=test_size, random_state=random_state)
       
        self.model_selector.train(X_train, y_train)
       
        predictions = self.model_selector.predict(X_test)
        acc = accuracy_score(y_test, predictions)
        report = classification_report(y_test, predictions)
       
        return {
            'accuracy': acc,
            'report': report,
            'model': self.model_selector.model
        }

The ModelSelectionPipeline class runs an end-to-end pipeline, from preprocessing the data to evaluating the model’s performance. That’s everything for our core modules. Let’s expose them by adding the following code to the __init__.py file.

from .preprocessing import DataPreprocessor
from .models import ModelSelector
from .pipeline import ModelSelectionPipeline

__all__ = ["DataPreprocessor", "ModelSelector", "ModelSelectionPipeline"]
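With these exports in place, users can import everything from the package root. A quick sanity check, assuming the package is importable (for example, after it has been installed), could look like this:

from selectml import DataPreprocessor, ModelSelector, ModelSelectionPipeline

# Instantiating the pipeline also wires up the preprocessor and model selector
pipeline = ModelSelectionPipeline(model_type='logistic')
print(type(pipeline.preprocessor), type(pipeline.model_selector))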

With all the code ready, let’s add a unit test to verify that it works correctly.

Step 3: Unit Test

A unit test validates the pipeline we created above. It’s a simple test with example data that checks whether the pipeline produces the expected output.

Put the following code in the test_pipeline.py file to do that.

import pandas as pd
from selectml.pipeline import ModelSelectionPipeline

def test_model_selection_pipeline():
    data = {
        'age': [25, 32, 47, 51, 23, 45, 36, 29, 40, 33, 28, 52, 37, 46, 31, 44, 39, 27, 50, 35],
        'income': [50000, 60000, 80000, 90000, 40000, 75000, 65000, 55000, 70000, 62000,
                   48000, 91000, 68000, 77000, 59000, 80000, 72000, 53000, 85000, 66000],
        'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix', 'Philadelphia',
                 'San Antonio', 'San Diego', 'Dallas', 'San Jose', 'Austin', 'Jacksonville',
                 'Fort Worth', 'Columbus', 'Charlotte', 'Indianapolis', 'San Francisco',
                 'Seattle', 'Denver', 'Washington'],
        'purchased': [0, 1, 0, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 1, 1, 0]
    }
    df = pd.DataFrame(data)
   
    numerical_features = ['age', 'income']
    categorical_features = ['city']
   
    pipeline = ModelSelectionPipeline(model_type='random_forest')
    result = pipeline.run_pipeline(df, target='purchased',
                                   numerical_features=numerical_features,
                                   categorical_features=categorical_features)
   
    assert 'accuracy' in result
    assert 'report' in result
    print("Test passed with accuracy:", result['accuracy'])

We can run the test using the command below. You can change the import to point at the local module we created above; however, we will try out the test after publishing our library.

pytest tests/test_pipeline.py

With all the code ready, let’s document our package.
