Persistence Images in Classification

This notebook shows how you can use persistent homology and persistence images to classify datasets. We construct datasets from two classes: one is pure noise, and the other is noise with a large circle in the middle. We then compute persistence diagrams with Ripser and convert them to persistence images with Persim. Using these persistence images, we build a logistic regression model with a LASSO penalty to classify whether or not a dataset contains a circle. Using only default values, classification has a mean accuracy greater than 90%.

import numpy as np
from sklearn import datasets
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

import matplotlib.pyplot as plt

from ripser import Rips
from persim import PersImage

Construct data

Generate N datasets that are just noise and N that are noise with a circle.

N = 200
N_per_class = int(N / 2)
N_in_class = 400

def noise(N, scale):
    return scale * np.random.random((N, 2))

def circle(N, scale, offset):
    return offset + scale * datasets.make_circles(n_samples=N, factor=0.4, noise=0.05)[0]

just_noise = [noise(N_in_class, 150) for _ in range(N_per_class)]

half = int(N_in_class / 2)
with_circle = [np.concatenate((circle(half, 50, 70), noise(half, 150)))
               for _ in range(N_per_class)]
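As a sanity check, each generated point cloud should be an `(N_in_class, 2)` array. A self-contained sketch, re-creating the two generators exactly as defined above:

```python
import numpy as np
from sklearn import datasets

def noise(N, scale):
    # Uniform noise in a scale-by-scale square.
    return scale * np.random.random((N, 2))

def circle(N, scale, offset):
    # A noisy circle, rescaled and shifted away from the origin.
    return offset + scale * datasets.make_circles(n_samples=N, factor=0.4, noise=0.05)[0]

# One "circle" dataset: half circle points, half background noise.
cloud = np.concatenate((circle(200, 50, 70), noise(200, 150)))
print(cloud.shape)  # (400, 2)
```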

# Combine the two classes into one list of datasets
datas = np.concatenate((just_noise, with_circle))

# Define labels
labels = np.zeros(N)
labels[N_per_class:] = 1
# Visualize the data
fig, axs = plt.subplots(1, 2)

xs, ys = just_noise[0][:,0], just_noise[0][:,1]
axs[0].scatter(xs, ys)
axs[0].set_title("Example noise dataset")
axs[0].set_aspect('equal', 'box')

xs_, ys_ = with_circle[0][:,0], with_circle[0][:,1]
axs[1].scatter(xs_, ys_)
axs[1].set_title("Example noise with circle dataset")
axs[1].set_aspect('equal', 'box')


Compute homology of each dataset

Generate the persistence diagram of \(H_1\) for each of the datasets generated above.

rips = Rips(maxdim=1, coeff=2)
diagrams = [rips.fit_transform(data) for data in datas]
diagrams_h1 = [diagram[1] for diagram in diagrams]

plt.figure()
rips.plot(diagrams_h1[0], show=False)
plt.title("PD of $H_1$ for just noise")
plt.show()

plt.figure()
rips.plot(diagrams_h1[-1], show=False)
plt.title("PD of $H_1$ for circle w/ noise")
plt.show()

Compute persistence images

Convert each persistence diagram into a persistence image, then flatten each image into a vector so it can be used as a feature vector for classification.

pim = PersImage(pixels=[20,20], spread=1)
imgs = pim.transform(diagrams_h1)
imgs_array = np.array([img.flatten() for img in imgs])

plt.figure(figsize=(10, 5))
for i in range(4):
    ax = plt.subplot(240 + i + 1)
    pim.show(imgs[i], ax)
    plt.title("PI of $H_1$ for noise")

for i in range(4):
    ax = plt.subplot(240 + i + 5)
    pim.show(imgs[-(i + 1)], ax)
    plt.title("PI of $H_1$ for circle w/ noise")
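Independent of Persim, the flattening step itself can be sketched with plain NumPy. The image array here is made up for illustration; the point is the resulting `(n_samples, n_features)` shape that scikit-learn expects:

```python
import numpy as np

# Stand-ins for 4 persistence images, each a 20x20 grid of pixel intensities.
fake_imgs = [np.random.random((20, 20)) for _ in range(4)]

# Flattening turns each 20x20 image into a 400-dimensional feature vector,
# stacking them into one (n_samples, n_features) design matrix.
feature_matrix = np.array([img.flatten() for img in fake_imgs])
print(feature_matrix.shape)  # (4, 400)
```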

Classify the datasets from the persistence images

X_train, X_test, y_train, y_test = train_test_split(imgs_array, labels, test_size=0.40, random_state=42)
lr = LogisticRegression(penalty='l1', solver='liblinear')
lr.fit(X_train, y_train)
lr.score(X_test, y_test)
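The single train/test split above yields one accuracy number; the "mean accuracy" claim in the introduction is better supported by cross-validation. A minimal sketch of that pattern, using synthetic, cleanly separated features rather than the persistence images (the data here is made up):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.RandomState(42)
# Two synthetic classes separated by a shift in every feature.
X0 = rng.normal(loc=-2.0, size=(100, 10))
X1 = rng.normal(loc=2.0, size=(100, 10))
X = np.vstack([X0, X1])
y = np.array([0] * 100 + [1] * 100)

# Mean accuracy over 5 stratified folds, with the same L1-penalized model.
scores = cross_val_score(
    LogisticRegression(penalty='l1', solver='liblinear'), X, y, cv=5)
print(scores.mean())
```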

So these images are perfectly discriminative? That seems almost too good to be true. Let's see which features are important.

Inverse analysis on LASSO

Visualizing the regression coefficients as a persistence image shows us which features of the images are most important for classification.

inverse_image = np.copy(lr.coef_).reshape((20, 20))
plt.imshow(inverse_image)
plt.show()
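The reason this inverse image is interpretable is that the LASSO (L1) penalty drives most coefficients exactly to zero, leaving only the pixels that matter for classification. A minimal illustration of that sparsity on synthetic data (not the persistence images; the feature counts here are arbitrary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.RandomState(0)
# 50 features, but only the first one actually carries the class signal.
X = rng.normal(size=(200, 50))
y = (X[:, 0] > 0).astype(int)

# A fairly strong L1 penalty (small C) zeroes out uninformative features.
lr_l1 = LogisticRegression(penalty='l1', C=0.1, solver='liblinear')
lr_l1.fit(X, y)

# Count how many coefficients the L1 penalty set exactly to zero.
n_zero = int(np.sum(lr_l1.coef_ == 0))
print(n_zero)
```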