MS Thesis: Latent Patient State Compression & Deep Imputation via Selective SSMs

1. Abstract

Healthcare business intelligence and enterprise hospital performance platforms frequently operate under severe data ingestion constraints. Unlike intensive care units with streaming telemetry, enterprise clinical data engines typically receive Electronic Health Record (EHR) drops as discrete monthly or quarterly batch snapshots. This introduces systematic temporal misalignment, irregular sampling, and severe multi-scale sparsity across long-range patient records.

To extract prognostic and economic value from these asynchronous records, this thesis introduces an offline, non-causal machine learning framework built on Selective State Space Models (SSMs). By mapping an irregular, continuous-time formulation onto a bidirectional, block-parallel architecture, the framework serves two goals at once: it imputes continuous latent physiological trajectories across sparse temporal windows, and it compresses arbitrary-length, multi-year patient histories into a static, low-dimensional **Patient State Portrait Vector** $\mathbf{z}_p \in \mathbb{R}^{2d}$.

2. Operational Paradox & Structural Background

Traditional clinical deep learning pipelines implicitly assume aligned, synchronous, or high-frequency telemetry. In enterprise clinical performance platforms, this assumption breaks down because clinical operations run on transactional batch intervals. When an analytical pipeline receives data snapshots only every 30 or 90 days, a patient's timeline is observed in fragments. A typical cohort profile contains sparse clusters of events: an outpatient encounter on Day 4, a metabolic panel on Day 12, a prescription adjustment on Day 18, followed by weeks with no recorded data at all.

The Classical Imputation Pitfall:

Standard approaches rely on simple heuristics, including **Last Observation Carried Forward (LOCF)**, mean imputation, or regularized spline fitting. Forward-filling and mean imputation in particular treat time as a uniform sequence of slots and ignore how long ago a value was actually measured, which introduces synthetic artifacts. For example, carrying forward a blood creatinine or urea nitrogen value from 45 days prior treats a dynamic biological state as a frozen constant. This destroys the temporal signal required to evaluate clinical degradation or sudden changes in disease trajectory.
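
To make the pitfall concrete, the following minimal sketch carries the last observation forward onto a daily grid and records how stale each carried value is. The function name `locf_with_age` and the sample creatinine readings are illustrative assumptions, not part of the thesis pipeline.

```python
def locf_with_age(observations, grid_days):
    """Carry the last observation forward onto a daily grid.

    observations: list of (day, value) pairs sorted by day.
    Returns a list of (day, value, age_days); value and age are None
    before the first observation.
    """
    filled = []
    idx = -1  # index of the most recent observation at or before `day`
    for day in grid_days:
        while idx + 1 < len(observations) and observations[idx + 1][0] <= day:
            idx += 1
        if idx < 0:
            filled.append((day, None, None))
        else:
            obs_day, value = observations[idx]
            filled.append((day, value, day - obs_day))
    return filled


# Hypothetical serum creatinine readings (mg/dL) on days 12 and 57.
creatinine = [(12, 0.9), (57, 2.1)]
grid = list(range(10, 60))
filled = locf_with_age(creatinine, grid)

# On day 56 the grid still reports the day-12 value, now 44 days old.
day_56 = filled[grid.index(56)]
print(day_56)  # (56, 0.9, 44)
```

A downstream model that sees only the filled value cannot distinguish the fresh day-12 reading from its 44-day-old copy, which is the information a gap-aware model needs to retain.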

Recurrent Neural Networks (RNNs), and to a lesser extent Long Short-Term Memory (LSTM) networks, suffer from vanishing gradients when modeling records that span multiple years. Conversely, multi-head attention Transformers scale quadratically, $O(L^2)$, in compute and memory with sequence length $L$. This makes them expensive to apply to multi-year histories, particularly across cohorts of tens of thousands of patients.

By pivoting to **Selective State Space Models**, we obtain linear-time, $O(L)$, sequence processing that retains long-range historical memory. This provides a practical framework for reconstructing historical clinical pathways whenever a new batch drop arrives.

3. Mathematical Architecture Foundations

We define a patient's long-range historical timeline as a collection of asynchronous observations. Let the input sequence be defined as $S = \{(t_k, \mathbf{x}_k)\}_{k=1}^L$, where $t_k \in \mathbb{R}^+$ represents an absolute monotonically increasing timestamp marking a record edit, and $\mathbf{x}_k \in \mathbb{R}^M$ represents a mixed feature vector containing current labs, vitals, and diagnostic embeddings.

3.1 Time-Varying Continuous-Time Parameterization

To handle irregular intervals natively, we ground the underlying sequence model within a continuous-time linear system. The hidden state $\mathbf{h}(t) \in \mathbb{R}^d$ evolves based on a continuous vector input $\mathbf{x}(t) \in \mathbb{R}^M$ governed by the following core differential system:

$$\frac{d}{dt}\mathbf{h}(t) = \mathbf{A}(t)\mathbf{h}(t) + \mathbf{B}(t)\mathbf{x}(t), \quad \mathbf{y}(t) = \mathbf{C}(t)\mathbf{h}(t)$$

where $\mathbf{A} \in \mathbb{R}^{d \times d}$ is initialized from a HiPPO matrix to enable stable long-range history tracking. In the discretization below, $\mathbf{A}$ is treated as constant over time, while $\mathbf{B}$ and $\mathbf{C}$ vary with the input (Section 3.2). To apply this system to discrete, irregularly spaced records, we use a zero-order hold (ZOH) discretization whose step size $\Delta_k = t_k - t_{k-1}$ is the gap between consecutive observations:

$$\mathbf{\overline{A}}_k = \exp(\Delta_k \mathbf{A})$$ $$\mathbf{\overline{B}}_k = (\Delta_k \mathbf{A})^{-1}(\exp(\Delta_k \mathbf{A}) - \mathbf{I}) \cdot \Delta_k \mathbf{B}_k$$
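
As a worked illustration of the ZOH step, the sketch below evaluates $\mathbf{\overline{A}}_k$ and $\mathbf{\overline{B}}_k$ for a diagonal $\mathbf{A}$, where the matrix exponential and inverse reduce to per-dimension scalars. The diagonal restriction, the function name `zoh_discretize`, and the decay rates are assumptions made here for simplicity.

```python
import math


def zoh_discretize(a_diag, b_vec, delta):
    """Zero-order hold for a diagonal state matrix.

    a_diag: diagonal entries of A (assumed nonzero, typically negative).
    b_vec:  one input-weight entry per state dimension.
    delta:  elapsed time since the previous observation.
    Returns (a_bar, b_bar) as lists of per-dimension scalars.
    """
    a_bar = [math.exp(delta * a) for a in a_diag]
    # (delta*a)^-1 * (exp(delta*a) - 1) * delta * b simplifies to
    # expm1(delta*a) / a * b; expm1 stays accurate when delta*a is tiny.
    b_bar = [math.expm1(delta * a) / a * b for a, b in zip(a_diag, b_vec)]
    return a_bar, b_bar


a_diag = [-0.5, -0.05]  # hypothetical fast and slow decay rates (per day)
b_vec = [1.0, 1.0]

for gap_days in (1.0, 6.0, 45.0):
    a_bar, b_bar = zoh_discretize(a_diag, b_vec, gap_days)
    print(gap_days, [round(v, 4) for v in a_bar], [round(v, 4) for v in b_bar])
```

Longer gaps shrink $\mathbf{\overline{A}}_k$ toward zero, so the previous state is forgotten faster, and the slow dimension retains more history than the fast one across the same gap.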

3.2 The Core Selection Mechanism

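Unlike classical Linear Time-Invariant (LTI) state space networks, our architecture uses data-dependent selection operators. This allows the model to adjust its parameters based on incoming information. Let the matrix parameters $\mathbf{B}_k$, $\mathbf{C}_k$, and the step size $\Delta_k$ be direct projections of the current input $\mathbf{x}_k$. The learned $\Delta_k$ generalizes the raw gap $t_k - t_{k-1}$ of Section 3.1: in the implementation of Section 5, the observed gap is appended to the input as an extra feature, so the projection can still account for elapsed time: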

$$\mathbf{B}_k = \text{Linear}_B(\mathbf{x}_k), \quad \mathbf{C}_k = \text{Linear}_C(\mathbf{x}_k)$$ $$\Delta_k = \ln(1 + \exp(\text{Linear}_\Delta(\mathbf{x}_k) + \tau))$$

This design lets the model filter incoming data. If the entry at $t_k$ is an irrelevant administrative update or a redundant billing code, the network can drive $\Delta_k \to 0$. This forces $\mathbf{\overline{A}}_k \to \mathbf{I}$ and $\mathbf{\overline{B}}_k \to \mathbf{0}$, so $\mathbf{h}_k \approx \mathbf{h}_{k-1}$: the patient state passes through that timeline node unchanged and the uninformative input is ignored.
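
The gating behavior can be checked numerically with a scalar state. In this minimal sketch, `delta_logit` stands in for the output of $\text{Linear}_\Delta(\mathbf{x}_k)$; the function names, the decay rate `a`, and the example logits are hypothetical.

```python
import math


def softplus(z):
    """Numerically stable ln(1 + exp(z))."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_step(h_prev, x, delta_logit, a=-0.5, b=1.0, tau=0.0):
    """One scalar selective-SSM update with a data-dependent step size."""
    delta = softplus(delta_logit + tau)
    a_bar = math.exp(delta * a)            # tends to 1 as delta tends to 0
    b_bar = math.expm1(delta * a) / a * b  # tends to 0 as delta tends to 0
    return a_bar * h_prev + b_bar * x, delta


h = 1.0  # current latent patient state

# A strongly negative logit (e.g. a redundant billing code) gives delta near 0,
# so even a large input value leaves the state essentially untouched.
h_skip, delta_skip = selective_step(h, x=100.0, delta_logit=-20.0)

# A positive logit (e.g. an informative lab result) opens the gate.
h_update, delta_update = selective_step(h, x=100.0, delta_logit=2.0)

print(delta_skip, h_skip)
print(delta_update, h_update)
```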

3.3 Non-Causal Bidirectional Synthesis

Because our data processing runs entirely offline on stable batch updates, the network can process information in both temporal directions simultaneously. For each sequence, we evaluate a forward pass to capture historical context, and a backward pass to integrate future timeline trends:

$$\vec{\mathbf{h}}_k = \vec{\mathbf{\overline{A}}}_k \vec{\mathbf{h}}_{k-1} + \vec{\mathbf{\overline{B}}}_k \mathbf{x}_k$$ $$\overleftarrow{\mathbf{h}}_k = \overleftarrow{\mathbf{\overline{A}}}_k \overleftarrow{\mathbf{h}}_{k+1} + \overleftarrow{\mathbf{\overline{B}}}_k \mathbf{x}_k$$

At each index $k$, these two vectors are concatenated into a bidirectional hidden representation: $\mathbf{M}_k = [\vec{\mathbf{h}}_k \parallel \overleftarrow{\mathbf{h}}_k] \in \mathbb{R}^{2d}$. This fused vector passes through a linear decoder head that produces continuous feature reconstructions, filling in unobserved gaps across the timeline.
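
The two recurrences can be sketched with scalar states as follows. For brevity this example shares one set of discretized parameters between the two directions, whereas the formulation above keeps separate forward and backward parameters; the function names are illustrative.

```python
def scan(a_bars, b_bars, xs):
    """Run h_k = a_bar_k * h_{k-1} + b_bar_k * x_k from a zero initial state."""
    h, states = 0.0, []
    for a_bar, b_bar, x in zip(a_bars, b_bars, xs):
        h = a_bar * h + b_bar * x
        states.append(h)
    return states


def bidirectional_states(a_bars, b_bars, xs):
    """Return [(h_fwd_k, h_bwd_k)] for every position k."""
    fwd = scan(a_bars, b_bars, xs)
    # The backward pass is a forward scan over the reversed sequence,
    # flipped back so index k lines up with the forward pass.
    bwd = scan(a_bars[::-1], b_bars[::-1], xs[::-1])[::-1]
    return list(zip(fwd, bwd))


a_bars = [0.5, 0.5, 0.5, 0.5]
b_bars = [1.0, 1.0, 1.0, 1.0]

states = bidirectional_states(a_bars, b_bars, [1.0, 0.0, 0.0, 2.0])
print(states)  # [(1.0, 1.25), (0.5, 0.5), (0.25, 1.0), (2.125, 2.0)]

# Changing only the LAST input alters the backward state at position 0
# but leaves the forward state at position 0 untouched.
changed = bidirectional_states(a_bars, b_bars, [1.0, 0.0, 0.0, 5.0])
print(changed[0])  # (1.0, 1.625)
```

At position $k$ the forward half summarizes inputs $1..k$ and the backward half summarizes inputs $k..L$, which is what lets the decoder draw on later observations when reconstructing an earlier gap.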

4. Source Data Engineering & MIMIC-IV Grounding

To test this architecture under realistic operational constraints, we use the publicly available, credentialed-access **MIMIC-IV (v2.2)** database. This allows us to reconstruct historical timelines across long-term intervals, mimicking the behavior of quarterly batch data drops.

The extraction pipeline focuses on identifying cohorts with highly variable, non-uniform clinical touches. We target three core tables:

mimiciv_hosp.admissions: Extracts demographic anchors and absolute admission/discharge timelines.
mimiciv_hosp.labevents: Extracts dynamic numerical measurements (e.g., Serum Creatinine, Blood Urea Nitrogen, Bicarbonate) that frequently exhibit irregular sampling patterns.
mimiciv_hosp.prescriptions: Pulls therapeutic medication events, which are converted into temporary feature flags.

mimic_cohort_ingestion.sql BigQuery Standard

WITH RawEvents AS (
    SELECT 
        le.subject_id, 
        le.charttime AS event_time,
        CASE 
            WHEN le.itemid = 50912 THEN 'CREATININE'
            WHEN le.itemid = 51006 THEN 'BUN'
            WHEN le.itemid = 50882 THEN 'BICARBONATE'
        END AS feature_name,
        le.valuenum AS feature_value
    FROM `physionet-data.mimiciv_hosp.labevents` le
    WHERE le.itemid IN (50912, 51006, 50882)
      AND le.valuenum IS NOT NULL

    UNION ALL

    SELECT 
        pr.subject_id, 
        pr.starttime AS event_time,
        'MED_' || REGEXP_REPLACE(UPPER(pr.drug), r'[^A-Z0-9]', '_') AS feature_name,
        1.0 AS feature_value
    FROM `physionet-data.mimiciv_hosp.prescriptions` pr
    WHERE pr.drug IS NOT NULL
      AND pr.starttime IS NOT NULL  -- events without a timestamp cannot be ordered
),
OrderedTimeline AS (
    SELECT 
        subject_id,
        event_time,
        feature_name,
        feature_value,
        -- feature_name breaks ties so simultaneous events get a stable order
        ROW_NUMBER() OVER (PARTITION BY subject_id ORDER BY event_time, feature_name) AS seq_idx
    FROM RawEvents
),
DeltaCalculation AS (
    SELECT 
        curr.subject_id,
        curr.event_time,
        curr.feature_name,
        curr.feature_value,
        curr.seq_idx,
        COALESCE(
            -- charttime and starttime are DATETIME columns, so DATETIME_DIFF applies
            DATETIME_DIFF(curr.event_time, prev.event_time, MINUTE),
            0
        ) AS delta_minutes
    FROM OrderedTimeline curr
    LEFT JOIN OrderedTimeline prev 
      ON curr.subject_id = prev.subject_id 
     AND curr.seq_idx = prev.seq_idx + 1
)
SELECT 
    subject_id,
    event_time,
    delta_minutes,
    feature_name,
    feature_value
FROM DeltaCalculation
ORDER BY subject_id, seq_idx;
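
The query returns one row per event in long format, while the model in Section 5 expects aligned per-patient feature and delta sequences. The sketch below shows one possible bridging step for a single patient. The function name `rows_to_sequence`, the observation mask, and the sample rows are assumptions introduced for illustration and are not part of the original pipeline.

```python
def rows_to_sequence(rows, feature_names):
    """Collapse long-format event rows for ONE patient into aligned steps.

    rows: iterable of (event_time, delta_minutes, feature_name, feature_value),
          already ordered as in the query output.
    feature_names: fixed feature vocabulary defining the column order.
    Returns (features, mask, deltas); unobserved entries are 0.0 with mask 0.
    """
    col = {name: i for i, name in enumerate(feature_names)}
    features, mask, deltas = [], [], []
    last_time = object()  # sentinel that never equals a real timestamp
    for event_time, delta_minutes, name, value in rows:
        if event_time != last_time:
            # A new timestamp opens a new step; its first row carries the gap.
            features.append([0.0] * len(col))
            mask.append([0] * len(col))
            deltas.append(float(delta_minutes))
            last_time = event_time
        if name in col:
            features[-1][col[name]] = float(value)
            mask[-1][col[name]] = 1
    return features, mask, deltas


# Hypothetical rows: two labs drawn together, then one lab 11,610 minutes later.
rows = [
    ("2180-01-04 08:00", 0, "CREATININE", 0.9),
    ("2180-01-04 08:00", 0, "BUN", 14.0),
    ("2180-01-12 09:30", 11610, "CREATININE", 1.4),
]
features, mask, deltas = rows_to_sequence(rows, ["CREATININE", "BUN", "BICARBONATE"])
print(features)  # [[0.9, 14.0, 0.0], [1.4, 0.0, 0.0]]
print(mask)      # [[1, 1, 0], [1, 0, 0]]
print(deltas)    # [0.0, 11610.0]
```

Keeping the mask separate from the values preserves the distinction between a measured zero and a missing entry, which a reconstruction objective would need in order to score only observed positions.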

5. High-Performance Model PyTorch Specification

This section specifies our non-causal state space compressor in PyTorch. It processes each patient timeline in both directions and returns imputed feature trajectories alongside downstream risk estimates.

ssm_bi_compressor.py v1.4.2

import torch
import torch.nn as nn
from mamba_ssm import Mamba

class ClinicalStateCompressor(nn.Module):
    """
    Bidirectional Selective SSM for non-causal longitudinal trajectory
    compression and value imputation on asynchronous EHR batch data drops.
    """
    def __init__(self, num_features, d_model=256, d_state=32):
        super().__init__()
        
        # Linear layer combining raw sparse metrics with empirical delta trackers
        self.embedding_layer = nn.Linear(num_features + 1, d_model)
        
        # Dual-Directional Mamba operators for non-causal sequence analysis
        self.forward_mamba = Mamba(
            d_model=d_model, 
            d_state=d_state, 
            d_conv=4, 
            expand=2
        )
        self.backward_mamba = Mamba(
            d_model=d_model, 
            d_state=d_state, 
            d_conv=4, 
            expand=2
        )
        
        # Imputation Decoder: Maps bi-directional hidden layers to input feature dimensions
        self.imputation_decoder = nn.Linear(d_model * 2, num_features)
        
        # Risk Predictor Classifier: Estimates out-of-sample patient risks for the upcoming quarter
        self.risk_classifier = nn.Sequential(
            nn.Linear(d_model * 2, 128),
            nn.LayerNorm(128),
            nn.GELU(),
            nn.Dropout(0.3),
            nn.Linear(128, 15),
            nn.Sigmoid()
        )

    def forward(self, features, deltas):
        """
        Args:
            features (Tensor): [Batch Size, Sequence Length, Number of Clinical Features]
            deltas (Tensor):   [Batch Size, Sequence Length, 1] (Time differences)
        """
        # Combine clinical measurements and time step differences into a single input representation
        x = torch.cat([features, deltas], dim=-1)
        x_prime = self.embedding_layer(x)
        
        # Execute forward pass over historical context
        h_fwd = self.forward_mamba(x_prime)
        
        # Reverse chronological timeline order for backward pass processing
        x_prime_reversed = torch.flip(x_prime, dims=[1])
        h_bwd_raw = self.backward_mamba(x_prime_reversed)
        h_bwd = torch.flip(h_bwd_raw, dims=[1])
        
        # Construct bidirectional hidden representation matrix
        m_k = torch.cat([h_fwd, h_bwd], dim=-1)
        
        # Reconstruct missing clinical values across the sequence timeline
        imputed_trajectory = self.imputation_decoder(m_k)
        
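        # Build the fixed-dimensional patient portrait from the two states that have
        # each seen the whole sequence: the forward state at the last step and the
        # backward state at the first step (assumes sequences are not right-padded).
        z_p = torch.cat([h_fwd[:, -1, :], h_bwd[:, 0, :]], dim=-1)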
        
        # Generate downstream operational and clinical risk classifications
        predicted_risk_profile = self.risk_classifier(z_p)
        
        return imputed_trajectory, predicted_risk_profile, z_p

6. Empirical Evaluation & Rigorous Baseline Matrix

To evaluate how effectively the **Patient State Portrait Vector ($\mathbf{z}_p$)** captures long-term historical records, we benchmark our architecture against traditional clinical sequence models.

The framework is evaluated on its ability to forecast critical clinical events during the data-blind 90-day window following each batch drop. The target tasks are predicting shifts in Acute Kidney Injury (AKI) severity stage under the KDIGO clinical criteria and forecasting unplanned 30-day readmissions.
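
For reference, the AKI severity target can be derived from serum creatinine along the following lines. This is a simplified sketch of the creatinine arm of the KDIGO staging only: it ignores urine output and the 48-hour and 7-day timing windows attached to the full criteria, and the function names and the choice of baseline are assumptions rather than the labeling code used in this evaluation.

```python
def kdigo_creatinine_stage(baseline, current, on_rrt=False):
    """Creatinine-only KDIGO AKI stage (0 = no AKI), values in mg/dL."""
    if baseline <= 0:
        raise ValueError("baseline creatinine must be positive")
    ratio = current / baseline
    meets_aki = ratio >= 1.5 or current - baseline >= 0.3
    if on_rrt:
        return 3  # initiation of renal replacement therapy
    if not meets_aki:
        return 0
    if ratio >= 3.0 or current >= 4.0:
        return 3
    if ratio >= 2.0:
        return 2
    return 1


def stage_shift(baseline, before, after):
    """Signed change in stage across a forecast window."""
    return (kdigo_creatinine_stage(baseline, after)
            - kdigo_creatinine_stage(baseline, before))


print(kdigo_creatinine_stage(1.0, 1.6))  # 1
print(kdigo_creatinine_stage(1.0, 2.5))  # 2
print(stage_shift(1.0, 1.0, 2.5))        # 2
```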

| Baseline Strategy | Sparsity Handling Mechanism | Computational Scaling | Batch Drop Suitability |
| --- | --- | --- | --- |
| LSTM + LOCF Heuristics | Constant zero-order forward propagation. No intrinsic interval understanding. | $O(L)$ | Poor. Struggles with vanishing gradients across multi-year historical patient records. |
| Multi-Time Attention Network (mTAN) | Learned continuous time-embedding functions mapped to query vectors. | $O(L^2)$ | Moderate. Delivers high accuracy but scales poorly due to extreme quadratic memory bottlenecks. |
| Bidirectional Selective SSM (Proposed) | Continuous-time linear differential parameterization via data-dependent delta adjustment. | $O(L)$ | Optimal. Compresses multi-year, irregularly sampled histories efficiently whenever batch databases are updated. |

DEPARTMENT OF COMPUTER SCIENCE & DATA SCIENCE INFORMATICS

Resolving Multi-Scale Temporal Sparsity and Irregular Alignment in Interval-Based EHR Registries via Non-Causal Block-Parallel Selective State Space Models