
Rieth et al. 2017 TEP Dataset

This document describes the Tennessee Eastman Process dataset published by Rieth et al. (2017) for anomaly detection research.

Overview

The Rieth et al. 2017 dataset addresses a critical limitation of previous TEP datasets: they contained only a single simulation per fault type, which can lead to biased evaluation results. This dataset provides 500 independent simulations per fault type using non-overlapping random number generator seeds.

Citation

@inproceedings{rieth2017issues,
  title={Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems},
  author={Rieth, Christoph A. and Amsel, Ben D. and Tran, Randy and Cook, Maia B.},
  booktitle={Advances in Human Factors in Robots and Unmanned Systems},
  series={Advances in Intelligent Systems and Computing},
  volume={595},
  pages={52--63},
  year={2018},
  publisher={Springer},
  doi={10.1007/978-3-319-60384-1_6}
}

Presented at: AHFE 2017 (Applied Human Factors and Ergonomics Conference), July 17-21, 2017, Los Angeles, CA

Sponsor: Office of Naval Research (contract N00014-15-C-5003)

Dataset Access

Harvard Dataverse: https://doi.org/10.7910/DVN/6C3JR1

License: Public Domain (CC0)

Dataset Structure

Files

The dataset consists of 4 RData files:

File Description
fault_free_training.RData Normal operation training data
fault_free_testing.RData Normal operation testing data
faulty_training.RData Fault scenario training data
faulty_testing.RData Fault scenario testing data
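
The original RData files can be read in Python with pyreadr. A minimal sketch follows; it assumes pyreadr is installed and that the R object stored in each file has the same name as the file stem, which may need adjusting:

import pyreadr

# Read one of the original RData files (path is a placeholder)
result = pyreadr.read_r("fault_free_training.RData")

# pyreadr returns a dict of pandas DataFrames keyed by the R object name;
# the object name is assumed here to match the file stem
df = result["fault_free_training"]
print(df.shape)                      # expected: (n_runs * samples_per_run, 55)
print(df.columns[:5].tolist())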

Column Structure (55 columns)

Column Name Values Description
1 faultNumber 0-20 Fault type (0 = normal)
2 simulationRun 1-500 Simulation run with unique seed
3 sample 1-500 (training) or 1-960 (validation/testing) Time sample index
4-44 xmeas_1 to xmeas_41 float 41 measured variables
45-55 xmv_1 to xmv_11 float 11 manipulated variables
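
Given this layout, the metadata and feature columns can be separated directly. A small sketch, assuming the data is loaded as a pandas DataFrame with the column names above:

import pandas as pd

def split_columns(df: pd.DataFrame):
    """Split a 55-column TEP frame into metadata and feature blocks,
    assuming the column names listed above."""
    meta = df[["faultNumber", "simulationRun", "sample"]]
    xmeas = df[[f"xmeas_{i}" for i in range(1, 42)]]   # 41 measured variables
    xmv = df[[f"xmv_{i}" for i in range(1, 12)]]       # 11 manipulated variables
    return meta, xmeas, xmv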

Simulation Parameters

Parameter Training Validation Testing
Duration 25 hours 48 hours 48 hours
Sampling rate 3 minutes 3 minutes 3 minutes
Samples per run 500 960 960
Simulations per fault 500 500 500
Operating mode Mode 1 Mode 1 Mode 1
Fault introduction t=0 t=1 hour t=1 hour

Note: The validation set is an extension to the original Rieth 2017 dataset for machine learning workflows requiring train/validation/test splits. It uses independent random seeds that do not overlap with training or testing data.
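
The samples-per-run values follow directly from the duration and sampling interval (duration in hours × 60 / interval in minutes):

# Samples per run = duration_hours * 60 / sampling_interval_minutes
assert 25 * 60 / 3 == 500   # training
assert 48 * 60 / 3 == 960   # validation and testing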

Fault Types (IDV 1-20)

Step Disturbances (IDV 1-7)

IDV Description
1 A/C feed ratio, B composition constant (Stream 4)
2 B composition, A/C ratio constant (Stream 4)
3 D feed temperature (Stream 2)
4 Reactor cooling water inlet temperature
5 Condenser cooling water inlet temperature
6 A feed loss (Stream 1)
7 C header pressure loss - reduced availability (Stream 4)

Random Variation Disturbances (IDV 8-12)

IDV Description
8 A, B, C feed composition (Stream 4)
9 D feed temperature (Stream 2)
10 C feed temperature (Stream 4)
11 Reactor cooling water inlet temperature
12 Condenser cooling water inlet temperature

Special Disturbances (IDV 13-20)

IDV Type Description
13 Slow drift Reaction kinetics
14 Sticking Reactor cooling water valve
15 Sticking Condenser cooling water valve
16-20 Unknown Intentionally undisclosed

Why This Dataset Matters

  1. Statistical rigor: 500 simulations per fault allow proper ROC curve analysis and statistical significance testing (see the sketch after this list).

  2. Reproducibility: Non-overlapping random seeds ensure independent simulation runs.

  3. Fair benchmarking: Enables meaningful comparison of anomaly detection methods.

  4. Addresses bias: Previous single-simulation datasets gave inconsistent, potentially misleading results.
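
To illustrate the statistical-rigor point, the sketch below computes an AUC distribution over 500 independent runs of one fault type using scikit-learn. The scores and labels here are random placeholders, not detector output; the only dataset-specific assumption is the 960-sample test run with fault onset at 1 hour (sample 20 at 3-minute sampling):

import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
run_aucs = []
for _ in range(500):
    # Placeholder labels/scores for one 960-sample test run; in practice scores
    # come from your detector and labels from the known fault onset (t = 1 h)
    labels = np.zeros(960, dtype=int)
    labels[20:] = 1
    scores = rng.normal(size=960) + labels   # stand-in detector output
    run_aucs.append(roc_auc_score(labels, scores))

# The distribution of AUC across runs supports significance testing
print(f"Mean AUC over {len(run_aucs)} runs: {np.mean(run_aucs):.3f} +/- {np.std(run_aucs):.3f}")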

Process Variables

Measured Variables (XMEAS 1-41)

Index Variable Description Units
1 XMEAS(1) A Feed kscmh
2 XMEAS(2) D Feed kg/hr
3 XMEAS(3) E Feed kg/hr
4 XMEAS(4) A and C Feed kscmh
5 XMEAS(5) Recycle Flow kscmh
6 XMEAS(6) Reactor Feed Rate kscmh
7 XMEAS(7) Reactor Pressure kPa gauge
8 XMEAS(8) Reactor Level %
9 XMEAS(9) Reactor Temperature deg C
10 XMEAS(10) Purge Rate kscmh
11 XMEAS(11) Separator Temperature deg C
12 XMEAS(12) Separator Level %
13 XMEAS(13) Separator Pressure kPa gauge
14 XMEAS(14) Separator Underflow m3/hr
15 XMEAS(15) Stripper Level %
16 XMEAS(16) Stripper Pressure kPa gauge
17 XMEAS(17) Stripper Underflow m3/hr
18 XMEAS(18) Stripper Temperature deg C
19 XMEAS(19) Stripper Steam Flow kg/hr
20 XMEAS(20) Compressor Work kW
21 XMEAS(21) Reactor CW Outlet Temp deg C
22 XMEAS(22) Separator CW Outlet Temp deg C
23-28 XMEAS(23-28) Reactor Feed Composition mol% A-F
29-36 XMEAS(29-36) Purge Gas Composition mol% A-H
37-41 XMEAS(37-41) Product Composition mol% D-H

Manipulated Variables (XMV 1-11)

Index Variable Description
1 XMV(1) D Feed Flow
2 XMV(2) E Feed Flow
3 XMV(3) A Feed Flow
4 XMV(4) A and C Feed Flow
5 XMV(5) Compressor Recycle Valve
6 XMV(6) Purge Valve
7 XMV(7) Separator Pot Liquid Flow
8 XMV(8) Stripper Liquid Product Flow
9 XMV(9) Stripper Steam Valve
10 XMV(10) Reactor Cooling Water Flow
11 XMV(11) Condenser Cooling Water Flow

Generating the Dataset

This repository includes a script to reproduce the Rieth 2017 dataset using the local TEP simulator. All parameters are configurable, with defaults matching the original Rieth 2017 specifications. See examples/rieth2017_dataset.py for the full implementation.

Quick Start

# Generate a small test dataset (5 simulations per fault)
python examples/rieth2017_dataset.py --small

# Generate the full dataset (500 simulations per fault, takes several hours)
python examples/rieth2017_dataset.py --full

# Generate a custom dataset
python examples/rieth2017_dataset.py --n-simulations 100 --faults 1,2,4,6

# Use a preset configuration
python examples/rieth2017_dataset.py --preset quick

# Generate in parallel with 4 workers
python examples/rieth2017_dataset.py --preset quick --workers 4

# Output as CSV instead of NumPy
python examples/rieth2017_dataset.py --preset quick --format csv

Presets

Named presets provide convenient configurations for common use cases:

Preset Simulations Train Test Sampling Description
rieth2017 500 25h 48h 3 min Original paper specifications
quick 5 2h 4h 3 min Fast testing and development
high-res 500 25h 48h 1 min Higher temporal resolution
minimal 2 0.5h 1h 3 min Minimal for unit tests

# List available presets
python examples/rieth2017_dataset.py --list-presets

# Use a preset with overrides
python examples/rieth2017_dataset.py --preset quick --n-simulations 20

Output Formats

Data can be saved in multiple formats:

Format Extension Description
npy .npy NumPy binary format (default, fastest)
csv .csv Comma-separated values with headers
hdf5 .h5 HDF5 with gzip compression (requires h5py)

# Single format
python examples/rieth2017_dataset.py --preset quick --format csv

# Multiple formats
python examples/rieth2017_dataset.py --preset quick --format npy,csv,hdf5
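
Reading the resulting files back is standard in each format. A sketch; the actual file names produced by the script are not specified here, so the paths below are placeholders:

import numpy as np
import pandas as pd
import h5py

arr = np.load("path/to/file.npy")          # NumPy binary
df = pd.read_csv("path/to/file.csv")       # CSV with header row
with h5py.File("path/to/file.h5", "r") as f:
    print(list(f.keys()))                  # dataset names inside the HDF5 file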

Parallel Generation

Use multiple CPU cores to speed up dataset generation:

# Use 4 parallel workers
python examples/rieth2017_dataset.py --preset quick --workers 4

# Use all available CPU cores
python examples/rieth2017_dataset.py --preset quick --workers -1

Column Selection

Select subsets of process variables to reduce dataset size:

Group Variables Count Description
all xmeas_1-41, xmv_1-11 52 All variables (default)
xmeas xmeas_1-41 41 Measured variables only
xmv xmv_1-11 11 Manipulated variables only
key xmeas_7-9,11,12,15,20, xmv_1,10 9 Key process variables
flows xmeas_1-6,10,14,17,19 10 Flow measurements
temperatures xmeas_9,11,18,21,22 5 Temperature measurements
pressures xmeas_7,13,16 3 Pressure measurements
levels xmeas_8,12,15 3 Level measurements
compositions xmeas_23-41 19 Composition measurements

# List available column groups
python examples/rieth2017_dataset.py --list-columns

# Use a column group
python examples/rieth2017_dataset.py --preset quick --columns xmeas

# Select specific columns
python examples/rieth2017_dataset.py --preset quick --columns xmeas_1,xmeas_9,xmv_1

Intermittent Fault Mode

Generate trajectories where faults turn on and off, simulating realistic scenarios where faults occur, get fixed, and new faults appear:

# Generate 10 trajectories with all 20 faults cycling through
python examples/rieth2017_dataset.py --intermittent --n-simulations 10

# Custom timing: 3h fault duration, 1.5h normal between faults
python examples/rieth2017_dataset.py --intermittent --n-simulations 10 \
    --faults 1,4,6,11 \
    --fault-duration 3 \
    --normal-duration 1.5

# Less randomness and keep faults in order
python examples/rieth2017_dataset.py --intermittent \
    --duration-variance 0.2 \
    --no-randomize-order

Parameter CLI Flag Default Description
- --intermittent - Enable intermittent fault mode
avg_fault_duration_hours --fault-duration 4.0 Average hours each fault is active
avg_normal_duration_hours --normal-duration 2.0 Average hours between faults
duration_variance --duration-variance 0.5 Variance factor (0.5 = ±50%)
initial_normal_hours --initial-normal 1.0 Normal operation before first fault
randomize_fault_order --no-randomize-order True Shuffle fault order in each trajectory (pass the flag to disable)

Python API:

from examples.rieth2017_dataset import Rieth2017DatasetGenerator

generator = Rieth2017DatasetGenerator(output_dir="./data/intermittent")

# Generate trajectories with faults 1-5, each fault ~3h on, ~1.5h off
data = generator.generate_intermittent_faults(
    n_simulations=10,
    fault_numbers=[1, 2, 3, 4, 5],
    avg_fault_duration_hours=3.0,
    avg_normal_duration_hours=1.5,
    duration_variance=0.5,        # ±50% randomness
    initial_normal_hours=1.0,     # 1h normal at start
    randomize_fault_order=True,   # Shuffle fault order
)

Output format: The output has the same structure as other datasets (55 columns), but the faultNumber column (the first column) changes over time as faults activate and deactivate (0 = normal operation).
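
Since the first column carries the time-varying fault number, a binary anomaly label can be derived from it directly. A small helper, assuming a NumPy array with the 55-column layout (faultNumber at index 0):

import numpy as np

def binary_labels(data: np.ndarray) -> np.ndarray:
    """Return 1 wherever any fault is active, given the 55-column layout
    described above (first column = faultNumber)."""
    return (data[:, 0].astype(int) != 0).astype(int)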

Overlapping Fault Mode

Generate trajectories where multiple faults can be active simultaneously (up to 2 at a time by default):

# Generate 10 trajectories with overlapping faults
python examples/rieth2017_dataset.py --overlapping --n-simulations 10

# High overlap probability with specific faults
python examples/rieth2017_dataset.py --overlapping --n-simulations 10 \
    --faults 1,4,6,11 \
    --overlap-probability 0.7 \
    --fault-duration 4

# Custom gap and max concurrent faults
python examples/rieth2017_dataset.py --overlapping \
    --gap-hours 0.5 \
    --max-concurrent 2

Parameter CLI Flag Default Description
- --overlapping - Enable overlapping fault mode
overlap_probability --overlap-probability 0.5 Probability next fault starts during previous (50%)
max_concurrent_faults --max-concurrent 2 Maximum faults active simultaneously
avg_gap_hours --gap-hours 1.0 Average gap when faults don't overlap
avg_fault_duration_hours --fault-duration 4.0 Average hours each fault is active
duration_variance --duration-variance 0.5 Variance factor (0.5 = ±50%)

Python API:

from examples.rieth2017_dataset import Rieth2017DatasetGenerator

generator = Rieth2017DatasetGenerator(output_dir="./data/overlapping")

# Generate trajectories with potential fault overlaps
data = generator.generate_overlapping_faults(
    n_simulations=10,
    fault_numbers=[1, 2, 3, 4, 5],
    overlap_probability=0.6,       # 60% chance of overlap
    max_concurrent_faults=2,       # Up to 2 faults at once
    avg_fault_duration_hours=4.0,
    avg_gap_hours=1.0,
)

Output encoding: When multiple faults are active simultaneously, the faultNumber column encodes them as:

  • 0: Normal operation
  • 1-20: Single fault active
  • 101-2020: Two faults active (encoded as fault1*100 + fault2, e.g., faults 1 and 4 = 104)
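
Given that encoding, the individual fault numbers can be recovered arithmetically. A small helper for illustration, not part of the script's API:

def decode_fault_code(code: int) -> tuple[int, int]:
    """Split a faultNumber value into (fault1, fault2) under the
    fault1*100 + fault2 encoding; fault2 is 0 when at most one fault is active."""
    if code < 100:
        return code, 0          # 0 = normal, 1-20 = single fault
    return code // 100, code % 100

print(decode_fault_code(104))   # (1, 4): faults 1 and 4 active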

Configurable Parameters

All simulation parameters can be customized via CLI or Python API:

Parameter CLI Flag Default Description
n_simulations --n-simulations 500 Simulations per fault type
train_duration_hours --train-duration 25.0 Training simulation duration (hours)
val_duration_hours --val-duration 48.0 Validation simulation duration (hours)
test_duration_hours --test-duration 48.0 Testing simulation duration (hours)
sampling_interval_min --sampling-interval 3.0 Sampling interval (minutes)
fault_onset_hours --fault-onset 1.0 Fault onset time for val/test (hours)
n_faults --faults 20 Number of fault types (or specific list)
output_formats --format npy Output format(s): npy, csv, hdf5
n_workers --workers 1 Number of parallel workers (-1 for all CPUs)
columns --columns all Column subset or group name

CLI example with custom parameters:

python examples/rieth2017_dataset.py \
    --n-simulations 50 \
    --train-duration 10 \
    --test-duration 20 \
    --sampling-interval 1 \
    --fault-onset 0.5 \
    --faults 1,4,6 \
    --format npy,csv \
    --workers 4 \
    --columns key

Python API

from examples.rieth2017_dataset import Rieth2017DatasetGenerator

# Default Rieth 2017 parameters
generator = Rieth2017DatasetGenerator(output_dir="./data/rieth2017")
generator.generate_all()

# Using presets
generator = Rieth2017DatasetGenerator.from_preset("quick", output_dir="./data/quick")
generator.generate_all()

# Preset with overrides
generator = Rieth2017DatasetGenerator.from_preset(
    "quick",
    output_dir="./data/custom",
    n_simulations=20,
    output_formats=["npy", "csv"],
)

# Custom parameters with all new features
generator = Rieth2017DatasetGenerator(
    output_dir="./data/custom",
    n_simulations=100,
    train_duration_hours=10.0,
    test_duration_hours=20.0,
    sampling_interval_min=1.0,
    fault_onset_hours=0.5,
    output_formats=["npy", "csv"],  # Multiple formats
    n_workers=4,                     # Parallel generation
    columns="key",                   # Column subset
)
generator.generate_all(fault_numbers=[1, 4, 6])

# List available presets and column groups
print(Rieth2017DatasetGenerator.list_presets())
print(Rieth2017DatasetGenerator.list_column_groups())

# Or generate specific files
generator.generate_fault_free_training(n_simulations=500)
generator.generate_faulty_testing(fault_numbers=[1, 4, 6], n_simulations=100)

Loading Generated Data

from examples.rieth2017_dataset import load_rieth2017_dataset, get_fault_data, get_features

# Load all data files
data = load_rieth2017_dataset("./data/rieth2017")

# Access fault-free testing data
normal_test = data["fault_free_testing"]

# Extract data for a specific fault
fault1_data = get_fault_data(data["faulty_testing"], fault_number=1)

# Get feature columns only (52 columns: 41 xmeas + 11 xmv)
features = get_features(fault1_data)

Comparing with Harvard Dataverse Original

The script can download the original dataset from Harvard Dataverse and compare it with locally generated data:

# Download original dataset from Harvard Dataverse
python examples/rieth2017_dataset.py --download-harvard

# Compare generated data with original
python examples/rieth2017_dataset.py --compare

# Requirements for comparison
pip install requests pyreadr

Python API:

from examples.rieth2017_dataset import (
    HarvardDataverseDataset,
    compare_datasets,
    compare_with_harvard,
)

# Download and load original dataset
harvard = HarvardDataverseDataset()
harvard.download()
original_data = harvard.load("fault_free_training")

# Compare with generated data
results = compare_with_harvard(local_dir="./data/rieth2017")

# Or compare specific arrays
from examples.rieth2017_dataset import load_rieth2017_dataset
local_data = load_rieth2017_dataset("./data/rieth2017")
comparison = compare_datasets(
    local_data["fault_free_training"],
    original_data,
    name="fault_free_training"
)

The comparison reports:

  • Shape differences
  • Per-fault statistics for key variables
  • Mean correlation between datasets
  • Mean absolute percentage error (MAPE)
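
For reference, the correlation and MAPE metrics can be computed from two aligned arrays along these lines. This is a sketch of the underlying math, not necessarily the script's exact implementation:

import numpy as np

def column_metrics(a: np.ndarray, b: np.ndarray, eps: float = 1e-9):
    """Mean per-column Pearson correlation and overall MAPE between two
    arrays of identical shape (n_samples, n_features)."""
    corrs = [np.corrcoef(a[:, j], b[:, j])[0, 1] for j in range(a.shape[1])]
    mape = np.mean(np.abs((a - b) / (np.abs(b) + eps))) * 100
    return float(np.mean(corrs)), float(mape)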

Related Datasets

Original Braatz Dataset

Reinartz et al. 2021 Extended Dataset

References

  1. Rieth, C.A., Amsel, B.D., Tran, R., Cook, M.B. (2018). Issues and Advances in Anomaly Detection Evaluation for Joint Human-Automated Systems. In: Advances in Human Factors in Robots and Unmanned Systems. AHFE 2017. Advances in Intelligent Systems and Computing, vol 595. Springer, Cham.

  2. Downs, J.J., Vogel, E.F. (1993). A plant-wide industrial process control problem. Computers & Chemical Engineering, 17(3), 245-255.

  3. Russell, E.L., Chiang, L.H., Braatz, R.D. (2000). Data-driven Methods for Fault Detection and Diagnosis in Chemical Processes. Springer-Verlag, London.