ICML 2026

Adaptive Time Series Reasoning via Segment Selection

Shvat Messica1, Jiawen Zhang2,*, Kevin Li3,*, Theodoros Tsiligkaridis4, Marinka Zitnik1

1Harvard Medical School  ·  2HKUST (Guangzhou)  ·  3MIT  ·  4MIT Lincoln Laboratory  ·  *Equal contribution

Figure: ARTIST overview. (a) Time-series reasoning: answering a natural-language question given a time series. (b) ARTIST alternates between reasoning and adaptive segment selection, choosing the next segment based on the question and intermediate outputs, and stopping once it can produce the final answer.

Overview

Time-series reasoning tasks start with a natural-language question and require targeted analysis of a time series. Evidence may span the full series or appear in only a few short intervals, so the model must decide what to inspect. Most existing approaches encode the entire series into a fixed representation before inference, regardless of relevance.

ARTIST formulates time-series reasoning as a sequential decision problem. It interleaves reasoning with adaptive temporal segment selection using a controller–reasoner architecture and trains both roles with reinforcement learning.

Rather than relying on a static summary of the full sequence, ARTIST actively acquires task-relevant information at inference time. A hierarchical, collaborative self-play post-training method lets a single policy perform both segment selection and question answering. Across six benchmarks, this improves average accuracy by 6.46 points over the strongest baseline while typically using only 30–70% of the input series per query.
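The alternation between segment selection and reasoning can be illustrated with a toy loop. This is a minimal sketch under loud assumptions, not ARTIST itself: `toy_controller`, `toy_reasoner`, and `run_episode` are hypothetical names, both roles are hand-written rules rather than an LLM policy, and the "question" is fixed to locating a single level shift in a step-shaped series. It only shows the control flow (select a segment, reason over it, decide whether to stop) and how the fraction of the series actually inspected could be measured.

```python
from typing import List, Optional, Tuple

Segment = Tuple[int, int]  # half-open index range [start, end)


def toy_controller(lo: int, hi: int, note: Optional[str], width: int) -> Optional[Segment]:
    """Hypothetical controller: pick the next segment, or None to stop."""
    if note == "shift" or lo >= hi:
        return None  # evidence already found, or nothing left to inspect
    mid = (lo + hi) // 2
    start = max(lo, mid - width // 2)
    return (start, min(hi, start + width))


def toy_reasoner(series: List[float], seg: Segment) -> str:
    """Hypothetical reasoner: summarize one segment as 'low', 'high', or 'shift'."""
    values = series[seg[0]:seg[1]]
    if all(v < 0.5 for v in values):
        return "low"
    if all(v >= 0.5 for v in values):
        return "high"
    return "shift"


def run_episode(series: List[float], width: int = 8, max_steps: int = 16):
    """Alternate controller and reasoner until the controller stops."""
    lo, hi = 0, len(series)
    note: Optional[str] = None
    visited: List[Segment] = []
    for _ in range(max_steps):
        seg = toy_controller(lo, hi, note, width)
        if seg is None:
            break
        note = toy_reasoner(series, seg)
        visited.append(seg)
        if note == "low":
            lo = seg[1]   # shift must lie at or after this segment's end
        elif note == "high":
            hi = seg[0]   # shift must lie at or before this segment's start

    # Final answer: first high index in the last segment, or the narrowed boundary.
    answer: Optional[int] = None
    if note == "shift":
        last = visited[-1]
        answer = next(i for i in range(*last) if series[i] >= 0.5)
    elif lo >= hi:
        answer = lo

    # Fraction of the series that was actually inspected (union of segments).
    covered = set()
    for a, b in visited:
        covered.update(range(a, b))
    coverage = len(covered) / len(series) if series else 0.0
    return answer, visited, coverage


if __name__ == "__main__":
    step_series = [0.0] * 70 + [1.0] * 30
    print(run_episode(step_series))
```

On the 100-point step series in the usage example, this toy loop inspects four segments and then stops, so only part of the series is ever read. In ARTIST both decisions are made by a learned policy; the rule-based stand-ins here are purely illustrative.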

Method

Figure: ARTIST controller-reasoner architecture and hierarchical policy optimization. ARTIST generates interaction trajectories by alternating a controller (segment selection) and a reasoner (segment-conditioned answers), then performs a joint hierarchical policy update.

ARTIST uses a single policy LLM that operates in two roles for time-series reasoning: a controller that selects the next segment and decides when to stop, and a reasoner that produces segment-conditioned reasoning and the final answer. Training proceeds in two stages: supervised fine-tuning (SFT), followed by reinforcement learning (RL).

Results

Across six time-series reasoning benchmarks spanning clinical, financial, environmental, and general domains, ARTIST outperforms strong text LLMs, time-series reasoning models, and vision-language models. Accuracy and F1 (%) against general LLMs and time-series reasoning baselines:

Model                                  ETI             RCW          ECG-QA        Sleep-QA            TSQA            TRQA            Avg.
                               Acc      F1     Acc      F1     Acc      F1     Acc      F1     Acc      F1     Acc      F1     Acc      F1
Random Guess                 25.00   25.00   50.00   50.00   50.00   50.00   16.67   16.67   29.67   29.67   37.13   37.13   34.74   34.74

General-purpose LLMs
GPT-5
  w/o statistics             29.50   27.90   34.07   34.88   51.98   51.78    0.49    1.77   30.92   27.22   21.50   21.67   28.08   27.54
  w/ statistics              63.54   63.72   32.74   31.08   53.96   49.62    0.49    1.49   35.75   32.24   25.00   28.70   35.25   34.48
Llama-3 8B
  w/o statistics             25.50   25.96   43.97   40.41   50.00   48.45   31.99   14.09   48.97   46.75   57.69   53.88   43.02   38.26
  w/ statistics              35.50   34.39   62.83   45.82   50.00   48.90   22.06   13.34   44.93   44.12   55.50   53.02   45.14   39.93
Qwen3-8B
  w/o statistics             25.25   24.61    0.00    0.00   50.62   48.46    4.90    6.71   11.35   11.02   14.63   15.20   17.88   17.87
  w/ statistics              45.00   40.88    0.00    0.00   51.86   49.06    1.96    3.87   13.53   12.78   15.25   14.02   21.35   20.31
Qwen3-14B
  w/o statistics             32.00   30.66   33.19   27.40   50.99   50.42    3.43    5.73   24.15   22.50   33.00   29.54   29.46   27.08
  w/ statistics              42.00   47.34   22.57   22.91   54.58   51.79    3.43    4.48   24.15   22.20   29.00   30.32   29.29   29.84

Time-series reasoning models
ChatTS-14B
  Base Model                 31.00   21.87   69.91   30.24   48.02   32.07   17.65   13.19   43.48   30.72   57.50   42.12   44.59   28.37
  + SFT                      50.50   40.69   73.89   33.10   53.47   24.31   26.47   14.60   46.38   42.65   69.00   55.57   53.28   35.15
OpenTSLM-4B
  + SFT                      82.69   82.66   65.49   38.29   69.50   41.00   35.37   18.99   47.50   35.81   76.25   69.36   62.80   47.68
ITFormer-4B
  + SFT                      84.62   84.60   67.31   57.95   57.31   49.91   33.62   15.77   49.50   23.62   80.12   74.22   62.08   51.01

ARTIST (ours)
  + SFT                      85.12   85.11   69.75   61.46   56.31   55.68   28.13   17.94   60.06   57.13   82.26   62.32   63.61   56.61
  + SFT + RL                 87.03   87.10   77.00   50.00   69.81   52.67   36.63   19.21   62.00   58.66   83.06   78.02   69.26   57.61
Improvement                  +2.41   +2.50   +3.11   +3.51   +3.14   +3.89   +1.26   +0.22  +12.50  +11.91   +2.94   +3.80   +6.46   +6.60

Improvement is the gain over the strongest baseline per dataset and metric. The largest gains appear on rare-event localization and multi-segment reasoning (up to +12.5 points). Qwen3-8B failed to produce the required answer template on RCW, so its scores there are reported as 0.00. See the paper for vision-language comparisons, ablations, long-sequence scalability, and inference-cost analyses.
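As a worked check of how an Improvement entry can be read, the sketch below recomputes two of them from the accuracy values in the table: TSQA and the six-benchmark average. The helper name `improvement` and the dictionary layout are assumptions made for illustration, and the computation assumes the ARTIST (SFT + RL) score is the one compared against the best baseline for these two columns.

```python
from typing import Dict


def improvement(ours: float, baselines: Dict[str, float]) -> float:
    """Gain of our score over the strongest baseline, rounded to two decimals."""
    strongest = max(baselines.values())
    return round(ours - strongest, 2)


# TSQA accuracy (%) for every baseline row in the table.
tsqa_acc = {
    "GPT-5 w/o statistics": 30.92,
    "GPT-5 w/ statistics": 35.75,
    "Llama-3 8B w/o statistics": 48.97,
    "Llama-3 8B w/ statistics": 44.93,
    "Qwen3-8B w/o statistics": 11.35,
    "Qwen3-8B w/ statistics": 13.53,
    "Qwen3-14B w/o statistics": 24.15,
    "Qwen3-14B w/ statistics": 24.15,
    "ChatTS-14B base": 43.48,
    "ChatTS-14B + SFT": 46.38,
    "OpenTSLM-4B + SFT": 47.50,
    "ITFormer-4B + SFT": 49.50,
}

# Average accuracy (%) across the six benchmarks for every baseline row.
avg_acc = {
    "GPT-5 w/o statistics": 28.08,
    "GPT-5 w/ statistics": 35.25,
    "Llama-3 8B w/o statistics": 43.02,
    "Llama-3 8B w/ statistics": 45.14,
    "Qwen3-8B w/o statistics": 17.88,
    "Qwen3-8B w/ statistics": 21.35,
    "Qwen3-14B w/o statistics": 29.46,
    "Qwen3-14B w/ statistics": 29.29,
    "ChatTS-14B base": 44.59,
    "ChatTS-14B + SFT": 53.28,
    "OpenTSLM-4B + SFT": 62.80,
    "ITFormer-4B + SFT": 62.08,
}

tsqa_gain = improvement(62.00, tsqa_acc)  # ARTIST + SFT + RL on TSQA
avg_gain = improvement(69.26, avg_acc)    # ARTIST + SFT + RL, average accuracy

if __name__ == "__main__":
    print(f"TSQA Acc gain: +{tsqa_gain:.2f}")
    print(f"Avg. Acc gain: +{avg_gain:.2f}")
```

For these two columns the strongest baselines are ITFormer-4B (TSQA) and OpenTSLM-4B (average), which is why the comparison point differs from column to column.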

Citation

@inproceedings{messica2026artist,
  title     = {Adaptive Time Series Reasoning via Segment Selection},
  author    = {Messica, Shvat and Zhang, Jiawen and Li, Kevin and
               Tsiligkaridis, Theodoros and Zitnik, Marinka},
  booktitle = {Proceedings of the 43rd International Conference on Machine Learning (ICML)},
  year      = {2026}
}