Data Analysis

Ask & explore

summarise, plot, segments, BI — answered in plain English the moment you load the file.

"how many companies?" · "plot ARR" →

Data Science

Train a model

forecast, predict, classify, optimize — OctOpus runs experiments and ships a model.pkl.

"predict who hits $10M ARR" →

Data Engineering

Clean & transform

clean, dedupe, normalise, ETL — every message becomes a transform step on your data.

"drop dupes · normalise arr" →

Connect your data

21 connectors · or drop a CSV / Parquet / ZIP

CSV / Parquet

file upload

BigQuery

warehouse

Snowflake

warehouse

Databricks

warehouse

Redshift

warehouse

Postgres

database

MySQL

database

MongoDB

database

Amazon S3

object store

Google Sheets

SaaS

Notion

SaaS

ClickHouse

warehouse

yc_recent_batches.csv

100 companies · 17 cols · batches S22 → S24

company_name	batch	industry	arr_usd	post_money_valuation_usd	mom_growth_pct	months_since_demo_day
YukonSec	S23	Security	541,153	42M	14.2	12
NovaOS	W23	Dev Tools	532,890	38M	9.8	18
SpireAPI	S22	API Infra	463,724	55M	11.4	24
EchoLabs	W24	AI / ML	421,224	120M	22.5	6
IrisPay	S22	Fintech	403,950	68M	8.1	24
HaloLabs	S23	AI / ML	253,504	72M	17.8	12
ZeroBio	W23	Biotech	235,961	28M	6.2	18
FluxSec	S24	Security	204,238	15M	28.4	3
ArcEngine	S24	Dev Tools	193,454	12M	31.0	3
KaiTech	W24	AI / ML	191,923	85M	19.2	6

Data Engineering · live transforms

DE mode on

DROP · post_money_valuation_usd · remove column

FORMAT · founded · text → date · YYYY-MM

NEW COL · arr_per_employee = arr_usd / employees

MERGE · + funding_rounds.csv · left join on company_id

Result · 97 rows · 6 cols · recipe saved

company	founded	arr_musd	arr_per_employee (new)	last_round (merged)
YukonSec	2023-04	0.54	31.8k	Seed
NovaOS	2023-01	0.53	29.6k	Series A
SpireAPI	2022-08	0.46	24.4k	Series A
EchoLabs	2024-02	0.42	42.1k	Seed
IrisPay	2022-06	0.40	18.4k	Series B
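
The four transform steps above can be sketched in plain Python. This is a minimal illustration, not OctOpus's actual recipe format: the sample rows, the employee counts, the "Apr 2023" input date format, the contents of funding_rounds.csv, and the helper name apply_recipe are all assumptions made for the example.

```python
from datetime import datetime

# Hypothetical input rows; employee counts and date text are illustrative.
companies = [
    {"company_id": 1, "company": "YukonSec", "founded": "Apr 2023",
     "arr_usd": 541_153, "employees": 17,
     "post_money_valuation_usd": 42_000_000},
    {"company_id": 2, "company": "NovaOS", "founded": "Jan 2023",
     "arr_usd": 532_890, "employees": 18,
     "post_money_valuation_usd": 38_000_000},
]

# Hypothetical contents of funding_rounds.csv (only one company matches).
funding_rounds = [{"company_id": 1, "last_round": "Seed"}]


def apply_recipe(rows, rounds):
    """Apply DROP, FORMAT, NEW COL and MERGE (left join) to each row."""
    round_by_id = {r["company_id"]: r["last_round"] for r in rounds}
    result = []
    for row in rows:
        out = dict(row)
        # DROP: remove the valuation column.
        out.pop("post_money_valuation_usd", None)
        # FORMAT: parse the text date and re-emit it as YYYY-MM.
        parsed = datetime.strptime(out["founded"], "%b %Y")
        out["founded"] = parsed.strftime("%Y-%m")
        # NEW COL: derive ARR per employee.
        out["arr_per_employee"] = out["arr_usd"] / out["employees"]
        # MERGE: a left join keeps unmatched rows and fills None.
        out["last_round"] = round_by_id.get(out["company_id"])
        result.append(out)
    return result


cleaned = apply_recipe(companies, funding_rounds)
```

Because the merge is a left join, NovaOS stays in the result even though the hypothetical funding file has no row for it; its last_round is simply empty.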

ARR distribution across industries

100 companies · median $23.6k · skew right

AI / ML

$284k

Fintech

$231k

Dev Tools

$190k

Security

$148k

API Infra

$112k

Biotech

$64k

Dashboard · YC cohort overview

auto-generated

$7.06M

Total ARR

Companies

14.8%

Median MoM growth

AI / ML

Top industry

ARR by industry

ARR trend · S22 → S24

Industry mix

MoM growth vs ARR

Research plan · program.md

target: reached_10m_arr (binary derived from arr_usd)

task: classification

metric: ROC-AUC, holdout-validated

split: group k-fold by batch (so later cohorts can't leak into training)

features: industry, founders, technical_founders, employees, last_round_usd, runway_months, mom_growth_pct, has_paying_customers, months_since_demo_day

experiments: 1) CatBoost · 2) LightGBM (Optuna 30 trials) · 3) TabPFN · 4) stacking ensemble

guardrails: leakage detection, holdout outside agent workspace, scrubbed env vars
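
The split line in the plan is the part most worth seeing in code. Below is a minimal pure-Python sketch of a group k-fold splitter, assuming whole batches are assigned to folds round-robin; the function name group_k_fold is hypothetical, and the page does not say which splitter OctOpus actually uses (a library splitter such as scikit-learn's GroupKFold would be a natural choice).

```python
def group_k_fold(groups, n_splits):
    """Yield (train_idx, val_idx) pairs where no group spans both sides."""
    unique = sorted(set(groups))
    if n_splits > len(unique):
        raise ValueError("n_splits cannot exceed the number of groups")
    # Assign each whole group (batch) to exactly one fold, round-robin.
    fold_of = {g: i % n_splits for i, g in enumerate(unique)}
    for fold in range(n_splits):
        train = [i for i, g in enumerate(groups) if fold_of[g] != fold]
        val = [i for i, g in enumerate(groups) if fold_of[g] == fold]
        yield train, val


# Batch labels of the ten preview rows shown in the data table.
batches = ["S23", "W23", "S22", "W24", "S22",
           "S23", "W23", "S24", "S24", "W24"]

folds = list(group_k_fold(batches, n_splits=5))
for train_idx, val_idx in folds:
    held_out = sorted({batches[i] for i in val_idx})
    print(f"validate on {held_out}, train on {len(train_idx)} rows")
```

Every row of a given batch lands on the same side of each split, so a model is never validated on a batch it also saw during training.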

Research loop

running 4 experiments

CatBoost · baseline

AUC 0.82

LightGBM · Optuna ×30

AUC 0.85

TabPFN · zero-shot

AUC 0.87

Stacking ensemble · 3 base learners

AUC 0.89

WINNER

Stacking ensemble · validated winner

0.89 AUC · holdout
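
For readers unfamiliar with the metric, ROC-AUC is the probability that a randomly chosen positive example is scored above a randomly chosen negative one. The sketch below computes it by direct pair counting; the labels and scores are made-up values for illustration, not OctOpus output, and the function name roc_auc is hypothetical.

```python
def roc_auc(labels, scores):
    """ROC-AUC as the share of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # positive ranked above negative
            elif p == n:
                wins += 0.5   # ties count as half a win
    return wins / (len(pos) * len(neg))


# Hypothetical holdout labels (1 = reached_10m_arr) and model scores.
holdout_labels = [0, 0, 1, 1]
holdout_scores = [0.10, 0.40, 0.35, 0.80]
auc = roc_auc(holdout_labels, holdout_scores)
```

Here three of the four positive/negative pairs are ordered correctly, so the score is 0.75. Pair counting is quadratic in the number of rows, which is fine for a dataset of 100 companies but not for large ones.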

Octopus

Hi, I'm OctOpus. Drop a dataset and tell me your business objective — I'll build, validate, and deliver a model, or answer questions about your data.

21:29

Octopus

Got it. I can see your yc_recent_batches.csv dataset. Profiling it now…

21:30

drop duplicate rows, normalise arr to $M, and flag companies past $10M ARR

21:30

Octopus

DE mode — applied 3 transform steps: removed 3 duplicates, normalised arr_usd → arr_musd, and derived reached_10m_arr. Saved as a reusable recipe — your data is clean and model-ready.

21:30
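
The three steps OctOpus reports (dedupe, normalise to $M, derive the flag) could look roughly like this in plain Python. It is a sketch under assumptions: duplicates are taken to mean fully identical rows, "past $10M" is read as arr_usd >= 10,000,000, the function name clean_rows is hypothetical, and the ExampleCo row is invented to show a positive flag.

```python
def clean_rows(rows):
    """Drop exact duplicate rows, convert ARR to $M, flag $10M+ ARR."""
    seen = set()
    out = []
    for row in rows:
        key = tuple(sorted(row.items()))
        if key in seen:
            continue  # exact duplicate of an earlier row
        seen.add(key)
        out.append({
            "company_name": row["company_name"],
            "arr_musd": round(row["arr_usd"] / 1_000_000, 2),
            # Assumption: the threshold is inclusive.
            "reached_10m_arr": row["arr_usd"] >= 10_000_000,
        })
    return out


raw = [
    {"company_name": "YukonSec", "arr_usd": 541_153},
    {"company_name": "YukonSec", "arr_usd": 541_153},    # duplicate row
    {"company_name": "NovaOS", "arr_usd": 532_890},
    {"company_name": "ExampleCo", "arr_usd": 12_000_000},  # invented row
]
cleaned = clean_rows(raw)
```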

Octopus

Goal · What's your goal?

21:30

Data insight (Ask OctOpus)

21:30

Octopus

Insights · Focus · What should I inspect?

21:30

Summary

21:30

Octopus

Here's what I see in this dataset:

It covers Y Combinator companies from recent batches — 17 columns capturing identity, funding, growth, and team composition.

Key financial signals available: arr_usd, last_round_usd, post_money_valuation_usd, runway_months. mom_growth_pct is the velocity signal.

Team data: founders, technical_founders, employees capture team composition.

Stage spread: has_paying_customers is a binary flag separating pre-revenue from revenue-generating companies.

Targets worth modeling: post_money_valuation_usd, runway_months, or a derived reached_10m_arr binary.

21:31

what's the total arr and the top companies?

21:31

Octopus

Across 100 companies, total ARR is $7,056,039. Top by ARR is YukonSec at $541,153 — there's a notable spread between leaders and the rest.

company_name	arr_usd
YukonSec	541,153
NovaOS	532,890
SpireAPI	463,724
EchoLabs	421,224
IrisPay	403,950

+ 95 more rows

21:31
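
The question above reduces to a sum and a sort. A minimal sketch follows, using only the ten preview rows visible on this page, so the total it computes is smaller than the $7,056,039 reported for all 100 companies.

```python
# (company_name, arr_usd) for the ten preview rows shown earlier on the page.
rows = [
    ("YukonSec", 541_153), ("NovaOS", 532_890), ("SpireAPI", 463_724),
    ("EchoLabs", 421_224), ("IrisPay", 403_950), ("HaloLabs", 253_504),
    ("ZeroBio", 235_961), ("FluxSec", 204_238), ("ArcEngine", 193_454),
    ("KaiTech", 191_923),
]

# Total ARR across the rows we have.
total_arr = sum(arr for _, arr in rows)

# Top five companies by ARR, highest first.
top5 = sorted(rows, key=lambda r: r[1], reverse=True)[:5]

print(f"total ARR (preview rows only): ${total_arr:,}")
for name, arr in top5:
    print(f"{name}\t{arr:,}")
```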

show me arr distribution across industries

21:31

Octopus

ARR distribution across industries

arr_usd ranges $0 → $541k, median $23.6k, mean $110.2k (right-skewed). AI/ML leads with $284k median.

21:31
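
The summary above rests on two calculations: a median per industry, and a comparison of mean against median as a rough indicator of right skew. A sketch using Python's statistics module on the ten preview rows is shown below; with only ten rows the figures will not match the full-dataset numbers quoted by OctOpus.

```python
from collections import defaultdict
from statistics import mean, median

# (industry, arr_usd) for the ten preview rows shown earlier on the page.
rows = [
    ("Security", 541_153), ("Dev Tools", 532_890), ("API Infra", 463_724),
    ("AI / ML", 421_224), ("Fintech", 403_950), ("AI / ML", 253_504),
    ("Biotech", 235_961), ("Security", 204_238), ("Dev Tools", 193_454),
    ("AI / ML", 191_923),
]

# Group ARR values by industry, then take the median of each group.
by_industry = defaultdict(list)
for industry, arr in rows:
    by_industry[industry].append(arr)
median_by_industry = {k: median(v) for k, v in by_industry.items()}

# Rough heuristic: a mean above the median suggests a right-skewed sample.
all_arr = [arr for _, arr in rows]
right_skewed = mean(all_arr) > median(all_arr)
```

The mean-versus-median check is only a heuristic; a proper skewness statistic would need more rows than the preview provides.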

build a dashboard of the key metrics

21:31

Octopus

Dashboard ready. Built 4 KPI tiles and 4 charts — ARR by industry, ARR trend, industry mix, and growth vs ARR. Opened it in the workspace; export as PNG or embed anytime.

21:32

train a model that predicts $10M arr

21:32

Octopus

Plan ready. Group k-fold by batch so later cohorts can't leak. Open the Plan tab to inspect, or type go.

21:32

looks good, run it

21:32

Octopus

Running 4 experiments — CatBoost baseline, LightGBM with Optuna, TabPFN zero-shot, and a stacking ensemble of the strongest base learners.

21:33

Octopus

Done. Best model: stacking ensemble at AUC 0.89 (holdout). model.pkl ready in the Results tab.

21:35

See the public leaderboard right after the product walkthrough.

Reproduce any benchmark in the product