Extract MLB Prospect Lists with LLMs — No Code Needed
I’ve been tinkering with an idea: a site that combines publicly available MLB prospect lists and lets you weight each outlet’s rankings however you like. The data is messy, and the means of acquisition vary from source to source, but LLMs make extraction fast and surprisingly accurate.
Prospect list data acquisition ranges from very clean to ugly. Here's a quick survey of how lists are presented to the reader and therefore available to crawlers:
- Baseball Prospectus
- Data: table in an iframe. Table offers CSV download
- Features: name, rank, org, and one position.
- Count: 101 players
- Fangraphs (The Board)
- Data: CSV download
- Features: name, rank, org rank, org, position(s), level, future value, and much more
- Count: 1,051 players
- ESPN: Only the top 50 list is public; the top 100 is for Insiders only. The list is written as a news article, so rankings are interspersed with blurbs that each cover one or more players.
- Data: article text
- Features: name, rank, org, age, and position(s)
- Count: 50 players
- MLB: Rankings are in a data table. Scooping this data seems easy until you notice that each player's organization is an SVG graphic with a filename like "112.svg". Inspecting the page source shows a JSON payload that likely populates the table, and team codes are part of that structure. Using that, we have name, rank, org, position, age, and level features. A further complication is that the JSON data structure is var-declared inside a script tag. You can use regular expressions if you're feeling bold or, as I chose, tree-sitter (a parsing tool for analyzing source code) to parse and then walk (but not execute) the JavaScript to extract the value.
- Data: global-level JavaScript variable (window.data)
- Features: name, rank, org, age, position(s), player stats
- Count: 100
- Razzball: rankings spread across four (4) URLs
- Data: article text
- Features: name, rank, org, position(s), age, level, and ETA
- Count: 100, 25 per URL
- Just Baseball:
- Data: data table of rankings
- Features: name, rank, org, age, level, position(s), ETA, and future value
- Count: 100
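For the MLB case, here is a minimal sketch of the lighter-weight route: find the assignment with a regular expression, then let Python's JSON decoder read exactly one value from that position. This is not the tree-sitter approach described above, and it assumes the payload is strict JSON rather than a looser JavaScript object literal. The sample markup and the field names (players, rank, team) are hypothetical.

```python
import json
import re


def extract_js_json(html: str, variable: str = "window.data"):
    """Return the JSON value assigned to `variable` in inline script text."""
    # Locate "<variable> =" and note where the assigned value starts.
    match = re.search(re.escape(variable) + r"\s*=\s*", html)
    if match is None:
        raise ValueError(f"{variable} assignment not found")
    # raw_decode parses a single JSON value starting at the given index and
    # ignores whatever follows it (semicolon, more JavaScript, closing tag).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


# Hypothetical page fragment; real pages will have a different structure.
sample = (
    '<script>var x = 1; window.data = '
    '{"players": [{"rank": 1, "team": 112}]};</script>'
)
data = extract_js_json(sample)
print(data["players"][0]["team"])  # prints 112
```

If the payload turns out not to be strict JSON (unquoted keys, trailing commas), this will raise a decoding error, which is where a real parser such as tree-sitter earns its keep.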
Collecting these lists means downloading files, scraping data tables, and in a few cases, doing some advanced inference from unstructured text. When lists are baked into editorial, extraction is all the harder. Often, recreating the list into a structured format by hand is faster and safer than building complicated regular expressions. LLMs are making this a thing of the past.
Recipe
Using Simon Willison's marvelous llm tool and the overly capable trafilatura library, automating list extraction from editorial content is now incredibly simple. I'm using OpenAI with gpt-4.1-nano, and I expect this will work well with comparable models. I'm doing this at the CLI level for simplicity.
Let's use ESPN and Razzball as examples of how LLMs take over the grunt work of converting the lurid prose of prospect lists into structured data.
Steps
- trafilatura extracts what it heuristically interprets to be the primary content of the page, and then converts HTML to Markdown.
- llm receives that Markdown, and prompts my selected model to extract players from the list in a specific format. My prompt is simple, terse but unambiguous in its request.
The LLM should return a CSV with the following columns:
- rank: List ranking
- name: Player's name
- age: Player's age
- pos: Position(s) played
- org: Team/organization
- level: Minor or major league level, where available
The prompts for each list are generally uniform, designed to be simple and reusable: note that this is a list, tell the LLM how many players to expect, and be deliberate about extraction parameters and output format.
Read this text and create a list of the top prospects mentioned with their name, age, position, team name, and level. Be aware that rankings might not start at 1. You must extract all 25 players. Output in strict CSV format with these columns: rank,name,age,pos,org,level.
LLM prompt example for Razzball
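Because the prompts differ only in list size, columns, and the occasional quoting instruction, they can be generated from a template. The sketch below is one way to do that; the function name and its parameters are hypothetical, and the wording is a light paraphrase of the prompt above rather than the exact text used for every outlet.

```python
from typing import Sequence


def build_system_prompt(
    count: int,
    columns: Sequence[str],
    quote_fields: Sequence[str] = (),
) -> str:
    """Assemble a list-extraction prompt from the per-outlet parameters."""
    prompt = (
        "Read this text and create a list of the top prospects mentioned. "
        "Be aware that rankings might not start at 1. "
        f"You must extract all {count} players. "
        f"Output in strict CSV format with these columns: {','.join(columns)}."
    )
    # Optional instruction for outlets whose fields may contain commas.
    if quote_fields:
        prompt += f" Quote the {' and '.join(quote_fields)} fields."
    return prompt


razzball_prompt = build_system_prompt(
    25, ["rank", "name", "age", "pos", "org", "level"]
)
print(razzball_prompt)
```

Stating the expected count in the prompt also gives you a number to check the response against afterwards.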
Setup
First, install llm and trafilatura in a fresh Python environment. I'm using uv, so setup looks something like this:
uv init -p 3.12
uv venv
uv add llm trafilatura
Installation instructions with uv
Once installed, configure llm with your OpenAI API key, pasting it in when prompted:
uv run llm keys set openai
Configure llm
Running
Let's start with Razzball, who publish rankings in an article-like format. There isn't structured metadata to use, nor are they publishing in a table, so basic crawling and regex fiddling won't help us. Let's see what an LLM can do.
uv run trafilatura --markdown -u "https://razzball.com/top-100-prospects-for-2025-dynasty-fantasy-baseball/" | uv run llm --system "read this text and create a list of the top prospects mentioned with their name, age, position, team name, and level. be aware that rankings might not start at 1. you must extract all 25 players. output in strict csv format with these columns: rank,name,age,pos,org,level" -m gpt-4.1-nano
Piping markdown version of Razzball's list to OpenAI for list extraction
Running this will fetch Razzball's #76-100 prospects, and enumerate them in the specified CSV format. Sample output:
rank  name            age  pos  org        level
76    Colt Emerson    19   SS   Mariners   A+
77    Jac Caglianone  22   OF   Royals     A+
78    Charlie Condon  21   3B   Rockies    A+
79    Bryce Rainer    19   SS   Tigers     NA
80    Yairo Padilla   17   SS   Cardinals  DSL
Sample output of extraction results from Razzball, formatted by xsv
The Razzball prospect list was correctly extracted, including the rankings starting from 76.
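Eyeballing the output works for one list, but the same check can be automated. The following sketch, using only the standard library, verifies the header, the row count, and that ranks run consecutively from wherever they start. The function name is hypothetical, and the sample rows are the first three from the output above, rewritten as raw CSV.

```python
import csv
import io

EXPECTED_COLUMNS = ["rank", "name", "age", "pos", "org", "level"]


def check_extraction(csv_text, expected_count, columns=EXPECTED_COLUMNS):
    """Return a list of problems found in the model's CSV (empty if none)."""
    reader = csv.DictReader(io.StringIO(csv_text.strip()))
    problems = []
    if reader.fieldnames != list(columns):
        problems.append(f"unexpected header: {reader.fieldnames}")
    rows = list(reader)
    if len(rows) != expected_count:
        problems.append(f"expected {expected_count} rows, got {len(rows)}")
    try:
        ranks = [int(row["rank"]) for row in rows]
    except (KeyError, ValueError, TypeError):
        problems.append("rank column is missing or not an integer")
    else:
        # Ranks may start anywhere (76 here) but should not skip or repeat.
        if ranks and ranks != list(range(ranks[0], ranks[0] + len(ranks))):
            problems.append("ranks are not consecutive")
    return problems


sample = """rank,name,age,pos,org,level
76,Colt Emerson,19,SS,Mariners,A+
77,Jac Caglianone,22,OF,Royals,A+
78,Charlie Condon,21,3B,Rockies,A+
"""
print(check_extraction(sample, expected_count=3))   # prints []
print(check_extraction(sample, expected_count=25))  # reports the short list
```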
Let's give the LLM something harder: ESPN's top 50 list. This page is copy-heavy, and has rankings stacked or broken out with blurbs depending on the player.
uv run trafilatura --markdown -u "https://www.espn.com/mlb/story/_/id/45223803/top-50-2025-mlb-prospects-updated-roman-anthony-bubba-chandler-marcelo-mayer" | uv run llm --system "read this text and create a list of the top prospects mentioned with their name, age, position, and team name. you must extract all 50 players. output in strict csv format with these columns: rank,name,age,pos,org" -m gpt-4.1-nano
Piping markdown version of ESPN's list to OpenAI for list extraction
Again, that worked quite well.
rank  name               age  pos  org
1     Roman Anthony      21   OF   Boston Red Sox
2     Bubba Chandler     22   RHP  Pittsburgh Pirates
3     Leodalis De Vries  19   SS   San Diego Padres
4     Sebastian Walcott  19   SS   Texas Rangers
5     Jesus Made         18   SS   Milwaukee Brewers
Sample output of extraction results from ESPN, formatted by xsv
Finally, let's automate Just Baseball, who have a very crawl-ready data table. It would be easy enough to scrape the list with requests-html or lxml, but it's Saturday.
uv run trafilatura --markdown -u "https://www.justbaseball.com/prospects/top-100-mlb-prospects/" | uv run llm --system "read this text and create a list of the top prospects mentioned with their name, age, position, level, and team name. you must extract all 100 players. output in strict csv format with these columns: rank,name,age,pos,org,level. quote the name and position fields." -m gpt-4.1-nano
Piping markdown version of Just Baseball's list to OpenAI for list extraction
1  Roman Anthony      OF      Boston Red Sox      AAA
2  Bubba Chandler     RHP     Pittsburgh Pirates  AAA
3  Leodalis De Vries  SS      San Diego Padres    A+
4  Kevin McGonigle    SS      Detroit Tigers      A+
5  Sebastian Walcott  SS,3B   Texas Rangers       AA
Sample output of extraction results from Just Baseball, formatted by xsv
LLMs sometimes struggle with quoting CSV fields, especially when values contain commas. Be explicit in your prompt if you need quoted output, and always double-check output.
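A quick way to catch quoting mistakes is to parse the response with a real CSV parser and flag any row whose field count is off. In this sketch the function name is hypothetical, and the row is the Sebastian Walcott line from the sample above, shown once with quoting and once without.

```python
import csv
import io


def split_rows(csv_text, n_columns):
    """Separate rows with the expected width from rows that are too wide or narrow."""
    good, bad = [], []
    for row in csv.reader(io.StringIO(csv_text.strip())):
        (good if len(row) == n_columns else bad).append(row)
    return good, bad


quoted = '5,"Sebastian Walcott","SS,3B",Texas Rangers,AA'
unquoted = "5,Sebastian Walcott,SS,3B,Texas Rangers,AA"

good, bad = split_rows(quoted + "\n" + unquoted, n_columns=5)
print(good[0][2])   # prints SS,3B (the quoted comma stays inside one field)
print(len(bad[0]))  # prints 6 (the unquoted comma creates an extra field)
```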
At the cost of a couple of minutes and less than $0.01 USD, we now have these lists in a machine-readable format. LLMs are making this almost too easy.
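If you would rather script these runs than retype the shell pipeline, both tools also expose Python APIs. The sketch below mirrors the two CLI steps. It assumes a trafilatura release whose extract() accepts output_format="markdown" and an llm install with an OpenAI key already configured. The wrapper function names are hypothetical, and the fetch and ask parameters exist only so each step can be swapped out or stubbed.

```python
def fetch_markdown(url: str) -> str:
    """Download a page and return its main content as Markdown."""
    import trafilatura  # third-party; imported lazily so stubs work without it

    downloaded = trafilatura.fetch_url(url)
    if downloaded is None:
        raise RuntimeError(f"could not download {url}")
    # Assumption: this trafilatura version supports Markdown output.
    text = trafilatura.extract(downloaded, output_format="markdown")
    if not text:
        raise RuntimeError(f"no main content found at {url}")
    return text


def ask_model(text: str, system: str, model_id: str = "gpt-4.1-nano") -> str:
    """Send the page text to the model with a system prompt; return its reply."""
    import llm  # third-party; reads the key stored by `llm keys set openai`

    model = llm.get_model(model_id)
    return model.prompt(text, system=system).text()


def extract_list(url: str, system: str, fetch=fetch_markdown, ask=ask_model) -> str:
    """Run both steps: page to Markdown, then Markdown to CSV text."""
    return ask(fetch(url), system)
```

Calling extract_list(url, prompt) with one of the URLs and system prompts shown earlier should return the same kind of CSV text as the shell pipeline, subject to the usual run-to-run variation in model output.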
In my application, I've built a single list extraction agent with pydantic-ai. I use its ability to extend system prompts dynamically, injecting custom instructions for output columns and expected list size. Combined with a Pydantic model, this validates every response against a fixed schema, which keeps the output shape consistent and the results far more repeatable.
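To illustrate what a fixed output shape buys you, here is a small stand-in built on a standard-library dataclass rather than an actual Pydantic model or pydantic-ai agent. The class, its fields, and the helper function are hypothetical and only mirror the CSV columns used in this post; a Pydantic model would express the same constraints declaratively.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prospect:
    rank: int
    name: str
    pos: str
    org: str
    age: Optional[int] = None    # not every outlet publishes age
    level: Optional[str] = None  # not every outlet publishes level

    def __post_init__(self):
        # Reject rows that cannot be valid list entries.
        if self.rank < 1:
            raise ValueError(f"rank must be positive, got {self.rank}")
        if not self.name.strip():
            raise ValueError("name must not be empty")


def prospect_from_row(row: dict) -> Prospect:
    """Convert one CSV row (all strings) into a typed, validated record."""
    age = (row.get("age") or "").strip()
    level = (row.get("level") or "").strip()
    return Prospect(
        rank=int(row["rank"]),
        name=row["name"].strip(),
        pos=row["pos"].strip(),
        org=row["org"].strip(),
        age=int(age) if age else None,
        level=level or None,
    )


row = {"rank": "1", "name": "Roman Anthony", "age": "21", "pos": "OF", "org": "Boston Red Sox"}
print(prospect_from_row(row))
```

Rows that fail conversion raise immediately, so a malformed response is caught before it reaches the combined rankings.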
Next up: how to normalize and link player records across outlets, so “Roman Anthony, Red Sox CF prospect” means the same thing everywhere.