Extract MLB Prospect Lists with LLMs — No Code Needed

I’ve been tinkering with an idea: a site that combines publicly available MLB prospect lists and lets you weight each outlet’s rankings however you like. The data is messy, and the means of acquisition vary from source to source, but LLMs make extraction fast and surprisingly accurate.

Prospect list data acquisition ranges from very clean to ugly. Here's a quick survey of how lists are presented to the reader and therefore available to crawlers:

  • Baseball Prospectus
    • Data: table in an iframe. Table offers CSV download
    • Features: name, rank, org, and one position.
    • Count: 101 players
  • Fangraphs (The Board)
    • Data: CSV download
    • Features: name, rank, org rank, org, position(s), level, future value, and much more
    • Count: 1,051 players
  • ESPN: only the top 50 list is public; the top 100 is for Insiders only. The list is written as a news article, so rankings are interspersed with blurbs that cover anywhere from one to several players.
    • Data: article text
    • Features: name, rank, org, age, and position(s)
    • Count: 50 players
  • MLB: rankings in a data table. Scooping this data seems easy until you notice that player organizations are rendered as SVG graphics with filenames like "112.svg". Inspecting the page source shows a JSON payload that likely populates the table, and team codes are part of that structure. Using that, we have name, rank, org, position, age, and level features. The complication is that the JSON is assigned to a var-declared variable inside a script tag. You can use regular expressions if you're feeling bold, or, as I chose, tree-sitter (a parsing tool for analyzing source code) to parse and then walk (but not execute) the JavaScript to extract the value.
    • Data: global-level JavaScript variable (window.data)
    • Features: name, rank, org, age, position(s), player stats
    • Count: 100
  • Razzball: rankings spread across four (4) URLs
    • Data: article text
    • Features: name, rank, org, position(s), age, level, and ETA
    • Count: 100, 25 per URL
  • Just Baseball:
    • Data: data table of rankings
    • Features: name, rank, org, age, level, position(s), ETA, and future value
    • Count: 100
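
For the MLB case, here is a minimal sketch of a lighter-weight alternative to tree-sitter, not the approach described above. It assumes the value assigned to the variable is strict JSON (quoted keys, no trailing commas), which may not hold for the real page. The function name extract_js_json and the sample payload are hypothetical. A regular expression only locates the assignment; the standard library's JSON decoder then consumes exactly one value and ignores the rest of the script.

```python
import json
import re


def extract_js_json(html: str, var_name: str = "data"):
    """Return the JSON value assigned to `var_name` (or window.<var_name>) in inline JavaScript."""
    # Locate `var data =`, `let data =`, `const data =` or `window.data =`,
    # stopping right before the first character of the value.
    pattern = re.compile(
        r"(?:\b(?:var|let|const)\s+|\bwindow\.)" + re.escape(var_name) + r"\s*=\s*"
    )
    match = pattern.search(html)
    if match is None:
        raise ValueError(f"no assignment to {var_name!r} found")
    # raw_decode parses one JSON value starting at the given index and
    # ignores whatever follows it (the semicolon, more script, closing tags).
    value, _end = json.JSONDecoder().raw_decode(html, match.end())
    return value


# Hypothetical page fragment; the real payload's structure will differ.
sample_html = """
<script>
  var data = {"prospects": [{"rank": 1, "name": "Player One", "team": "XYZ"}]};
  console.log("ready");
</script>
"""
payload = extract_js_json(sample_html)
print(payload["prospects"][0]["name"])  # prints: Player One
```

If the payload turns out to be a JavaScript object literal rather than strict JSON, this shortcut fails with a JSONDecodeError, and a real parser such as tree-sitter is the safer route.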

Collecting these lists means downloading files, scraping data tables, and in a few cases, doing some advanced inference from unstructured text. When lists are baked into editorial copy, extraction is harder still. Often, recreating the list in a structured format by hand is faster and safer than building complicated regular expressions. LLMs are making that manual work a thing of the past.

Recipe

Using Simon Willison's marvelous llm tool and the overly capable trafilatura library, automating list extraction from editorial content is now incredibly simple. I'm using OpenAI with gpt-4.1-nano, and I expect comparable models will work just as well. I'm doing this at the command line for simplicity.

Let's use ESPN and Razzball as examples of how LLMs take over the grunt work of converting the lurid prose of prospect lists into structured data.

Steps

  1. trafilatura extracts what it heuristically interprets to be the primary content of the page, and then converts the HTML to Markdown.
  2. llm receives that Markdown and prompts my selected model to extract players from the list in a specific format. My prompt is simple: terse, but unambiguous in its request.

The LLM should return a CSV with the following columns:

  • rank: List ranking
  • name: Player's name
  • age: Player's age
  • pos: Position(s) played
  • org: Team/organization
  • level: Minor or major league level, where available.

The prompts for each list are generally uniform, designed to be simple and reusable: note that this is a list, tell the LLM how many players to expect, and be deliberate about extraction parameters and output format.

Read this text and create a list of the top prospects mentioned with their name, age, position, team name, and level. Be aware that rankings might not start at 1. You must extract all 25 players. Output in strict CSV format with these columns: rank,name,age,pos,org,level.

LLM prompt example for Razzball
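
Because the prompts differ only in player count, columns, and a couple of optional hints, they can be templated. Below is a minimal sketch; the helper build_system_prompt and the FIELD_LABELS mapping are hypothetical names, and the wording is lifted from the prompts in this post.

```python
# Column name -> how the field is described to the model.
FIELD_LABELS = {
    "name": "name",
    "age": "age",
    "pos": "position",
    "org": "team name",
    "level": "level",
}


def _join_words(words):
    """Join words as 'a', 'a and b', or 'a, b, and c'."""
    if len(words) <= 2:
        return " and ".join(words)
    return ", ".join(words[:-1]) + ", and " + words[-1]


def build_system_prompt(count, columns, ranks_start_at_one=True, quoted=()):
    """Assemble a list-extraction prompt for `count` players and the given CSV columns."""
    labels = [FIELD_LABELS[c] for c in columns if c in FIELD_LABELS]
    if not labels:
        raise ValueError("at least one describable column is required")
    parts = [
        "Read this text and create a list of the top prospects mentioned "
        f"with their {_join_words(labels)}."
    ]
    if not ranks_start_at_one:
        parts.append("Be aware that rankings might not start at 1.")
    parts.append(f"You must extract all {count} players.")
    parts.append(
        "Output in strict CSV format with these columns: " + ",".join(columns) + "."
    )
    if quoted:
        quoted_labels = [FIELD_LABELS.get(c, c) for c in quoted]
        parts.append(f"Quote the {_join_words(quoted_labels)} fields.")
    return " ".join(parts)


prompt = build_system_prompt(
    25,
    ["rank", "name", "age", "pos", "org", "level"],
    ranks_start_at_one=False,
)
print(prompt)
```

The call above reproduces the Razzball prompt; changing the count to 50 and dropping "level" from the columns yields the shape of prompt used for ESPN.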

Setup

First, install llm and trafilatura in a fresh Python environment. I'm using uv, so setup looks something like this:

uv init -p 3.12
uv venv
uv add llm trafilatura

Installation instructions with uv

Once installed, configure llm with your OpenAI API key, pasting it when prompted:

uv run llm keys set openai

Configure llm

Running

Let's start with Razzball, who publish rankings in an article-like format. There isn't structured metadata to use, nor are they publishing in a table, so basic crawling and regex fiddling won't help us. Let's see what an LLM can do.

uv run trafilatura --markdown -u "https://razzball.com/top-100-prospects-for-2025-dynasty-fantasy-baseball/" | uv run llm --system "read this text and create a list of the top prospects mentioned with their name, age, position, team name, and level. be aware that rankings might not start at 1. you must extract all 25 players. output in strict csv format with these columns: rank,name,age,pos,org,level" -m gpt-4.1-nano

Piping markdown version of Razzball's list to OpenAI for list extraction

Running this will fetch Razzball's #76-100 prospects, and enumerate them in the specified CSV format. Sample output:

rank  name            age  pos  org        level
76    Colt Emerson    19   SS   Mariners   A+
77    Jac Caglianone  22   OF   Royals     A+
78    Charlie Condon  21   3B   Rockies    A+
79    Bryce Rainer    19   SS   Tigers     NA
80    Yairo Padilla   17   SS   Cardinals  DSL

Sample output of extraction results from Razzball, formatted by xsv

The Razzball prospect list was correctly extracted, including the rankings starting from 76.
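
Eyeballing the output works for 25 rows, but the same check is easy to automate. Here is a minimal sketch, assuming the model's raw CSV output (before any xsv formatting) is available as a string. The helper check_extraction is a hypothetical name. It confirms the row count and that ranks are consecutive, wherever they start.

```python
import csv
import io


def check_extraction(csv_text, expected_count):
    """Return a list of problems found; an empty list means the output looks complete."""
    rows = list(csv.DictReader(io.StringIO(csv_text.strip())))
    problems = []
    if len(rows) != expected_count:
        problems.append(f"expected {expected_count} rows, got {len(rows)}")
    try:
        ranks = [int(row["rank"]) for row in rows]
    except (KeyError, TypeError, ValueError):
        return problems + ["missing or non-numeric rank column"]
    # Ranks should be consecutive but may start anywhere (76 on this Razzball page).
    for previous, current in zip(ranks, ranks[1:]):
        if current != previous + 1:
            problems.append(f"rank jumps from {previous} to {current}")
    return problems


# The five sample rows shown above, as raw CSV.
sample = """
rank,name,age,pos,org,level
76,Colt Emerson,19,SS,Mariners,A+
77,Jac Caglianone,22,OF,Royals,A+
78,Charlie Condon,21,3B,Rockies,A+
79,Bryce Rainer,19,SS,Tigers,NA
80,Yairo Padilla,17,SS,Cardinals,DSL
"""
print(check_extraction(sample, expected_count=5))   # prints: []
print(check_extraction(sample, expected_count=25))  # prints: ['expected 25 rows, got 5']
```

A non-empty result is a cue to rerun the extraction or inspect the page by hand; the check says nothing about whether individual names or ages are right.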

Let's give the LLM something harder: ESPN's top 50 list. This page is copy-heavy, and has rankings stacked or broken out with blurbs depending on the player.

uv run trafilatura --markdown -u "https://www.espn.com/mlb/story/_/id/45223803/top-50-2025-mlb-prospects-updated-roman-anthony-bubba-chandler-marcelo-mayer" | uv run llm --system "read this text and create a list of the top prospects mentioned with their name, age, position, and team name. you must extract all 50 players. output in strict csv format with these columns: rank,name,age,pos,org" -m gpt-4.1-nano

Piping markdown version of ESPN's list to OpenAI for list extraction

Again, that worked quite well.

rank  name               age  pos  org
1     Roman Anthony      21   OF   Boston Red Sox
2     Bubba Chandler     22   RHP  Pittsburgh Pirates
3     Leodalis De Vries  19   SS   San Diego Padres
4     Sebastian Walcott  19   SS   Texas Rangers
5     Jesus Made         18   SS   Milwaukee Brewers

Sample output of extraction results from ESPN, formatted by xsv

Finally, let's automate Just Baseball, who have a very crawl-ready data table. It would be easy enough to scrape the list with requests-html or lxml, but it's Saturday.

uv run trafilatura --markdown -u "https://www.justbaseball.com/prospects/top-100-mlb-prospects/" | uv run llm --system "read this text and create a list of the top prospects mentioned with their name, age, position, level, and team name. you must extract all 100 players. output in strict csv format with these columns: rank,name,age,pos,org,level. quote the name and position fields." -m gpt-4.1-nano

Piping markdown version of Just Baseball's list to OpenAI for list extraction

1   Roman Anthony      OF     Boston Red Sox      AAA
2   Bubba Chandler     RHP    Pittsburgh Pirates  AAA
3   Leodalis De Vries  SS     San Diego Padres    A+
4   Kevin McGonigle    SS     Detroit Tigers      A+
5   Sebastian Walcott  SS,3B  Texas Rangers       AA

Sample output of extraction results from Just Baseball, formatted by xsv

LLMs sometimes struggle with quoting CSV fields, especially when values contain commas. Be explicit in your prompt if you need quoted output, and always double-check output.
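
One cheap way to double-check is to compare each row's field count against the header's. This is a minimal sketch using Python's csv module; the helper find_misquoted_rows is a hypothetical name, and the rows are taken from the Just Baseball sample. An unquoted "SS,3B" splits into two fields, so the row ends up one field too long.

```python
import csv
import io


def find_misquoted_rows(csv_text):
    """Return (line_number, fields) for rows whose field count differs from the header's."""
    reader = csv.reader(io.StringIO(csv_text.strip()))
    header = next(reader, None)
    if header is None:
        return []
    return [
        (reader.line_num, row)
        for row in reader
        if len(row) != len(header)
    ]


unquoted = """
rank,name,pos,org,level
4,Kevin McGonigle,SS,Detroit Tigers,A+
5,Sebastian Walcott,SS,3B,Texas Rangers,AA
"""
quoted = """
rank,name,pos,org,level
4,"Kevin McGonigle","SS",Detroit Tigers,A+
5,"Sebastian Walcott","SS,3B",Texas Rangers,AA
"""
print(find_misquoted_rows(unquoted))  # flags line 3, which has six fields instead of five
print(find_misquoted_rows(quoted))    # prints: []
```

This catches only rows with the wrong number of fields. A stray comma that is offset by a missing value elsewhere in the same row would slip through.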


At the cost of a couple of minutes and less than $0.01 USD, we now have these lists in a machine-readable format. LLMs are making this almost too easy.

In my application, I've built a single list extraction agent with pydantic-ai. I use its ability to extend system prompts dynamically, injecting custom instructions for output columns and expected list size. Combined with a Pydantic model that validates each record, this gives me a consistent, repeatable output shape.
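
To illustrate the idea of a typed record without pulling in any dependencies, here is a minimal sketch that uses a standard-library dataclass as a stand-in for the Pydantic model. It is not the pydantic-ai agent described above, and the Prospect class and its from_row method are hypothetical. It coerces one CSV row into typed fields and rejects rows that cannot be coerced.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Prospect:
    rank: int
    name: str
    pos: str
    org: str
    age: Optional[int] = None
    level: Optional[str] = None

    @classmethod
    def from_row(cls, row: dict) -> "Prospect":
        """Build a Prospect from one CSV row (a dict of strings), coercing types."""

        def optional(key):
            value = (row.get(key) or "").strip()
            # Treat blanks and "NA" (seen in the Razzball sample) as missing.
            return None if value in ("", "NA") else value

        name = optional("name")
        if name is None:
            raise ValueError("row has no player name")
        age = optional("age")
        return cls(
            rank=int(row["rank"]),  # raises ValueError on a non-numeric rank
            name=name,
            pos=optional("pos") or "",
            org=optional("org") or "",
            age=int(age) if age is not None else None,
            level=optional("level"),
        )


row = {"rank": "79", "name": "Bryce Rainer", "age": "19",
       "pos": "SS", "org": "Tigers", "level": "NA"}
prospect = Prospect.from_row(row)
print(prospect)
```

A Pydantic model would express the same constraints declaratively and produce richer error messages; the dataclass version only shows the shape of the validation.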


Next up: how to normalize and link player records across outlets, so “Roman Anthony, Red Sox CF prospect” means the same thing everywhere.
