Building a Semantic Search API with MySQL Vector Search, Oracle Cloud, and an NBA Kaggle Dataset

Semantic search enables users to ask for “a 3-and-D wing who can guard multiple positions” (as described in The Kings Beat article) and retrieve the right NBA players, rather than a simple keyword match on “3,” “wing,” or “defense.” By combining MySQL Vector Search, Oracle Cloud’s Generative AI embeddings, and a real NBA dataset from Kaggle, you can build a powerful natural-language search API that understands the meaning behind a query. As a sports fan, I thought this would make a great demonstration.

Chicago Bulls vs. Oklahoma City Thunder, NBA match — March 2016.

In this tutorial, you’ll build:

  • A MySQL vector-backed database of NBA players
  • A background ingestion + embedding pipeline that transforms player stats into vector representations using Oracle Cloud
  • A FastAPI semantic-search endpoint that returns the most relevant players for any natural-language description

We’ll use the following tools and datasets:

  • MySQL with VECTOR support — available in recent versions of MySQL such as MySQL 9.x and HeatWave, which include native vector types, indexes, and distance functions
  • Oracle Cloud Infrastructure (OCI) Generative AI — specifically OCI Embeddings, which provides embedding models suitable for semantic search, similarity matching, and Retrieval-Augmented Generation (RAG)
  • NBA Dataset from Kaggle — this tutorial uses FiveThirtyEight’s NBA Elo dataset (nbaallelo.csv, also mirrored on Kaggle), which contains game-by-game team statistics and Elo ratings dating back to the 1940s
  • Python + FastAPI — to build a lightweight API layer that lets users submit queries like “elite rim-protector who can switch onto guards” and receive semantically ranked player results

1. Prerequisites

To follow this tutorial, you’ll need an OCI tenancy with access to OCI Generative AI, including a configured ~/.oci/config profile (such as DEFAULT), your compartment OCID, and the appropriate Generative AI inference endpoint (for example, the Chicago regional endpoint). You’ll also need a MySQL 9.x / MySQL HeatWave on OCI instance with full vector capabilities enabled: specifically, support for the VECTOR data type along with vector functions such as STRING_TO_VECTOR(string) and DISTANCE(vector, vector, 'COSINE'). These allow MySQL to store embeddings efficiently and perform fast semantic similarity searches.
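For reference, a minimal ~/.oci/config profile has the following shape (all values here are placeholders; the key file is the API signing key you generate in the OCI Console):

```ini
[DEFAULT]
user=ocid1.user.oc1..aaaa
fingerprint=12:34:56:78:90:ab:cd:ef:12:34:56:78:90:ab:cd:ef
tenancy=ocid1.tenancy.oc1..aaaa
region=us-chicago-1
key_file=~/.oci/oci_api_key.pem
```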

Local environment:

Python 3.10+ and:

pip install fastapi "uvicorn[standard]" mysql-connector-python pandas python-dotenv oci pytest httpx


Before adding the environment variables, navigate to the root of your project directory; this is where your application code lives and where the configuration file needs to reside. From your terminal, move into the project folder (for example, cd ~/projects/nba-semantic-search) and create a new .env file with touch .env. This .env file will store the MySQL credentials and Oracle Cloud settings that your application needs at runtime, and it will be automatically loaded by python-dotenv when the FastAPI service starts.

Environment variables (e.g. in .env):

MYSQL_HOST=127.0.0.1
MYSQL_USER=root
MYSQL_PASSWORD=yourpassword   ##change value here
MYSQL_DB=nba_semantic 

OCI_CONFIG_PROFILE=DEFAULT
OCI_COMPARTMENT_OCID=ocid1.compartment.oc1..xxxx   ##change value here
OCI_GENAI_ENDPOINT=https://inference.generativeai.us-chicago-1.oci.oraclecloud.com
OCI_EMBED_MODEL_ID=cohere.embed-english-v3.0

From OCI Generative AI Embedding section

To run the project, you’ll need to configure several environment variables that let your application connect to both MySQL and Oracle Cloud. For the database layer, provide the MySQL host address, your username and password, and the name of the schema you created, such as MYSQL_HOST=127.0.0.1, MYSQL_USER=root, MYSQL_PASSWORD=yourpassword, and MYSQL_DB=nba_semantic. For Oracle Cloud, use the same profile defined in your ~/.oci/config file (for example, OCI_CONFIG_PROFILE=DEFAULT), along with your compartment OCID, which you can copy directly from the OCI Console. You’ll also need the regional Generative AI inference endpoint (for example, OCI_GENAI_ENDPOINT=https://inference.generativeai.us-chicago-1.oci.oraclecloud.com) and the ID of the embedding model you want to use, such as OCI_EMBED_MODEL_ID=cohere.embed-english-v3.0, which matches the 1024-dimensional VECTOR column we create later. Together, these variables allow your FastAPI service to authenticate with OCI, call the embedding model, and write vector data into MySQL seamlessly.

You will replace the placeholder values with your real credentials:

  • Your MySQL host, username, password, and database name
  • Your OCI compartment OCID, which you can copy from the OCI Console
  • The OCI Generative AI endpoint and model ID, found in the Generative AI dashboard
  • The OCI config profile, which matches the profile name in your ~/.oci/config file

Once saved, the .env file will be automatically loaded by python-dotenv, allowing your FastAPI application to authenticate with Oracle Cloud and connect to MySQL securely, without hardcoding secrets directly into the code.
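To catch configuration mistakes early, the service can also fail fast at startup when a required variable is missing. A minimal sketch, using the variable names from the .env example above:

```python
import os

# Settings the app expects; names match the .env example above.
REQUIRED = [
    "MYSQL_HOST", "MYSQL_USER", "MYSQL_PASSWORD", "MYSQL_DB",
    "OCI_CONFIG_PROFILE", "OCI_COMPARTMENT_OCID",
    "OCI_GENAI_ENDPOINT", "OCI_EMBED_MODEL_ID",
]


def missing_settings(env):
    """Return the names of required settings that are absent or empty."""
    return [name for name in REQUIRED if not env.get(name)]


# After load_dotenv() runs at startup:
# problems = missing_settings(os.environ)
# if problems:
#     raise RuntimeError(f"Missing settings: {', '.join(problems)}")
```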

2. MySQL schema with VECTOR (using MySQL HeatWave)

To store and query embeddings efficiently, we’ll use MySQL HeatWave, Oracle’s fully managed MySQL service on OCI. HeatWave includes native vector support, meaning it can store high-dimensional embeddings and compute similarity with functions like DISTANCE() directly inside the database. Before creating your table, you must connect to your HeatWave instance.

Connecting to MySQL HeatWave

  1. Sign in to the Oracle Cloud Console and navigate to:
    MySQL → DB Systems (MySQL) → your HeatWave instance. (I created an instance without HA, with a HeatWave cluster, just for this demo; provisioning took about 15–25 minutes.)
  2. In the instance details page:
    • Copy the Public Endpoint (e.g., mysql-xyz.abc123.mysql.database.oraclecloud.com).
    • Copy the admin username you set during creation.
    • Retrieve or set a password for that admin user.

It is essential to note that MySQL HeatWave DB Systems cannot be converted from private to public access after they are created. The networking configuration, specifically the choice of VCN, subnet, and whether the DB System is publicly accessible, is fixed at the time of creation and cannot be modified later. This means that if you deploy a HeatWave instance inside a private subnet, it will expose only a private IP address and cannot accept direct connections from outside the VCN. To allow public connections for development or testing, you must create a new MySQL HeatWave DB System in a public subnet with public access enabled. Alternatively, you can use a bastion host or VPN to reach the private endpoint. This limitation affects how you plan connectivity and is worth considering when setting up any environment that needs external access.

Once you have these details, connect from your terminal using the MySQL client:

mysql -h mysql-your-endpoint.mysql.database.oraclecloud.com \
      -u admin \
      -p

To connect to the MySQL HeatWave instance, make sure your machine can reach the database endpoint.
If you are running in an environment with restricted networking, you can also run the following commands from any compute resource that has access to the same VCN.
The connection string remains the same; simply use the private endpoint of the database.

3. Create the Database Schema

Inside your SSH session (VM → MySQL HeatWave):

SQL>
CREATE DATABASE nba_semantic;
USE nba_semantic;

We’ll create two tables:

  • players: raw player metadata
  • player_embeddings: embedding vectors

MySQL HeatWave’s VECTOR type stores each embedding as a packed array of float32 values; you declare the dimension when creating the column:


-- Drop the embeddings table first (it references players)
DROP TABLE IF EXISTS player_embeddings;

-- Then drop the players table
DROP TABLE IF EXISTS players;

-- Create players table
CREATE TABLE players (
    player_id     INT PRIMARY KEY,
    label         VARCHAR(255),
    team_code     VARCHAR(10),
    franchise     VARCHAR(255),
    season_year   INT,
    game_date     DATE,
    is_playoffs   BOOLEAN,
    pts           INT,
    opp_pts       INT,
    elo_i         DOUBLE,
    elo_n         DOUBLE,
    opp_elo_i     DOUBLE,
    opp_elo_n     DOUBLE,
    game_location ENUM('H','A','N'),
    game_result   ENUM('W','L'),
    forecast      DOUBLE
);

--  Create embeddings table
CREATE TABLE player_embeddings (
    player_id INT PRIMARY KEY,
    embedding VECTOR(1024) NOT NULL,
    FOREIGN KEY (player_id) REFERENCES players(player_id)
);

Now we have a schema ready for vector search.
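When we insert embeddings from Python in the ingestion step, we pass each vector as a JSON-style string and let MySQL convert it with STRING_TO_VECTOR(). A tiny helper, sketched here, shows the expected format:

```python
import json


def to_vector_literal(vec):
    """Serialize a Python list of floats into the '[v1, v2, ...]' string
    that MySQL's STRING_TO_VECTOR() expects."""
    return json.dumps([float(v) for v in vec])


# Usage with mysql-connector-python:
# cur.execute(
#     "INSERT INTO player_embeddings (player_id, embedding) "
#     "VALUES (%s, STRING_TO_VECTOR(%s))",
#     (player_id, to_vector_literal(embedding)),
# )
```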

4. Load the NBA Dataset

On your laptop (or VM), download the dataset:

curl -O https://raw.githubusercontent.com/fivethirtyeight/data/master/nba-elo/nbaallelo.csv

The dataset used in this project originates from FiveThirtyEight’s NBA Elo archive and contains a rich, game-by-game history of professional basketball dating back to the 1940s. Each row represents one team’s perspective on a specific game, including metadata such as the league (lg_id), season and date, and whether it occurred in the playoffs. It provides identifiers for both the team and its opponent (team_id, fran_id, opp_id, opp_fran), along with final scores, pre- and post-game Elo ratings, win probability forecasts, and contextual attributes like home/away status and game result. Even though it is fundamentally a game-level dataset, we transform it into a structure that can support semantic search over player-like entities by extracting team/franchise information and generating natural-language descriptions from those fields. This allows MySQL Vector Search and OCI Embeddings to treat each row as a “profile” that can be embedded and queried semantically, even though the original dataset was not player-centric.
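Before loading the file into MySQL, it is worth sanity-checking the download by listing the header and counting rows. A small sketch with the standard library:

```python
import csv


def csv_summary(lines):
    """Return (header, data_row_count) for an iterable of CSV lines."""
    reader = csv.reader(lines)
    header = next(reader)
    count = sum(1 for _ in reader)
    return header, count


# Usage:
# with open("nbaallelo.csv", newline="") as f:
#     header, count = csv_summary(f)
# The header should match the column list used in LOAD DATA below,
# and the row count is twice the number of games (one row per team, per game).
```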

Upload the CSV to the VM (optionally, you can first create a processed CSV containing only the rows and columns you need):

scp -i ssh-key-2025-11-16.key -P 2222 nbaallelo.csv opc@127.0.0.1:/home/opc/

Inside MySQL:


 mysql --local-infile=1 -h 10.0.0.253  -u bruno -p

-- Pick the database
USE nba_semantic;   

-- Load the CSV

LOAD DATA LOCAL INFILE '/home/opc/nbaallelo.csv'
     INTO TABLE players
     FIELDS TERMINATED BY ','
     ENCLOSED BY '"'
     IGNORE 1 ROWS
     (
         @gameorder,
         @game_id,
         @lg_id,
         @iscopy,
         season_year,
         @date_game,
         @seasongame,
         is_playoffs,
         team_code,
         franchise,
         pts,
         elo_i,
         elo_n,
         @win_equiv,
         @opp_id,
         @opp_fran,
         opp_pts,
         opp_elo_i,
         opp_elo_n,
         game_location,
         game_result,
         forecast,
         @notes
     )
     SET
         player_id = CAST(@gameorder AS UNSIGNED),
         game_date = STR_TO_DATE(@date_game, '%m/%d/%Y'),
         label = CONCAT(franchise, ' (', season_year, ')');
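One detail worth noting: nbaallelo.csv contains two rows per game (one per team) that share the same gameorder value, which we cast into the players PRIMARY KEY. Because LOAD DATA with LOCAL treats duplicate-key errors as warnings (the same behavior as specifying IGNORE), the second row of each game is skipped, leaving one row per game in the table. The effect mirrors a keep-first deduplication:

```python
def first_per_key(rows, key):
    """Keep only the first row seen for each key value, mirroring
    duplicate-key IGNORE behavior on the players PRIMARY KEY."""
    seen = set()
    kept = []
    for row in rows:
        k = row[key]
        if k not in seen:
            seen.add(k)
            kept.append(row)
    return kept


# Two perspectives of the same game collapse to one row:
# first_per_key(
#     [{"gameorder": 1, "team": "TRH"}, {"gameorder": 1, "team": "NYK"}],
#     "gameorder",
# )
```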

Generate and store embeddings from Python

From your local laptop (where your code repo and .env live, not the VM), you’ll:

  1. Connect to MySQL (through the bastion / SSH tunnel you already set up).
  2. Read players from the players table.
  3. Call OCI Embeddings.
  4. Insert into player_embeddings.

This is a self-contained script we can drop into our project as ingest_embeddings.py (adjust the values to match your environment):

import os
import json
import mysql.connector
from dotenv import load_dotenv
import oci

from oci.generative_ai_inference import GenerativeAiInferenceClient
from oci.generative_ai_inference.models import (
    EmbedTextDetails,
    OnDemandServingMode,
)

# ---------------------------------------------------------------------
# Load environment variables (.env)
# ---------------------------------------------------------------------
load_dotenv()

# MySQL config
MYSQL_HOST = os.getenv("MYSQL_HOST", "127.0.0.1")
# If you are using an SSH tunnel, set MYSQL_PORT in .env (e.g., 3307)
MYSQL_PORT = int(os.getenv("MYSQL_PORT", "3306"))
MYSQL_USER = os.getenv("MYSQL_USER", "youruser")  # in my case, "bruno"
MYSQL_PASSWORD = os.getenv("MYSQL_PASSWORD", "yourpasswordhere")
MYSQL_DB = os.getenv("MYSQL_DB", "nba_semantic")

# OCI config
OCI_CONFIG_PROFILE = os.getenv("OCI_CONFIG_PROFILE", "DEFAULT")
OCI_EMBED_MODEL_ID = os.getenv(
    "OCI_EMBED_MODEL_ID",
    "cohere.embed-english-v3.0",  # override in .env if needed
)
OCI_GENAI_ENDPOINT = os.getenv("OCI_GENAI_ENDPOINT")
OCI_COMPARTMENT_OCID = os.getenv("OCI_COMPARTMENT_OCID")

if not OCI_GENAI_ENDPOINT:
    raise RuntimeError("OCI_GENAI_ENDPOINT is not set. Put it in your .env file.")

if not OCI_COMPARTMENT_OCID:
    raise RuntimeError("OCI_COMPARTMENT_OCID is not set. Put it in your .env file.")

# ---------------------------------------------------------------------
# MySQL connection
# ---------------------------------------------------------------------
cnx = mysql.connector.connect(
    host=MYSQL_HOST,
    port=MYSQL_PORT,
    user=MYSQL_USER,
    password=MYSQL_PASSWORD,
    database=MYSQL_DB,
)
cnx.autocommit = True
cur = cnx.cursor(dictionary=True)

# ---------------------------------------------------------------------
# OCI Generative AI client
# ---------------------------------------------------------------------
config = oci.config.from_file(
    os.path.expanduser("~/.oci/config"),
    OCI_CONFIG_PROFILE,
)

genai_client = GenerativeAiInferenceClient(
    config=config,
    service_endpoint=OCI_GENAI_ENDPOINT,
)

# ---------------------------------------------------------------------
# Helpers
# ---------------------------------------------------------------------
def get_players(limit=None):
    """
    Fetch rows from the players table.
    Each row represents a team-game that we treat as a 'player-like' entity.
    """
    sql = """
        SELECT
            player_id,
            label,
            team_code,
            franchise,
            season_year,
            game_date,
            is_playoffs,
            game_location,
            game_result,
            pts,
            opp_pts,
            elo_i,
            elo_n,
            opp_elo_i,
            opp_elo_n,
            forecast
        FROM players
    """
    if limit:
        sql += " LIMIT %s"
        cur.execute(sql, (limit,))
    else:
        cur.execute(sql)
    return cur.fetchall()


def embed_batch(texts):
    """
    Call OCI Generative AI embedding endpoint on a batch of texts.

    `texts` must be a List[str].
    """
    if not texts:
        return []

    texts = [str(t) for t in texts]

    serving_mode = OnDemandServingMode(
        model_id=OCI_EMBED_MODEL_ID,
    )

    details = EmbedTextDetails(
        inputs=texts,
        serving_mode=serving_mode,
        compartment_id=OCI_COMPARTMENT_OCID,
    )

    resp = genai_client.embed_text(details)

    # resp.data.embeddings is already List[List[float]]
    return resp.data.embeddings


def insert_embeddings(rows, vectors):
    """
    Insert or update embeddings in player_embeddings.

    We store JSON and let MySQL convert it with STRING_TO_VECTOR().
    """
    sql = """
        INSERT INTO player_embeddings (player_id, embedding)
        VALUES (%s, STRING_TO_VECTOR(%s))
        ON DUPLICATE KEY UPDATE embedding = VALUES(embedding)
    """

    data = [
        (row["player_id"], json.dumps(vec))
        for row, vec in zip(rows, vectors)
    ]

    cur.executemany(sql, data)
    print(f"Inserted/updated {cur.rowcount} embeddings")


def main(batch_size=64, limit=None):
    players = get_players(limit)
    print(f"Total players to embed: {len(players)}")

    for i in range(0, len(players), batch_size):
        batch = players[i : i + batch_size]

        texts = []
        for p in batch:
            # Map codes to human-readable text
            location = (
                "home" if p["game_location"] == "H"
                else "away" if p["game_location"] == "A"
                else "neutral site"
            )
            season_type = "playoff game" if p["is_playoffs"] else "regular-season game"
            result = "win" if p["game_result"] == "W" else "loss"

            # Nice natural-language description for semantic embeddings
            text = (
                f"{p['label']} — {p['franchise']} ({p['team_code']}). "
                f"{season_type} in {p['season_year']}, {location}, "
                f"{result}, final score {p['pts']}–{p['opp_pts']}. "
                f"Elo change {p['elo_i']} to {p['elo_n']}, "
                f"opponent Elo {p['opp_elo_i']} to {p['opp_elo_n']}, "
                f"pre-game win probability {p['forecast']:.3f}."
            )
            texts.append(text)

        vectors = embed_batch(texts)
        insert_embeddings(batch, vectors)

        print(f"Processed {i + len(batch)} / {len(players)}")


# ---------------------------------------------------------------------
# Entry point
# ---------------------------------------------------------------------
if __name__ == "__main__":
    # You can pass limit for quick tests, e.g. main(limit=100)
    main()

Run it:

python ingest_embeddings.py
(base)nba-semantic-search % python ingest_embeddings.py
Total players to embed: 63157
Inserted/updated 64 embeddings
Processed 64 / 63157
Inserted/updated 64 embeddings
Processed 128 / 63157
Inserted/updated 64 embeddings
Processed 192 / 63157
Inserted/updated 64 embeddings
Processed 256 / 63157
...
Inserted/updated 53 embeddings
Processed 63157 / 63157

This might take a few minutes depending on batch size and OCI latency. When done, verify:


mysql> USE nba_semantic;   
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A

Database changed
mysql> 
mysql>  SELECT COUNT(*) FROM player_embeddings;
+----------+
| COUNT(*) |
+----------+
|    63157 |
+----------+
1 row in set (0.01 sec)

Running a semantic search directly in MySQL

With embeddings stored, we can now:

  1. Embed the query text in Python using the same OCI model
  2. Use MySQL’s DISTANCE() to rank players by cosine similarity

5. SQL template

In MySQL, the ranking query will look like:

SET @q := (
    SELECT embedding
    FROM player_embeddings
    WHERE player_id = 1
);

SELECT
    p.player_id,
    p.label,
    p.franchise,
    p.team_code,
    p.season_year,
    p.game_result,
    DISTANCE(e.embedding, @q, 'COSINE') AS score
FROM player_embeddings e
JOIN players p USING (player_id)
ORDER BY score
LIMIT 10;

The query above retrieves the ten most semantically similar games to the one represented by player_id = 1, using MySQL’s native vector search with cosine similarity. In this dataset, each row represents a single NBA team’s performance in a specific game, encoded into a 1024-dimension embedding generated with OCI Generative AI. Because player_id = 1 corresponds to a Toronto Huskies game from 1947, the model identifies other Huskies games from the same season as the closest matches. This makes sense: these entries share the same franchise, the same roster and style of play, the same era, and similar statistical patterns, so their text descriptions produced similar embeddings. The resulting list shows nearly identical “Huskies (1947)” labels, all with very small cosine distances, demonstrating that the vector model captures meaningful similarity based on team identity, season context, and game characteristics.

6. More Semantic Search Examples

Once embeddings are stored in MySQL HeatWave, you can run a wide variety of semantic searches. Because each row represents a natural-language description of a game, the search engine understands contextual meaning such as team strength, season, game type, location, and outcome.

Below we explore several realistic scenarios.

6.1 Example: “Elite rim protector who can switch onto guards”

One powerful aspect of semantic search is that it allows users to express what they are looking for in natural language, even when the underlying dataset doesn’t contain those exact terms. To illustrate this, consider the following query:

“elite rim protector who can switch onto guards”

This is not a traditional statistical request (e.g., rebounds per game, blocks, or defensive rating). Instead, it’s a high-level basketball concept that blends multiple defensive skills:

  • Rim protection — shot-blocking, help defense near the basket
  • Switchability — the ability for bigs to guard smaller, faster players
  • Versatility — holding up in mismatches created by pick-and-rolls

Although the NBA Kaggle dataset does not explicitly label players with qualitative traits like “elite rim protector,” the descriptions embedded for each row still capture contextual patterns around defensive dominance, paint protection, and matchups with guards. Because the query and dataset rows share the same embedding model (cohere.embed-english-v3.0), MySQL can compute cosine similarity and retrieve records that are conceptually aligned with the query.
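A minimal sketch of semantic_query.py, assuming the same .env settings and client setup as ingest_embeddings.py (the output format is illustrative):

```python
# semantic_query.py (sketch): embed the query text with the same OCI model
# used at ingestion time, then rank rows with MySQL's DISTANCE().
import json
import os
import sys

# Ranking query: smaller cosine distance means a closer semantic match.
SEARCH_SQL = """
    SELECT
        p.player_id,
        p.label,
        p.franchise,
        p.season_year,
        DISTANCE(e.embedding, STRING_TO_VECTOR(%s), 'COSINE') AS score
    FROM player_embeddings e
    JOIN players p USING (player_id)
    ORDER BY score
    LIMIT %s
"""


def embed_query(client, text, model_id, compartment_id):
    """Embed a single query string with OCI Generative AI; returns list[float]."""
    # Imported here so the module loads even without the OCI SDK installed.
    from oci.generative_ai_inference.models import (
        EmbedTextDetails,
        OnDemandServingMode,
    )

    details = EmbedTextDetails(
        inputs=[text],
        serving_mode=OnDemandServingMode(model_id=model_id),
        compartment_id=compartment_id,
    )
    return client.embed_text(details).data.embeddings[0]


def main(query, top_k=10):
    import mysql.connector
    import oci
    from dotenv import load_dotenv
    from oci.generative_ai_inference import GenerativeAiInferenceClient

    load_dotenv()
    config = oci.config.from_file(
        os.path.expanduser("~/.oci/config"),
        os.getenv("OCI_CONFIG_PROFILE", "DEFAULT"),
    )
    client = GenerativeAiInferenceClient(
        config=config,
        service_endpoint=os.environ["OCI_GENAI_ENDPOINT"],
    )
    vec = embed_query(
        client,
        query,
        os.getenv("OCI_EMBED_MODEL_ID", "cohere.embed-english-v3.0"),
        os.environ["OCI_COMPARTMENT_OCID"],
    )

    cnx = mysql.connector.connect(
        host=os.getenv("MYSQL_HOST", "127.0.0.1"),
        port=int(os.getenv("MYSQL_PORT", "3306")),
        user=os.environ["MYSQL_USER"],
        password=os.environ["MYSQL_PASSWORD"],
        database=os.getenv("MYSQL_DB", "nba_semantic"),
    )
    cur = cnx.cursor(dictionary=True)
    cur.execute(SEARCH_SQL, (json.dumps(vec), top_k))
    for row in cur.fetchall():
        print(f"{row['score']:.4f}  {row['label']}")


if __name__ == "__main__" and len(sys.argv) > 1:
    main(" ".join(sys.argv[1:]))
```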

Running the Python script:

python semantic_query.py "elite rim protector who can switch onto guards"

Even though the dataset consists of game-level entries rather than player scouting reports, the vector embeddings encode linguistic context well enough that the query still retrieves games associated with strong defensive performance, rim protection, and matchups involving guards.

This example showcases the real strength of combining OCI’s embedding models with MySQL’s native vector search: you can query your structured data using human concepts, and MySQL will return the most semantically relevant rows — even when the dataset doesn’t contain those words at all.

6.2 Example: “underrated playmaker with high basketball IQ but low usage rate”

Semantic search shines brightest when the query mixes intangibles, advanced concepts, and implicit statistical context. Consider a second example:

“underrated playmaker with high basketball IQ but low usage rate”

This is the type of phrase a scout, analyst, or coach might use — but it’s not something that appears verbatim in any row of the NBA dataset.
The query combines several abstract qualities:

  • underrated — subjective, depends on context
  • playmaker — someone who creates opportunities for teammates
  • high basketball IQ — a measure of decision-making and awareness
  • low usage rate — low volume but high efficiency or impact

These ideas may be reflected indirectly in game summaries or statistical combinations (assists, turnovers, team effects), but never as explicit text. And yet, because both the game descriptions and the query are converted into embeddings using the same OCI model, MySQL can still retrieve the most semantically similar game entries:

python semantic_query.py "underrated playmaker with high basketball IQ but low usage rate"

Although the dataset does not explicitly contain terms like “underrated,” “high IQ,” or “low usage rate,” the embedding model captures the semantic structure behind these concepts by analyzing relationships in the textual descriptions associated with each player or game entry.

The results above show the top-ranked entries whose embeddings are closest to the query vector. These rows likely correspond to:

  • secondary playmakers who contribute through smart passing
  • efficient decision-makers who don’t dominate the ball
  • players who impact the flow of the game without high shot volume
  • possessions describing intelligent ball movement or reading defenses

In other words, the system is not searching for literal matches like “playmaker” or “usage rate”; instead, it matches conceptual meaning. Entries with similar linguistic patterns, statistical combinations, or contextual descriptions surface to the top because their embeddings occupy nearby regions in vector space.

This example demonstrates an important advantage of semantic search:
you can express nuanced basketball evaluations in natural language, and MySQL’s vector engine—paired with OCI’s embeddings—will still return highly relevant results even when the dataset does not contain those words at all.

Conclusion

Building a semantic search API by combining MySQL HeatWave Vector Search, OCI Generative AI Embeddings, and a rich NBA historical dataset demonstrates how modern data systems can bridge the gap between structured statistics and human-level basketball reasoning. Even though the FiveThirtyEight dataset contains no scouting reports, no player attributes, and no linguistic descriptors, the system still understands natural-language queries such as “elite rim protector who can switch onto guards” or “underrated playmaker with high basketball IQ but low usage rate.”

By transforming each game entry into a natural-language description and embedding it using Cohere’s English embedding model on OCI, we allow MySQL to perform true semantic similarity search directly in the database using native VECTOR columns and cosine distance. This architecture shows how organizations can build retrieval systems that answer conceptual, subjective, or tactical basketball questions without requiring deeply engineered features or manually curated labels.

Ultimately, this project highlights the growing power and simplicity of vector databases: with only a few hundred lines of Python and SQL, it becomes possible to build a production-ready, concept-aware search engine that works across tens of thousands of records and supports any natural-language query. Whether used for analytics, scouting, recommendation engines, or exploratory research, semantic vector search offers a far more intuitive way to interact with data than traditional keyword or numeric filtering ever could.

References

Cohere. (2024). Cohere Embeddings Documentation. https://docs.cohere.com

FiveThirtyEight. (n.d.). NBA Elo Dataset. GitHub repository. Retrieved from https://github.com/fivethirtyeight/data/tree/master/nba-elo

MySQL. (2024). MySQL 9.0 Reference Manual: Vector Datatype and Vector Functions. Oracle Corporation. https://dev.mysql.com/doc/

Oracle. (2024). Oracle Cloud Infrastructure Generative AI—Embeddings API Documentation. https://docs.oracle.com/en-us/iaas/Content/generative-ai-inference/home.htm

Python Software Foundation. (2024). mysql-connector-python Documentation. https://dev.mysql.com/doc/connector-python/en/
