Visualising AI Embeddings in APEX

"To deal with hyper-planes in a 14-dimensional space, visualize a 3-D space and say 'fourteen' to yourself very loudly. Everyone does it."   

- Geoffrey Hinton, 2018 Turing Award winner.




Within the wonderful world of generative AI, one concept that is all the rage is RAG, or Retrieval Augmented Generation: an AI framework that combines the strengths of traditional information retrieval systems (such as databases) with the capabilities of generative large language models (LLMs).


RAG's goal is to improve the accuracy, relevance, and timeliness of LLM-generated content - such as documents, text and images.


When creating a RAG system, it’s essential to store information in a format that an LLM can retrieve.  This is where data is converted into embeddings through a process that uses pre-trained machine learning models, typically based on neural networks, to transform raw data (documents, text or images) into high-dimensional vectors.  These vector embeddings describe the semantic meaning behind our raw data and translate the semantic similarity of objects into proximity in a mathematical vector space.
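The idea that semantic similarity becomes proximity can be seen with a toy example.  The vectors below are made up purely for illustration (real embeddings from a model like all-MiniLM-L6-v2 have 384 dimensions), but the cosine-similarity arithmetic is exactly the same:

```python
import math

# Hypothetical 4-dimensional "embeddings" - invented values, not real
# model output.  Real embeddings have hundreds of dimensions.
fox    = [0.9, 0.1, 0.8, 0.2]
wolf   = [0.8, 0.2, 0.9, 0.1]
salmon = [0.1, 0.9, 0.2, 0.8]

def cosine_similarity(a, b):
    """1.0 means the vectors point the same way; values near 0 mean unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(x * x for x in b))
    return dot / norm

print(cosine_similarity(fox, wolf))    # high: semantically close animals
print(cosine_similarity(fox, salmon))  # lower: semantically distant
```

Because similar items end up pointing in similar directions, nearby vectors correspond to semantically related data - which is exactly the property the 3D scatter plot later in this post makes visible.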


Embeddings are incredibly powerful, but given their high dimensionality (with vector sizes typically between 384 and 4096, depending on the size of the model) they can be hard for humans to comprehend.

This is where dimension reduction comes in.


Dimensionality reduction techniques are quantitative methods that simplify information by mapping it from a higher-dimensional space to a smaller set of dimensions.  Although there is a trade-off between complexity and interpretability, compressing our embeddings into two or three dimensions makes it possible to visualise that semantic 'proximity' within our data.
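As a rough sketch of what this reduction step does, here's a minimal PCA in Python using NumPy and random stand-in data (the real pipeline later in this post uses the ml-pca Node package against actual embeddings):

```python
import numpy as np

rng = np.random.default_rng(42)
# Stand-in for a table of embeddings: 100 rows x 384 dimensions
# (384 being the size all-MiniLM-L6-v2 produces).
embeddings = rng.normal(size=(100, 384))

def pca_reduce(X, n_components=3):
    """Project rows of X onto their top principal components."""
    X_centered = X - X.mean(axis=0)           # PCA requires centred data
    # SVD of the centred matrix; rows of Vt are the principal axes,
    # ordered by how much variance they capture
    U, S, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T   # coordinates in the reduced space

points_3d = pca_reduce(embeddings)
print(points_3d.shape)  # (100, 3)
```

Each 384-dimensional row becomes an (x, y, z) point that can be plotted, with the first component capturing the most variance in the data.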



So how do Oracle 23ai and APEX fit into all this?   


Earlier in 2024, Oracle released 23ai on OCI.  This version of the database includes Oracle AI Vector Search - a rich set of features to enable semantic search, underpinned by the new VECTOR data type.  With native support for vector embeddings now available within the database, getting started with RAG is pretty straightforward, and using the latest release of APEX on OCI with a 23ai database enables the AI Vector Search capabilities to be utilised natively.   


This opens up many possibilities with regard to generative AI, but what I particularly wanted to explore in more detail were ways of visualising those vector embeddings to help identify anomalies in the semantic similarity of data, or where there appears to be 'hallucination'.


So far, I've managed to create a process (which I'll go into in more detail below) to:

  • Create vector embeddings for uploaded datasets using the all-MiniLM-L6-v2 sentence transformer model loaded into the database.
  • Use Node.js to carry out Principal Component Analysis (PCA) dimensionality reduction to reduce the (384) dimensions in the vector embeddings down to just 3.
  • Utilise Plotly, a third-party (Python & JavaScript) graphing library, in APEX to visualise those 3D vectors.


Here's the app if you want to take a look (username/password: demo/demo): Vector Visualisation



3D Scatter Plot


I say 'app', but currently it's just a single page, comprising an Interactive Report and a region containing an interactive 3D scatter plot.  Underpinning this is a small (1,500+ row) dataset of animals which I obtained from Hugging Face here.  If you change the filters on the Interactive Report region, the 3D scatter plot refreshes with a marker for each animal row from the IR, drawn at the position given by the (3-dimensional) vector stored with that row.  You can flick the switch to turn on labels showing the name of the animal each marker represents; just note that labels take a short while to load (something I'm working on improving).  


Anyway, here are the full instructions to get this all to work.  It's a bit of a long post, so strap yourself in!





1. Set up a 23ai database environment

The easiest way of getting started is to get yourself an Oracle OCI always free Autonomous Database (ADB).  When creating a new ADB choose the 'Transaction Processing' (TP) workload type.  


OCI ADB Workload Types


The 'APEX' workload type ADB has one limitation compared to the 'TP' type: you can't create a connection to the database, which is needed to complete the dimensionality reduction task below.  (One advantage the 'APEX' type ADB does have over the 'TP' type is that no additional steps are needed to expose the APEX URL to the internet, although it's pretty straightforward to set up an OCI Virtual Cloud Network to handle this.)


When setting up a new TP ADB, you will create an ADMIN user (there's no SYS access in OCI) which you will need in section 4 below.  You will also need to create a new schema to hold all the other objects; this can be done when setting up the APEX environment, along with a workspace to hold the APEX app.


Once the TP Autonomous Database is set up, a one-way TLS (Transport Layer Security) connection will need to be configured so the dimensionality reduction Node.js script can connect.  Follow the instructions here to set this up.




2. Source a Dataset and Upload

As mentioned earlier, I used the animals dataset from Hugging Face as the basis for my demo, as it contains a lot of different classification attributes for each animal, although a bit of work was needed first so the model could understand some context around the individual characteristics prior to computing the vector embeddings. 


As the dataset was so small I simply used the APEX 'Load Data' utility under 'SQL Workshop -> Utilities -> Data Workshop' to upload it to the database.  I called the table 'ANIMALS' (created in my default schema) and once it was created I added two VECTOR-type columns to hold the high- and low-dimension embeddings:


Vector Columns

Here's the table create script if you want to load the data in a different way:

      
  CREATE TABLE "ANIMALS" 
   ("ID" NUMBER NOT NULL ENABLE, 
	"NAME" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"TAXONOMY_KINGDOM" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"TAXONOMY_PHYLUM" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"TAXONOMY_CLASS" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"TAXONOMY_ORDER" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"TAXONOMY_FAMILY" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"TAXONOMY_GENUS" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"TAXONOMY_SCIENTIFIC_NAME" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"LOCATIONS_0" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_PREY" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_NAME_OF_YOUNG" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_GROUP_BEHAVIOR" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_ESTIMATED_POPULATION_SIZE" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_BIGGEST_THREAT" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_MOST_DISTINCTIVE_FEATURE" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_OTHER_NAME_S_" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_GESTATION_PERIOD" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_HABITAT" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_DIET" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_AVERAGE_LITTER_SIZE" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_LIFESTYLE" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_COMMON_NAME" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_NUMBER_OF_SPECIES" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_LOCATION" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_SLOGAN" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_GROUP" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_COLOR" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_SKIN_TYPE" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_TOP_SPEED" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_LIFESPAN" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_WEIGHT" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_LENGTH" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_AGE_OF_SEXUAL_MATURITY" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_AGE_OF_WEANING" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_LITTER_SIZE" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_PREDATORS" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_TYPE" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_HEIGHT" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_DISTINCTIVE_FEATURE" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_TEMPERAMENT" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_TRAINING" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"LOCATIONS_1" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"LOCATIONS_2" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"LOCATIONS_3" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_WINGSPAN" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_INCUBATION_PERIOD" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_AGE_OF_FLEDGLING" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_AVERAGE_CLUTCH_SIZE" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_VENOMOUS" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_AGGRESSION" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_MAIN_PREY" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_WATER_TYPE" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_AGE_OF_INDEPENDENCE" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_AVERAGE_SPAWN_SIZE" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_NESTING_LOCATION" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_AGE_OF_MOLTING" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_FAVORITE_FOOD" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_ORIGIN" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"LOCATIONS_4" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_MIGRATORY" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_OPTIMUM_PH_LEVEL" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"LOCATIONS_5" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"LOCATIONS_6" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"LOCATIONS_7" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"CHARACTERISTICS_SPECIAL_FEATURES" VARCHAR2(500) COLLATE "USING_NLS_COMP", 
	"LOCATIONS_8" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"LOCATIONS_9" VARCHAR2(100) COLLATE "USING_NLS_COMP", 
	"VECTOR" VECTOR, 
	"VECTOR3D" VECTOR
   )  DEFAULT COLLATION "USING_NLS_COMP" SEGMENT CREATION IMMEDIATE 
  PCTFREE 10 PCTUSED 40 INITRANS 10 MAXTRANS 255 
 NOCOMPRESS LOGGING
  STORAGE(INITIAL 65536 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0 FREELISTS 1 FREELIST GROUPS 1
  BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)
  TABLESPACE "DATA" 
 VECTOR ("VECTOR") STORE AS (
  TABLESPACE "DATA"  CHUNK 8192 RETENTION MIN 1800
  STORAGE(INITIAL 262144 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0
  BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)) 
 VECTOR ("VECTOR3D") STORE AS (
  TABLESPACE "DATA"  CHUNK 8192 RETENTION MIN 1800
  STORAGE(INITIAL 262144 NEXT 1048576 MINEXTENTS 1 MAXEXTENTS 2147483645
  PCTINCREASE 0
  BUFFER_POOL DEFAULT FLASH_CACHE DEFAULT CELL_FLASH_CACHE DEFAULT)) ;







3. Create a View on the Table

I said in the previous section that some additional work on the animals dataset was needed before creating the embeddings.  In the view below, I've concatenated the different animal characteristics with some explanatory text so the sentences feel more naturally formed, which helps the model with context preservation.  Sending full English sentences allows the model to understand the context and relationships between words, leading to richer and more meaningful embeddings.  Another approach would be to send the individual data fields separately, but with this dataset I preferred the sentence method to avoid context loss.

Create the view in the default schema.

    
CREATE OR REPLACE FORCE EDITIONABLE VIEW "ANIMALS_VW" ("ID", "NAME", "TAXONOMY_KINGDOM", "TAXONOMY_PHYLUM", "TAXONOMY_CLASS", "TAXONOMY_ORDER", "TAXONOMY_FAMILY", "TAXONOMY_GENUS", "TAXONOMY_SCIENTIFIC_NAME", "LOCATIONS_0", "CHARACTERISTICS_PREY", "CHARACTERISTICS_NAME_OF_YOUNG", "CHARACTERISTICS_GROUP_BEHAVIOR", "CHARACTERISTICS_ESTIMATED_POPULATION_SIZE", "CHARACTERISTICS_BIGGEST_THREAT", "CHARACTERISTICS_MOST_DISTINCTIVE_FEATURE", "CHARACTERISTICS_OTHER_NAME_S_", "CHARACTERISTICS_GESTATION_PERIOD", "CHARACTERISTICS_HABITAT", "CHARACTERISTICS_DIET", "CHARACTERISTICS_AVERAGE_LITTER_SIZE", "CHARACTERISTICS_LIFESTYLE", "CHARACTERISTICS_COMMON_NAME", "CHARACTERISTICS_NUMBER_OF_SPECIES", "CHARACTERISTICS_LOCATION", "CHARACTERISTICS_SLOGAN", "CHARACTERISTICS_GROUP", "CHARACTERISTICS_COLOR", "CHARACTERISTICS_SKIN_TYPE", "CHARACTERISTICS_TOP_SPEED", "CHARACTERISTICS_LIFESPAN", "CHARACTERISTICS_WEIGHT", "CHARACTERISTICS_LENGTH", "CHARACTERISTICS_AGE_OF_SEXUAL_MATURITY", "CHARACTERISTICS_AGE_OF_WEANING", "CHARACTERISTICS_LITTER_SIZE", "CHARACTERISTICS_PREDATORS", "CHARACTERISTICS_TYPE", "CHARACTERISTICS_HEIGHT", "CHARACTERISTICS_DISTINCTIVE_FEATURE", "CHARACTERISTICS_TEMPERAMENT", "CHARACTERISTICS_TRAINING", "LOCATIONS_1", "LOCATIONS_2", "LOCATIONS_3", "CHARACTERISTICS_WINGSPAN", "CHARACTERISTICS_INCUBATION_PERIOD", "CHARACTERISTICS_AGE_OF_FLEDGLING", "CHARACTERISTICS_AVERAGE_CLUTCH_SIZE", "CHARACTERISTICS_VENOMOUS", "CHARACTERISTICS_AGGRESSION", "CHARACTERISTICS_MAIN_PREY", "CHARACTERISTICS_WATER_TYPE", "CHARACTERISTICS_AGE_OF_INDEPENDENCE", "CHARACTERISTICS_AVERAGE_SPAWN_SIZE", "CHARACTERISTICS_NESTING_LOCATION", "CHARACTERISTICS_AGE_OF_MOLTING", "CHARACTERISTICS_FAVORITE_FOOD", "CHARACTERISTICS_ORIGIN", "LOCATIONS_4", "CHARACTERISTICS_MIGRATORY", "CHARACTERISTICS_OPTIMUM_PH_LEVEL", "LOCATIONS_5", "LOCATIONS_6", "LOCATIONS_7", "CHARACTERISTICS_SPECIAL_FEATURES", "LOCATIONS_8", "LOCATIONS_9", "VECTOR", "VECTOR3D", "DESCRIPTION") DEFAULT COLLATION "USING_NLS_COMP"  AS 
  select a."ID",a."NAME",a."TAXONOMY_KINGDOM",a."TAXONOMY_PHYLUM",a."TAXONOMY_CLASS",a."TAXONOMY_ORDER",a."TAXONOMY_FAMILY",a."TAXONOMY_GENUS",a."TAXONOMY_SCIENTIFIC_NAME",a."LOCATIONS_0",a."CHARACTERISTICS_PREY",a."CHARACTERISTICS_NAME_OF_YOUNG",a."CHARACTERISTICS_GROUP_BEHAVIOR",a."CHARACTERISTICS_ESTIMATED_POPULATION_SIZE",a."CHARACTERISTICS_BIGGEST_THREAT",a."CHARACTERISTICS_MOST_DISTINCTIVE_FEATURE",a."CHARACTERISTICS_OTHER_NAME_S_",a."CHARACTERISTICS_GESTATION_PERIOD",a."CHARACTERISTICS_HABITAT",a."CHARACTERISTICS_DIET",a."CHARACTERISTICS_AVERAGE_LITTER_SIZE",a."CHARACTERISTICS_LIFESTYLE",a."CHARACTERISTICS_COMMON_NAME",a."CHARACTERISTICS_NUMBER_OF_SPECIES",a."CHARACTERISTICS_LOCATION",a."CHARACTERISTICS_SLOGAN",a."CHARACTERISTICS_GROUP",a."CHARACTERISTICS_COLOR",a."CHARACTERISTICS_SKIN_TYPE",a."CHARACTERISTICS_TOP_SPEED",a."CHARACTERISTICS_LIFESPAN",a."CHARACTERISTICS_WEIGHT",a."CHARACTERISTICS_LENGTH",a."CHARACTERISTICS_AGE_OF_SEXUAL_MATURITY",a."CHARACTERISTICS_AGE_OF_WEANING",a."CHARACTERISTICS_LITTER_SIZE",a."CHARACTERISTICS_PREDATORS",a."CHARACTERISTICS_TYPE",a."CHARACTERISTICS_HEIGHT",a."CHARACTERISTICS_DISTINCTIVE_FEATURE",a."CHARACTERISTICS_TEMPERAMENT",a."CHARACTERISTICS_TRAINING",a."LOCATIONS_1",a."LOCATIONS_2",a."LOCATIONS_3",a."CHARACTERISTICS_WINGSPAN",a."CHARACTERISTICS_INCUBATION_PERIOD",a."CHARACTERISTICS_AGE_OF_FLEDGLING",a."CHARACTERISTICS_AVERAGE_CLUTCH_SIZE",a."CHARACTERISTICS_VENOMOUS",a."CHARACTERISTICS_AGGRESSION",a."CHARACTERISTICS_MAIN_PREY",a."CHARACTERISTICS_WATER_TYPE",a."CHARACTERISTICS_AGE_OF_INDEPENDENCE",a."CHARACTERISTICS_AVERAGE_SPAWN_SIZE",a."CHARACTERISTICS_NESTING_LOCATION",a."CHARACTERISTICS_AGE_OF_MOLTING",a."CHARACTERISTICS_FAVORITE_FOOD",a."CHARACTERISTICS_ORIGIN",a."LOCATIONS_4",a."CHARACTERISTICS_MIGRATORY",a."CHARACTERISTICS_OPTIMUM_PH_LEVEL",a."LOCATIONS_5",a."LOCATIONS_6",a."LOCATIONS_7",a."CHARACTERISTICS_SPECIAL_FEATURES",a."LOCATIONS_8",a."LOCATIONS_9",a."VECTOR",a."VECTOR3D"
,   (a.name
    ||' is an animal. '
    ||(case when a.CHARACTERISTICS_COMMON_NAME is not null and upper(a.CHARACTERISTICS_COMMON_NAME) != upper(a.name) then '  Its common name is '||a.CHARACTERISTICS_COMMON_NAME ||'. ' end) 
    ||(case when a.CHARACTERISTICS_OTHER_NAME_S_ is not null then '  Its also often referred to as '||a.CHARACTERISTICS_OTHER_NAME_S_ ||'. ' end) 
    ||(case when a.CHARACTERISTICS_GROUP is not null then '  It is part of the group '||a.CHARACTERISTICS_GROUP ||'. ' end) 
    ||(case when a.CHARACTERISTICS_SKIN_TYPE is not null then '  Its skin type is '||a.CHARACTERISTICS_SKIN_TYPE ||'. ' end) 
    ||(case when a.CHARACTERISTICS_LIFESPAN is not null then '  Its life span is '||a.CHARACTERISTICS_LIFESPAN ||'. ' end) 
    ||(case when a.CHARACTERISTICS_DIET is not null then '  With regard to diet it is a '||a.CHARACTERISTICS_DIET ||'. ' end) 
    ||(case when a.CHARACTERISTICS_HEIGHT is not null then '  Its height is generally '||a.CHARACTERISTICS_HEIGHT ||'. ' end) 
    ||(case when a.CHARACTERISTICS_WEIGHT is not null then '  Its weight is generally '||a.CHARACTERISTICS_WEIGHT ||'. ' end) 
    ||(case when a.CHARACTERISTICS_LENGTH is not null then '  Its length is generally '||a.CHARACTERISTICS_LENGTH ||'. ' end) 
    ||(case when a.CHARACTERISTICS_COLOR is not null then '  It is generally one of the following colors: '||a.CHARACTERISTICS_COLOR ||'. ' end) 
    ||(case when a.CHARACTERISTICS_TOP_SPEED is not null then '  It has a top speed of '||a.CHARACTERISTICS_TOP_SPEED ||'. ' end) 
    ||(case when a.CHARACTERISTICS_TEMPERAMENT is not null then '  Its temperament is '||a.CHARACTERISTICS_TEMPERAMENT ||'. ' end) 
    ||(case when a.CHARACTERISTICS_DISTINCTIVE_FEATURE is not null then '  Its distinctive features are '||a.CHARACTERISTICS_DISTINCTIVE_FEATURE ||'. ' end) 
    ||(case when a.CHARACTERISTICS_GROUP_BEHAVIOR is not null then '  Its group behaviour is '||a.CHARACTERISTICS_GROUP_BEHAVIOR ||'. ' end) 
    ||(case when a.CHARACTERISTICS_BIGGEST_THREAT is not null then '  The biggest threat to its survival is '||a.CHARACTERISTICS_BIGGEST_THREAT ||'. ' end) 
    ||' It belongs to the taxonomy kingdom '||a.TAXONOMY_KINGDOM
    ||', the taxonomy genus '||a.TAXONOMY_GENUS
    ||', the taxonomy family '||a.TAXONOMY_FAMILY 
    ||', the taxonomy order '||a.TAXONOMY_ORDER
    ||', the taxonomy phylum '||a.TAXONOMY_PHYLUM
    ||', and the taxonomy class '||a.TAXONOMY_CLASS  ||'.'
    ||(case when a.TAXONOMY_SCIENTIFIC_NAME is not null then '  Its scientific name is '||a.TAXONOMY_SCIENTIFIC_NAME ||'. ' end) 
    ||' It can be found in the following locations: '
            ||(case when a.CHARACTERISTICS_LOCATION is not null then a.CHARACTERISTICS_LOCATION  end)
            ||(case when a.LOCATIONS_0 is not null then ', '|| a.LOCATIONS_0 end)
            ||(case when a.LOCATIONS_1 is not null then ', '|| a.LOCATIONS_1 end)
            ||(case when a.LOCATIONS_2 is not null then ', '|| a.LOCATIONS_2 end)
            ||(case when a.LOCATIONS_3 is not null then ', '|| a.LOCATIONS_3 end)
            ||(case when a.LOCATIONS_4 is not null then ', '|| a.LOCATIONS_4 end)
            ||(case when a.LOCATIONS_5 is not null then ', '|| a.LOCATIONS_5 end)
            ||(case when a.LOCATIONS_6 is not null then ', '|| a.LOCATIONS_6 end)
            ||(case when a.LOCATIONS_7 is not null then ', '|| a.LOCATIONS_7 end)
            ||(case when a.LOCATIONS_8 is not null then ', '|| a.LOCATIONS_8 end)
            ||(case when a.LOCATIONS_9 is not null then ', '|| a.LOCATIONS_9 end)
    ||'.'
    ||(case when a.CHARACTERISTICS_PREY is not null then '  Its prey includes: '||a.CHARACTERISTICS_PREY ||'. ' end) 
    ||(case when a.CHARACTERISTICS_HABITAT is not null then '  Its habitat is '||a.CHARACTERISTICS_HABITAT ||'. ' end) 
    ||(case when a.CHARACTERISTICS_WATER_TYPE is not null then '  It can be found in '||a.CHARACTERISTICS_WATER_TYPE ||' water type. ' end) 
    ||(case when a.CHARACTERISTICS_FAVORITE_FOOD is not null then '  Its favourite food is '||a.CHARACTERISTICS_FAVORITE_FOOD ||'. ' end) 
    ||(case when a.CHARACTERISTICS_PREDATORS is not null then '  Its predators include: '||a.CHARACTERISTICS_PREDATORS ||'. ' end) 
    ||(case when a.CHARACTERISTICS_NAME_OF_YOUNG is not null then '  The young is called '||a.CHARACTERISTICS_NAME_OF_YOUNG ||'. ' end) 
    ||(case when a.CHARACTERISTICS_GESTATION_PERIOD is not null then '  The gestation period is '||a.CHARACTERISTICS_GESTATION_PERIOD ||'. ' end) 
    ||(case when a.CHARACTERISTICS_AVERAGE_LITTER_SIZE is not null then '  The average litter size is '||a.CHARACTERISTICS_AVERAGE_LITTER_SIZE ||'. ' end) 
    ||(case when a.CHARACTERISTICS_SLOGAN is not null then '  An interesting fact about it is: '||a.CHARACTERISTICS_SLOGAN ||'. ' end) 

    ) as description
from 
    (select  *
    from    animals
    ) a; 
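If the SQL above is hard to scan, the same "only mention what's present" concatenation pattern can be sketched in a few lines of Python.  The field names and sentence templates here are a simplified subset of the view's, just to show the shape of the logic:

```python
def build_description(animal: dict) -> str:
    """Mimic the view's CASE/concatenation logic: each optional
    characteristic only appears in the sentence when it has a value."""
    parts = [f"{animal['name']} is an animal."]
    optional = [
        ("diet", "With regard to diet it is a {}."),
        ("habitat", "Its habitat is {}."),
        ("lifespan", "Its life span is {}."),
    ]
    for key, template in optional:
        value = animal.get(key)
        if value:  # skip NULL/missing fields, like CASE WHEN ... IS NOT NULL
            parts.append(template.format(value))
    return " ".join(parts)

print(build_description({"name": "Arctic Fox", "diet": "Carnivore", "habitat": None}))
# Arctic Fox is an animal. With regard to diet it is a Carnivore.
```

The point in both versions is the same: absent fields are silently dropped rather than producing awkward sentences like "Its habitat is null", which would pollute the embedding.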




4. Create Vector Embeddings


To create our embeddings we first need to load a vector embedding model into the database.  This needs to be in ONNX (Open Neural Network Exchange) format and is loaded using the new 23ai DBMS_VECTOR package.  


The model I used is the all-MiniLM-L6-v2 sentence transformer model.  You should be able to get it already in ONNX format here, but if you want to download the original model and convert it to ONNX yourself (using the Oracle OML4Py utility) then follow these instructions.


(FYI, ONNX is an open-source format for representing deep learning models.  It facilitates interoperability between different deep learning frameworks, so a model trained in one framework can easily be used in another without model conversion or retraining.)


Run the following (as ADMIN) to grant the privilege needed to create and perform operations on models; granting the new 23ai DB_DEVELOPER_ROLE will ensure all other privileges are in place.


GRANT create mining model TO <default schema>;
GRANT db_developer_role TO <default schema>;

Next, a directory needs to be created to hold the data model, and appropriate privileges granted on that directory.  Again as ADMIN run:


CREATE OR REPLACE DIRECTORY data_models as '/home/oracle/models';
GRANT READ, WRITE ON DIRECTORY data_models TO <default schema>;


Now we want to get the model loaded into the directory.  The following will load the ONNX-formatted model described in the previous section, which you should be able to get here (if it's not available then you will have to get the original model and convert it yourself; see above).  Run as ADMIN.


BEGIN 
   DBMS_CLOUD.GET_OBJECT(
     credential_name => 'OCI$RESOURCE_PRINCIPAL',
     object_uri => 'https://objectstorage.eu-frankfurt-1.oraclecloud.com/n/fro8fl9kuqli/b/AIVECTORS/o/all-MiniLM-L6-v2.onnx',
     directory_name => 'DATA_MODELS'); 
END;
/


Once the model is available in the DATA_MODELS directory, we can load it into the database to make it available for creating our embeddings.  Run as ADMIN.

    
EXECUTE DBMS_VECTOR.LOAD_ONNX_MODEL('DATA_MODELS','all-MiniLM-L6-v2.onnx','allMiniLML6v2_model');


Finally, we create the vector embeddings (in our default schema): 


UPDATE animals_vw a 
SET a.vector= TO_VECTOR(VECTOR_EMBEDDING(allMiniLML6v2_model USING a.description AS data));    


To check that the model was created correctly run the following query (as ADMIN):


SELECT model_name, mining_function, algorithm, algorithm_type, model_size
FROM user_mining_models
WHERE model_name = 'ALLMINILML6V2_MODEL'
ORDER BY model_name; 

If you count the dimensions in each embedding, you should find that each vector contains 384, which is exactly what we'd expect from the all-MiniLM-L6-v2 model!



5. Perform Dimensionality Reduction

Now that we have the high-dimension vector embeddings sorted, we can look to reduce them down to just 3 dimensions.

I chose to do this part in Node.js using the Node OracleDB driver.  As an alternative you could do this in Python instead using the Python OracleDB driver.

The node-oracledb driver handles the connection with the OCI ADB, but we also need an NPM package that can handle the dimensionality reduction.  This is where the ml-pca NPM package comes into play.  I chose to do the reduction using Principal Component Analysis (PCA) over other techniques (such as t-SNE or UMAP) as it's pretty simple and intuitive, although it can be a bit susceptible to outliers.

Here's the Node.js script.  You will need to set values for a few environment variables: CONNECTDBUSER (default schema name), CONNECTDBPW (default schema password) and CONNECTDBCS (connect string).  The connect string is available in the OCI console under the details of the ADB (click on 'Database Connection').


const oracledb = require('oracledb');
const { PCA } = require('ml-pca');

oracledb.fetchAsString = [ oracledb.CLOB ];

async function reduceDimensions() {
    if (process.argv.length === 2) {
        throw new Error('Expected at least one argument!');
    }

    const ds = process.argv[2]; //table name
    const idCol = process.argv[3] || 'id'; //ID Column
    const highDim = process.argv[4] || 'vector'; //High Dimension Vector Column
    const lowDim = process.argv[5] || 'vector3d'; //Low Dimension Vector Column
    const components = 3; //No of Low Dimensions 

    console.log('Vector Dimension Table: ' + ds);
    console.log('ID Column: ' + idCol);
    console.log('High Dimension Column: ' + highDim);
    console.log('Low Dimension Column: ' + lowDim);
    console.log('No of Low Dimensions: ' + components);


    let connection;
    // Add the DB user credentials
    // and connect string
    
    const dbConfig = {
        user: process.env.CONNECTDBUSER,
        password: process.env.CONNECTDBPW,
        connectString: process.env.CONNECTDBCS
    };

    try {
        // Get a standalone Oracle Database connection
        connection = await oracledb.getConnection(dbConfig);

        // Connect only to Oracle Database 23ai, which supports vectors
        if (connection.oracleServerVersion < 2304000000) {
            throw new Error('This example requires Oracle Database 23ai or later');
        }
        console.log('Connected to Oracle Database');
        console.log('Getting the high dimension vector data for dataset: ' + ds);

        // Fetch all high dimension vectors as a JSON array (CLOB)
        const gdfn = `BEGIN :dim := dimensionReduction.getDimensions(:dataSet, :highDimVectorCol, :idCol); END;`;
        const gdbinds = { dataSet: ds, highDimVectorCol: highDim, idCol: idCol, dim: { type: oracledb.CLOB, dir: oracledb.BIND_OUT } };
        const gdresult = await connection.execute(gdfn, gdbinds);
        const dim = await gdresult.outBinds.dim.getData();
        if (dim === null || dim === undefined) {
            throw new Error('No dimensions returned.');
        }
        const dimJSON = JSON.parse(dim);

        console.log('Reducing to 3 dimensional vectors using PCA (Principal Component Analysis)');
        // Use PCA to reduce each vector to 3 dimensions
        const pca = new PCA(dimJSON, { scale: true });
        const dim3d = pca.predict(dimJSON, { nComponents: components });
        console.log('PCA dimension reduction complete. Output sample: ');
        console.log(dim3d);

        console.log('Converting PCA matrix to a JSON array.');
        // Convert the PCA matrix output to a JSON array for sending to the DB
        const dim3dJSON = JSON.stringify(dim3d.toJSON());

        console.log('Updating the database table with the new 3d vectors');
        // Send the array as a CLOB to the stored procedure which will do the updating
        const drfn = `BEGIN :sf := dimensionReduction.processPCARows(:dataSet, :lowDimVectorCol, :idCol, :newjs); END;`;
        const drbinds = { sf: { type: oracledb.STRING, dir: oracledb.BIND_OUT }, dataSet: ds, lowDimVectorCol: lowDim, idCol: idCol, newjs: dim3dJSON };
        const drresult = await connection.execute(drfn, drbinds);
        const bl = drresult.outBinds.sf;
        if (bl) {
            throw new Error('Process to update 3d vectors failed: ' + bl);
        }
        console.log('Dimension reduction process now complete.');
    } catch (err) {
        console.error(err);
    } finally {
        if (connection) await connection.close();
    }
}

reduceDimensions();

Here's the PL/SQL package called by the Node script (create in the default schema):


The Package Spec:


create or replace package  dimensionReduction 
is
    function getDimensions(pDataTable       in   varchar2,
                           pHighDimVector   in   varchar2   default 'vector',
                           pIDColumn        in   varchar2   default 'id')  
    return clob;

    function processPCARows(pDataTable      in   varchar2,
                            pLowDimVector   in   varchar2   default 'vector3d',
                            pIDColumn       in   varchar2   default 'id',
                            p3DVector       in   clob) 
    return varchar2;
end dimensionReduction;




.... and the Package Body:


create or replace package body dimensionReduction 
is
    
    function getDimensions(pDataTable       in   varchar2,
                           pHighDimVector   in   varchar2   default 'vector',
                           pIDColumn        in   varchar2   default 'id') 
    return clob is
        l_dim  clob;
    begin
        execute immediate 
        'select  JSON_ARRAYAGG(ds.'||sys.DBMS_ASSERT.SIMPLE_SQL_NAME(pHighDimVector)||' order by ds.'||sys.DBMS_ASSERT.SIMPLE_SQL_NAME(pIDColumn)||' returning clob ) dim from '||sys.DBMS_ASSERT.SQL_OBJECT_NAME(pDataTable) ||' ds'
        into l_dim;

        return l_dim;
    exception
        when others  THEN
            return null;
    end getDimensions;





    function processPCARows(pDataTable      in   varchar2,
                            pLowDimVector   in   varchar2   default 'vector3d',
                            pIDColumn       in   varchar2   default 'id',
                            p3DVector       in   clob)
    return varchar2 is
        l_err   varchar2(200); 
    begin

        -- Split the 3d vector arrays and update each 3d vector column
        execute immediate 
        q'! merge into !'||sys.DBMS_ASSERT.SQL_OBJECT_NAME(pDataTable) ||q'! a
        using (
            select  rownum rn,
             to_vector('['||jt.key1||','||jt.key2||','||jt.key3||']', 3, FLOAT64) vector
                from    dual, json_table(:p3DVector format json, '$[*]'
                        columns(
                            key1  PATH '$[0]',
                            key2  PATH '$[1]',
                            key3  PATH '$[2]'
                        )
                    ) 
                jt
        ) vec
        on (a.!'||sys.DBMS_ASSERT.SIMPLE_SQL_NAME(pIDColumn)||q'! = vec.rn)
        when matched then
            update set a.!'||sys.DBMS_ASSERT.SIMPLE_SQL_NAME(pLowDimVector)||q'! = vec.vector !'
        using p3DVector;

        commit;

        return null;

    exception
        when others THEN
            l_err := 'In dimensionReduction.processPCARows procedure when updating '||pDataTable||' dataset: '||substr(sqlerrm, 1, 100);
            return l_err;
    end processPCARows;

end dimensionReduction;
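One thing worth calling out: the MERGE in processPCARows pairs each [x, y, z] triple with a row positionally (rownum against the ID column), so it assumes the IDs run 1..N with no gaps, in the same order the high-dimension vectors were aggregated by getDimensions.  A small Python sketch of that positional pairing, with made-up numbers:

```python
import json

# Stand-in for the PCA step's output: one [x, y, z] triple per row,
# in the same order the high-dimension vectors were fetched (ID order).
dim3d_json = '[[0.12, -0.83, 0.44], [0.31, 0.05, -0.92]]'

# Pair triples with rows positionally, mirroring the MERGE's
# "rownum = ID" join condition: position 1 -> ID 1, position 2 -> ID 2, ...
updates = {
    row_id: vector
    for row_id, vector in enumerate(json.loads(dim3d_json), start=1)
}
print(updates)  # {1: [0.12, -0.83, 0.44], 2: [0.31, 0.05, -0.92]}
```

If your table's IDs have gaps or a different ordering, the rows and vectors will be mismatched, so it may be safer to have getDimensions return the IDs alongside the vectors and join on those instead.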

The Node.js script takes a number of command line parameters:

  • The name of the table holding the vector embeddings (mandatory)
  • An ID column for the table (optional, defaults to 'ID')
  • The name of the vector column holding the high dimension embeddings (source) (optional, defaults to 'VECTOR')
  • The name of the vector column holding the low dimension embeddings (target) (optional, defaults to 'VECTOR3D')

Here's an example of how to run on the command line with all parameters being used:


node dimensionReduction.js animals id vector vector3d



6. View the 3D Vector Embeddings


Finally, we are now in a position where we can utilise the low-dimension vector embeddings with a tool that enables visualisation.  I've chosen APEX for this, as it's my go-to tool of choice, but Oracle Analytics Cloud would be a good choice too.  There's also a nice embedding projector visualiser available from TensorFlow, but you would need to download the 3D vector data to a file first.


The following image demonstrates nicely what I meant by "...semantic similarity of objects into proximity in a mathematical vector space".  In this example, the 3D scatter plot shows markers for mammals of the family 'Canidae' (which includes domestic dogs, wolves, coyotes, foxes, jackals, dingoes, and many other extant and extinct dog-like mammals).  You can see that foxes are grouped in a similar space towards the edge of the plot, whereas dogs are generally together in the middle.  There are likely to be outliers amongst the data, because we were working with a fairly narrow set of characteristics for each animal and many of those fields had no data present, but in the main I think it's pretty good, and I feel it achieves the aim of this post: to reduce complex embeddings into something we can use to visualise our data.


The demo APEX app can be viewed using this link: Vector Visualisation (username/password: demo/demo), but if you want to download a copy of it then it's available here.  The download includes a couple of supporting objects that the app needs. 


As mentioned already, it's fairly basic at present and needs a bit of work to improve the performance, especially around loading the marker labels.


I also plan to make a number of improvements, including the use of colour and shapes for the markers, adding more datasets, and creating new visualisations including heatmaps.  I'll do another blog post when I have something to show you.


I hope that this has been helpful!

