
Parsing logs from multiple data sources with Ahana and Cube


Ahana provides managed Presto clusters running in your AWS account.

Presto is an open-source distributed SQL query engine, originally developed at Facebook and now hosted under the Linux Foundation. It connects to multiple databases or other data sources (for example, Amazon S3). We can use a Presto cluster as a single compute engine for an entire data lake.

Presto implements data federation: you can process data from multiple sources as if they were stored in a single database. Because of that, you don't need a separate ETL (Extract-Transform-Load) pipeline to prepare the data before using it. However, running and configuring a single point of access for multiple databases (or file systems) requires Ops experience and additional effort.

No data engineer wants to do that Ops work. Using Ahana, you can deploy a Presto cluster within minutes without spending hours configuring the service, VPCs, and AWS access rights. Ahana hides the burden of infrastructure management and lets you focus on processing your data.



What is Cube?

Cube is a headless BI platform for accessing, organizing, and delivering data. Cube connects to many data warehouses, databases, and query engines, including Presto, and lets you quickly build data applications or analyze your data in BI tools. It serves as the single source of truth for your business metrics.

Cube is headless BI

This article will demonstrate Cube's caching functionality, access control, and the flexibility of its data retrieval API.



Integration

Cube's battle-tested Presto driver provides out-of-the-box connectivity to Ahana.

You just need to provide the credentials: the Presto host name and port, the user name and password, and the Presto catalog and schema. You will also need to set CUBEJS_DB_SSL to true since Ahana secures Presto connections with SSL.

Check the docs to learn more about connecting Cube to Ahana.
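
For reference, these connection settings end up as environment variables in Cube. A minimal sketch of such a configuration could look like the one below; the placeholder values are ours, and the exact variable names (apart from CUBEJS_DB_SSL, mentioned above) should be double-checked against the Cube documentation:

CUBEJS_DB_TYPE=prestodb
CUBEJS_DB_HOST=<your Ahana Presto cluster host>
CUBEJS_DB_PORT=443
CUBEJS_DB_USER=<presto user>
CUBEJS_DB_PASS=<presto password>
CUBEJS_DB_PRESTO_CATALOG=<catalog>
CUBEJS_DB_SCHEMA=<schema>
CUBEJS_DB_SSL=true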



Example: Parsing logs from multiple data sources with Ahana and Cube

Let's build a real-world data application with Ahana and Cube.

We'll use Ahana to join Amazon Sagemaker Endpoint logs stored as JSON files in S3 with data retrieved from a PostgreSQL database.

Suppose you work at a software house specializing in training ML models for your clients and delivering ML inference as a REST API. You have just trained new versions of all models, and you need to demonstrate the improvements to the clients.

Because of that, you do a canary deployment of the new versions and gather the predictions from the new and the old models using the built-in logging functionality of AWS Sagemaker Endpoints: a managed deployment environment for machine learning models. Additionally, you track the actual production values provided by your clients.

You need all of that to prepare personalized dashboards showing the results of your hard work.

Let us show you how Ahana and Cube work together to help you achieve your goal quickly without spending days reading cryptic documentation.

You will retrieve the prediction logs from an S3 bucket and merge them with the actual values stored in a PostgreSQL database. After that, you will calculate the ML performance metrics, implement access control, and hide the data source complexity behind an easy-to-use REST API.

Architecture diagram

In the end, you want a dashboard looking like this:

The final result: two dashboards showing the number of errors made by two variants of the ML model



How to configure Ahana?



Allowing Ahana to access your AWS account

First, let's log in to Ahana and connect it to your AWS account. We must create an IAM role allowing Ahana to access our AWS account.

On the setup page, click the "Open CloudFormation" button. After clicking the button, we get redirected to the AWS page for creating a new CloudFormation stack from a template provided by Ahana. Create the stack and wait until CloudFormation finishes the setup.

When the IAM role is configured, click the stack's Outputs tab and copy the AhanaCloudProvisioningRole key value.

The Outputs tab containing the identifier of the IAM role for Ahana

We have to paste it into the Role ARN field on the Ahana setup page and click the "Complete Setup" button.

The Ahana setup page



Creating an Ahana cluster

After configuring AWS access, we have to start a new Ahana cluster.

In the Ahana dashboard, click the "Create new cluster" button.

Click the “Create new cluster” button displayed here

In the setup window, we can configure the type of AWS EC2 instances used by the cluster, the scaling strategy, and the Hive Metastore. If you need a detailed description of the configuration options, take a look at the "Create new cluster" section of the Ahana documentation.

Ahana cluster setup page

Remember to add at least one user to your cluster! When we are satisfied with the configuration, we can click the "Create cluster" button. Ahana needs around 20-30 minutes to set up a new cluster.



Retrieving data from S3 and PostgreSQL

After deploying a Presto cluster, we have to connect our data sources to it because, in this example, the Sagemaker Endpoint logs are stored in S3 and the actual values in PostgreSQL.



Adding a PostgreSQL database to Ahana

In the Ahana dashboard, click the "Add new data source" button. We'll see a page showing all supported data sources. Let's click the "Amazon RDS for PostgreSQL" option.

In the setup form displayed below, we have to provide the database configuration and click the "Add data source" button.

PostgreSQL data source configuration



Adding an S3 bucket to Ahana

AWS Sagemaker Endpoints store their logs in an S3 bucket as JSON files. To access these files in Presto, we need to configure the AWS Glue data catalog and add the data catalog to the Ahana cluster.

We have to log in to the AWS console, open the AWS Glue page, and add a new database to the data catalog (or use an existing one).

AWS Glue databases

Now, let's add a new table. We won't configure it manually. Instead, let's create a Glue crawler to generate the table definition automatically. On the AWS Glue page, we have to click the "Crawlers" link and then the "Add crawler" button.

AWS Glue crawlers

After typing the crawler's name and clicking the "Next" button, we will see the Source Type page. On this page, we have to choose "Data stores" and "Crawl all folders" (in our case, "Crawl new folders only" would work too).

Here we specify where the crawler should look for new data

On the "Data store" page, we pick the S3 data store, select the S3 connection (or click the "Add connection" button if we don't have an S3 connection configured yet), and specify the S3 path.

Note that Sagemaker Endpoints store logs in subkeys using the following key structure: endpoint-name/model-variant/year/month/day/hour. We want to use these parts of the key as table partitions.

Because of that, if our Sagemaker logs have an S3 key like s3://the_bucket_name/sagemaker/logs/endpoint-name/model-variant-name/year/month/day/hour, we put only the s3://the_bucket_name/sagemaker/logs key prefix in the setup window!
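
To illustrate, with that prefix the captured log files might sit under keys like the ones below (the endpoint and variant names are hypothetical; Sagemaker data capture typically writes .jsonl files). The crawler will then expose the endpoint name, variant name, year, month, day, and hour as partition columns:

s3://the_bucket_name/sagemaker/logs/my-endpoint/variant-a/2022/05/06/10/prediction-log.jsonl
s3://the_bucket_name/sagemaker/logs/my-endpoint/variant-b/2022/05/06/10/prediction-log.jsonl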

Specifying the S3 path in the crawler's data store configuration

Let's click the "Next" button. In the next window, we choose "No" when asked whether we want to configure another data source. The Glue setup will then ask for the name of the crawler's IAM role. We can create a new one:

IAM role configuration

Next, we configure the crawler's schedule. A Sagemaker Endpoint adds new log files in near real-time. Because of that, it makes sense to scan the files and add new partitions every hour:

Configuring the crawler’s schedule

In the output configuration, we need to customize the settings.

First, let's select the Glue database where the new tables get stored. After that, we modify the "Configuration options."

We pick "Add new columns only" because we will make manual changes to the table definition, and we don't want the crawler to overwrite them. Also, we want to add new partitions to the table, so we check the "Update all new and existing partitions with metadata from the table." box.

Crawler’s output configuration

Let's click "Next." We can check the configuration one more time in the review window and click the "Finish" button.

Now, we can wait until the crawler runs or open the AWS Glue Crawlers view and trigger the run manually. When the crawler finishes running, we go to the Tables view in AWS Glue and click the table name.

AWS Glue tables

In the table view, we click the "Edit table" button and change the "Serde serialization lib" to "org.apache.hive.hcatalog.data.JsonSerDe" because the AWS JSON serialization library isn't available in the Ahana Presto cluster.

JSON serialization configured in the table details view

We should also click the "Edit schema" button and change the default partition names to the values shown in the screenshot below:

Default partition names replaced with their actual names

After saving the changes, we can add the Glue data catalog to our Ahana Presto cluster.



Configuring data sources in the Presto cluster

Go back to the Ahana dashboard and click the "Add data source" button. Select the "AWS Glue Data Catalog for Amazon S3" option in the setup form.

AWS Glue data catalog setup in Ahana

Let's select our AWS region and put the AWS account id in the "Glue Data Catalog ID" field. After that, we click the "Open CloudFormation" button and apply the template. We must wait until CloudFormation creates the IAM role.

When the role is ready, we copy the role ARN from the Outputs tab and paste it into the "Glue/S3 Role ARN" field:

The “Outputs” tab shows the ARN of the IAM role used to access the Glue data catalog from Ahana

On the Ahana setup page, we click the "Add data source" button.



Adding data sources to an existing cluster

Finally, we can add both data sources to our Ahana cluster.

We have to open the Ahana "Clusters" page, click the "Manage" button, and scroll down to the "Data Sources" section. In this section, we click the "Manage data sources" button.

We'll see another setup page where we check the boxes next to the data sources we want to configure and click the "Modify cluster" button. We will need to confirm that we want to restart the cluster to apply the changes.

Adding data sources to an Ahana cluster



Writing the Presto queries

Before configuring Cube, let's write the Presto queries to retrieve the data we want.

The exact structure of the input and output of an AWS Sagemaker Endpoint is up to us. We can send any JSON request and return a custom JSON object.

Let's assume that our endpoint receives a request containing the input data for the machine learning model and a correlation id. We'll need these ids to join the model predictions with the actual data.

Example input:

{"time_series": [51, 37, , 7], "correlation_id": "cf8b7b9a-6b8a-45fe-9814-11a4b17c710a"}

In the response, the model returns a JSON object with a single "prediction" key and a decimal value:

{"prediction": 21.266147618448954}


A single request in the Sagemaker Endpoint logs looks like this:

{"captureData": {"endpointInput": {"observedContentType": "application/json", "mode": "INPUT", "data": "eyJ0aW1lX3NlcmllcyI6IFs1MS40MjM5MjAzODYxNTAzODUsIDM3LjUwOTk2ODc2MTYwNzM0LCAzNi41NTk4MzI2OTQ0NjAwNTYsIDY0LjAyMTU3MzEyNjYyNDg0LCA2MC4zMjkwMzU2MDgyMjIwODUsIDIyLjk1MDg0MjgxNDg4MzExLCA0NC45MjQxNTU5MTE1MTQyOCwgMzkuMDM1NzA4Mjg4ODc2ODA1LCAyMC44NzQ0Njk2OTM0MzAxMTUsIDQ3Ljc4MzY3MDQ3MjI2MDI1NSwgMzcuNTgxMDYzNzUyNjY5NTE1LCA1OC4xMTc2MzQ5NjE5NDM4OCwgMzYuODgwNzExNTAyNDIxMywgMzkuNzE1Mjg4NTM5NzY5ODksIDUxLjkxMDYxODYyNzg0ODYyLCA0OS40Mzk4MjQwMTQ0NDM2OCwgNDIuODM5OTA5MDIxMDkwMzksIDI3LjYwOTU0MTY5MDYyNzkzLCAzOS44MDczNzU1NDQwODYyOCwgMzUuMTA2OTQ4MzI5NjQwOF0sICJjb3JyZWxhdGlvbl9pZCI6ICJjZjhiN2I5YS02YjhhLTQ1ZmUtOTgxNC0xMWE0YjE3YzcxMGEifQ==", "encoding": "BASE64"}, "endpointOutput": {"observedContentType": "application/json", "mode": "OUTPUT", "data": "eyJwcmVkaWN0aW9uIjogMjEuMjY2MTQ3NjE4NDQ4OTU0fQ==", "encoding": "BASE64"}}, "eventMetadata": {"eventId": "b409a948-fbc7-4fa6-8544-c7e85d1b7e21", "inferenceTime": "2022-05-06T10:23:19Z"}}

AWS Sagemaker Endpoints encode the request and response using base64. Our query needs to decode the data before we can process it. Because of that, our Presto query starts with data decoding:

with sagemaker as (
  select
  model_name,
  variant_name,
  cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointinput.data)), '$.correlation_id') as varchar) as correlation_id,
  cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointoutput.data)), '$.prediction') as double) as prediction
  from s3.sagemaker_logs.logs
)
, actual as (
  select correlation_id, actual_value
  from postgresql.public.actual_values
)

After that, we join both data sources and calculate the absolute error value:

, logs as (
  select model_name, variant_name as model_variant, sagemaker.correlation_id, prediction, actual_value as actual
  from sagemaker
  left outer join actual
  on sagemaker.correlation_id = actual.correlation_id
)
, errors as (
  select abs(prediction - actual) as abs_err, model_name, model_variant from logs
),

Now, we need to calculate the percentiles using the approx_percentile function. Note that we group the percentiles by model name and model variant. Because of that, Presto will produce only a single row per model-variant pair. That'll be important when we write the second part of this query.

percentiles as (
  select approx_percentile(abs_err, 0.1) as perc_10,
  approx_percentile(abs_err, 0.2) as perc_20,
  approx_percentile(abs_err, 0.3) as perc_30,
  approx_percentile(abs_err, 0.4) as perc_40,
  approx_percentile(abs_err, 0.5) as perc_50,
  approx_percentile(abs_err, 0.6) as perc_60,
  approx_percentile(abs_err, 0.7) as perc_70,
  approx_percentile(abs_err, 0.8) as perc_80,
  approx_percentile(abs_err, 0.9) as perc_90,
  approx_percentile(abs_err, 1.0) as perc_100,
  model_name,
  model_variant
  from errors
  group by model_name, model_variant
)

In the final part of the query, we use a filter expression to count the number of values within each bucket. Additionally, we return the bucket boundaries. We need to wrap the boundaries in an aggregate function, max (any other aggregate function would do), because of the group by clause. That won't affect the result because the previous query returned a single row per model-variant pair. For example, if perc_10 is 1.5 and perc_20 is 2.3, the first bucket counts errors up to 1.5 and the second bucket counts errors greater than 1.5 and up to 2.3.

SELECT count(*) FILTER (WHERE e.abs_err <= perc_10) AS perc_10
, max(perc_10) as perc_10_value
, count(*) FILTER (WHERE e.abs_err > perc_10 and e.abs_err <= perc_20) AS perc_20
, max(perc_20) as perc_20_value
, count(*) FILTER (WHERE e.abs_err > perc_20 and e.abs_err <= perc_30) AS perc_30
, max(perc_30) as perc_30_value
, count(*) FILTER (WHERE e.abs_err > perc_30 and e.abs_err <= perc_40) AS perc_40
, max(perc_40) as perc_40_value
, count(*) FILTER (WHERE e.abs_err > perc_40 and e.abs_err <= perc_50) AS perc_50
, max(perc_50) as perc_50_value
, count(*) FILTER (WHERE e.abs_err > perc_50 and e.abs_err <= perc_60) AS perc_60
, max(perc_60) as perc_60_value
, count(*) FILTER (WHERE e.abs_err > perc_60 and e.abs_err <= perc_70) AS perc_70
, max(perc_70) as perc_70_value
, count(*) FILTER (WHERE e.abs_err > perc_70 and e.abs_err <= perc_80) AS perc_80
, max(perc_80) as perc_80_value
, count(*) FILTER (WHERE e.abs_err > perc_80 and e.abs_err <= perc_90) AS perc_90
, max(perc_90) as perc_90_value
, count(*) FILTER (WHERE e.abs_err > perc_90 and e.abs_err <= perc_100) AS perc_100
, max(perc_100) as perc_100_value
, p.model_name, p.model_variant
FROM percentiles p, errors e group by p.model_name, p.model_variant



How to configure Cube?

In our application, we want to display the distribution of absolute prediction errors.

We'll have a chart showing the difference between the actual value and the model's prediction. The chart will split the absolute errors into buckets (percentiles) and display the number of errors within each bucket.

If the new variant of the model performs better than the current one, we should see fewer large errors in its chart. A perfect (and unrealistic) model would produce a single error bar in the left-most part of the chart with the "0" label.

At the beginning of the article, we looked at an example chart that shows no significant difference between the two model variants:

Both models perform almost the same

If variant B were better than variant A, its chart might look like this (note the axis values in both pictures):

An improved second version of the model



Creating a Cube deployment

Cube Cloud is the easiest way to get started with Cube. It provides a fully managed, ready-to-use Cube cluster. However, if you prefer self-hosting, follow this tutorial.

First, please create a new Cube Cloud deployment. Then, open the "Deployments" page and click the "Create deployment" button.

Cube Deployments dashboard page

We choose the Presto cluster:

Database connections supported by Cube

Finally, we fill out the connection parameters and click the "Apply" button. Remember to enable the SSL connection!

Presto configuration page



Defining the data model in Cube

We have our queries ready to copy-paste, and we have configured a Presto connection in Cube. Now, we can define the Cube schema to retrieve the query results.

Let's open the Schema view in Cube and add a new file.

The schema view in Cube showing where we should click to create a new file.

In the next window, type the file name errorpercentiles.js and click "Create file."

The “Add a new file” window.

In the following paragraphs, we explain parts of the configuration and show you code fragments to copy-paste. You don't have to do it in such small steps!

Below, you see the entire content of the file. Later, we explain the configuration parameters.

const measureNames = [
  'perc_10', 'perc_10_value',
  'perc_20', 'perc_20_value',
  'perc_30', 'perc_30_value',
  'perc_40', 'perc_40_value',
  'perc_50', 'perc_50_value',
  'perc_60', 'perc_60_value',
  'perc_70', 'perc_70_value',
  'perc_80', 'perc_80_value',
  'perc_90', 'perc_90_value',
  'perc_100', 'perc_100_value',
];

const measures = Object.keys(measureNames).reduce((result, name) => {
  const sqlName = measureNames[name];
  return {
    ...result,
    [sqlName]: {
      sql: () => sqlName,
      type: `max`
    }
  };
}, {});

cube('errorpercentiles', {
  sql: `with sagemaker as (
    select
    model_name,
    variant_name,
    cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointinput.data)), '$.correlation_id') as varchar) as correlation_id,
    cast(json_extract(FROM_UTF8( from_base64(capturedata.endpointoutput.data)), '$.prediction') as double) as prediction
    from s3.sagemaker_logs.logs
  )
, actual as (
  select correlation_id, actual_value
  from postgresql.public.actual_values
)
, logs as (
  select model_name, variant_name as model_variant, sagemaker.correlation_id, prediction, actual_value as actual
  from sagemaker
  left outer join actual
  on sagemaker.correlation_id = actual.correlation_id
)
, errors as (
  select abs(prediction - actual) as abs_err, model_name, model_variant from logs
),
percentiles as (
  select approx_percentile(abs_err, 0.1) as perc_10,
  approx_percentile(abs_err, 0.2) as perc_20,
  approx_percentile(abs_err, 0.3) as perc_30,
  approx_percentile(abs_err, 0.4) as perc_40,
  approx_percentile(abs_err, 0.5) as perc_50,
  approx_percentile(abs_err, 0.6) as perc_60,
  approx_percentile(abs_err, 0.7) as perc_70,
  approx_percentile(abs_err, 0.8) as perc_80,
  approx_percentile(abs_err, 0.9) as perc_90,
  approx_percentile(abs_err, 1.0) as perc_100,
  model_name,
  model_variant
  from errors
  group by model_name, model_variant
)
SELECT count(*) FILTER (WHERE e.abs_err <= perc_10) AS perc_10
, max(perc_10) as perc_10_value
, count(*) FILTER (WHERE e.abs_err > perc_10 and e.abs_err <= perc_20) AS perc_20
, max(perc_20) as perc_20_value
, count(*) FILTER (WHERE e.abs_err > perc_20 and e.abs_err <= perc_30) AS perc_30
, max(perc_30) as perc_30_value
, count(*) FILTER (WHERE e.abs_err > perc_30 and e.abs_err <= perc_40) AS perc_40
, max(perc_40) as perc_40_value
, count(*) FILTER (WHERE e.abs_err > perc_40 and e.abs_err <= perc_50) AS perc_50
, max(perc_50) as perc_50_value
, count(*) FILTER (WHERE e.abs_err > perc_50 and e.abs_err <= perc_60) AS perc_60
, max(perc_60) as perc_60_value
, count(*) FILTER (WHERE e.abs_err > perc_60 and e.abs_err <= perc_70) AS perc_70
, max(perc_70) as perc_70_value
, count(*) FILTER (WHERE e.abs_err > perc_70 and e.abs_err <= perc_80) AS perc_80
, max(perc_80) as perc_80_value
, count(*) FILTER (WHERE e.abs_err > perc_80 and e.abs_err <= perc_90) AS perc_90
, max(perc_90) as perc_90_value
, count(*) FILTER (WHERE e.abs_err > perc_90 and e.abs_err <= perc_100) AS perc_100
, max(perc_100) as perc_100_value
, p.model_name, p.model_variant
FROM percentiles p, errors e group by p.model_name, p.model_variant`,

preAggregations: {
// Pre-aggregation definitions go here
// Learn more here: https://cube.dev/docs/caching/pre-aggregations/getting-started
},

joins: {
},

measures: measures,
dimensions: {
  modelVariant: {
    sql: `model_variant`,
    type: 'string'
  },
  modelName: {
    sql: `model_name`,
    type: 'string'
  },
}
});

In the sql property, we put the query prepared earlier. Note that your query MUST NOT contain a semicolon.

A newly created cube configuration file.

We'll group and filter the values by the model and variant names, so we put these columns in the dimensions section of the cube configuration. The rest of the columns are going to be our measures. We can write them out one by one like this:

measures: {
  perc_10: {
    sql: `perc_10`,
    type: `max`
  },
  perc_20: {
    sql: `perc_20`,
    type: `max`
  },
  perc_30: {
    sql: `perc_30`,
    type: `max`
  },
  perc_40: {
    sql: `perc_40`,
    type: `max`
  },
  perc_50: {
    sql: `perc_50`,
    type: `max`
  },
  perc_60: {
    sql: `perc_60`,
    type: `max`
  },
  perc_70: {
    sql: `perc_70`,
    type: `max`
  },
  perc_80: {
    sql: `perc_80`,
    type: `max`
  },
  perc_90: {
    sql: `perc_90`,
    type: `max`
  },
  perc_100: {
    sql: `perc_100`,
    type: `max`
  },
  perc_10_value: {
    sql: `perc_10_value`,
    type: `max`
  },
  perc_20_value: {
    sql: `perc_20_value`,
    type: `max`
  },
  perc_30_value: {
    sql: `perc_30_value`,
    type: `max`
  },
  perc_40_value: {
    sql: `perc_40_value`,
    type: `max`
  },
  perc_50_value: {
    sql: `perc_50_value`,
    type: `max`
  },
  perc_60_value: {
    sql: `perc_60_value`,
    type: `max`
  },
  perc_70_value: {
    sql: `perc_70_value`,
    type: `max`
  },
  perc_80_value: {
    sql: `perc_80_value`,
    type: `max`
  },
  perc_90_value: {
    sql: `perc_90_value`,
    type: `max`
  },
  perc_100_value: {
    sql: `perc_100_value`,
    type: `max`
  }
},
dimensions: {
  modelVariant: {
    sql: `model_variant`,
    type: 'string'
  },
  modelName: {
    sql: `model_name`,
    type: 'string'
  },
}

A part of the error percentiles configuration in Cube

The notation we have shown you is quite verbose and repetitive. We can shorten the measure definitions by using JavaScript to generate them.

We have to add the following code before calling the cube function!

First, we create an array of column names:

const measureNames = [
  'perc_10', 'perc_10_value',
  'perc_20', 'perc_20_value',
  'perc_30', 'perc_30_value',
  'perc_40', 'perc_40_value',
  'perc_50', 'perc_50_value',
  'perc_60', 'perc_60_value',
  'perc_70', 'perc_70_value',
  'perc_80', 'perc_80_value',
  'perc_90', 'perc_90_value',
  'perc_100', 'perc_100_value',
];

Now, we must generate the measures configuration object. We iterate over the array and create a measure configuration for every column:

const measures = Object.keys(measureNames).reduce((result, name) => {
  const sqlName = measureNames[name];
  return {
    ...result,
    [sqlName]: {
      sql: () => sqlName,
      type: `max`
    }
  };
}, {});

Finally, we can replace the measure definitions with:

measures: measures,

After changing the file content, click the "Save All" button.

The top section of the schema view.

And click the Continue button in the popup window.

The popup window shows the URL of the test API.

In the Playground view, we can test our query by retrieving the chart data as a table (or as one of the built-in charts):

An example result in the Playground view.



Configuring access control in Cube

In the Schema view, open the cube.js file.

We'll use the queryRewrite configuration option to allow or disallow access to data.

First, we will reject all API calls without the models field in the securityContext. We'll put the identifiers of the models the user is allowed to see in their JWT token. The security context contains all of the JWT token's variables.

For example, we can send a JWT token with the following payload. Of course, in the application sending queries to Cube, we must check the user's access rights and set the appropriate token payload. Authentication and authorization are beyond the scope of this tutorial, but please don't forget about them.

The Security Context window in the Playground view
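
A minimal payload sketch is shown below; the model identifiers are placeholders, and the only requirement is that the models array matches what the queryRewrite configuration below expects:

{
  "models": ["model_a", "model_b"]
}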

After rejecting unauthorized access, we add a filter to all queries.

We can distinguish between the datasets accessed by the user by looking at the data specified in the query. We need to do it because we must filter by the modelName property of the correct table.

In our queryRewrite configuration in the cube.js file, we use query.filters.push to add a modelName IN (model_1, model_2, …) clause to the SQL query:

module.exports = {
  queryRewrite: (query, { securityContext }) => {
    if (!securityContext.models) {
      throw new Error('No models found in Security Context!');
    }
    query.filters.push({
      member: 'errorpercentiles.modelName',
      operator: 'in',
      values: securityContext.models,
    });
    return query;
  },
};



Configuring caching in Cube

By default, Cube caches all Presto queries for 2 minutes. Even though Sagemaker Endpoints store logs in S3 in near real-time, we aren't interested in refreshing the data that often. Sagemaker Endpoints store the logs as JSON files, so retrieving the metrics requires a full scan of all files in the S3 bucket.

When we gather logs over a long time, the query may take a while. Below, we show how to configure caching in Cube. We recommend doing it when the end-user application needs more than one second to load the data.

For the sake of the example, we will refresh the value only twice a day.



Preparing data sources for caching

First, we must allow Presto to store data in both PostgreSQL and S3. It's required because, in the case of Presto, Cube supports only the simple pre-aggregation strategy. Therefore, we need to pre-aggregate the data in the source databases before loading it into Cube.

In PostgreSQL, we grant the permissions to the user account used by Presto to access the database:

GRANT CREATE ON SCHEMA the_schema_we_use TO the_user_used_in_presto;
GRANT USAGE ON SCHEMA the_schema_we_use TO the_user_used_in_presto;

If we haven't modified anything in the AWS Glue data catalog, Presto already has permission to create new tables and store their data in S3, but the schema doesn't contain the target S3 location yet, so all requests would fail.

We must log in to the AWS Console, open the Glue data catalog, and create a new database called prod_pre_aggregations. In the database configuration, we must specify the S3 location for the table content.

If you want to use a different database name, follow the instructions in the Cube documentation.

Adding a new database to AWS Glue data catalog



Caching configuration in Cube

Let's open the errorpercentiles.js schema file. Below the SQL query, we put the preAggregations configuration:

preAggregations: {
  cacheResults: {
    type: `rollup`,
    measures: [
      errorpercentiles.perc_10, errorpercentiles.perc_10_value,
      errorpercentiles.perc_20, errorpercentiles.perc_20_value,
      errorpercentiles.perc_30, errorpercentiles.perc_30_value,
      errorpercentiles.perc_40, errorpercentiles.perc_40_value,
      errorpercentiles.perc_50, errorpercentiles.perc_50_value,
      errorpercentiles.perc_60, errorpercentiles.perc_60_value,
      errorpercentiles.perc_70, errorpercentiles.perc_70_value,
      errorpercentiles.perc_80, errorpercentiles.perc_80_value,
      errorpercentiles.perc_90, errorpercentiles.perc_90_value,
      errorpercentiles.perc_100, errorpercentiles.perc_100_value
    ],
    dimensions: [errorpercentiles.modelName, errorpercentiles.modelVariant],
    refreshKey: {
      every: `12 hour`,
    },
  },
},

After testing the development version, we can deploy the changes to production using the "Commit & Push" button. When we click it, we will be asked to type a commit message:

An empty “Commit Changes & Push” view.

When we commit the changes, the deployment of a new version of the endpoint will begin. A few minutes later, we can start sending queries to the endpoint.

We can also check the Pre-aggregations window to verify whether Cube successfully created the cached data.

Successfully cached pre-aggregations

Now, we can move to the Playground tab and run our query. We should see the "Query was accelerated with pre-aggregation" message if Cube used the cached values to handle the request.

The message that indicates that our pre-aggregation works correctly



Building the front-end application

Cube can connect to a variety of tools, including Jupyter Notebooks, Superset, and Hex. However, we want a fully customizable dashboard, so we will build a front-end application.

Our dashboard consists of two parts: the website and the back-end service. The web part contains only the code required to display the charts. In the back end, we handle authentication and authorization. The back-end service also sends the requests to the Cube REST API.



Getting the Cube API key and the API URL

Before we start, we have to copy the Cube API secret. Open the settings page in Cube Cloud's web UI and click the "Env vars" tab. In the tab, you will see all of the Cube configuration variables. Click the eye icon next to CUBEJS_API_SECRET and copy the value.

The Env vars tab on the settings page.

We also need the URL of the Cube endpoint. To get this value, click the "Copy API URL" link in the top right corner of the screen.

The location of the Copy API URL link.
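
Before writing the back end, we can sanity-check the secret and the URL with a small script. The sketch below assumes Node 18+ (for the built-in fetch) and a hypothetical model name; it signs a short-lived token the same way the back-end service will later and asks Cube only for the model names:

const jwt = require('jsonwebtoken');

// Both values come from the Cube Cloud settings page described above.
const CUBE_URL = '<the API URL copied from Cube Cloud>';
const CUBE_API_SECRET = '<the value of CUBEJS_API_SECRET>';

// The token carries the security context expected by our queryRewrite.
const token = jwt.sign({ models: ['model_a'] }, CUBE_API_SECRET, { expiresIn: '1h' });

fetch(CUBE_URL + '/load', {
  method: 'POST',
  headers: { Authorization: token, 'Content-Type': 'application/json' },
  body: JSON.stringify({ query: { dimensions: ['errorpercentiles.modelName'] } })
})
  .then((res) => res.json())
  .then((body) => console.log(body.data));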



Back end for front end

Now, we can write the back-end code.

First, we have to authenticate the user. We assume that you have an authentication service that verifies whether the user has access to your dashboard and which models they can access. In our examples, we expect these model names in an array stored in the allowedModels variable.

After getting the user's credentials, we have to generate a JWT to authenticate Cube requests. Note that we have also defined a variable for storing the CUBE_URL. Put the URL retrieved in the previous step as its value.

const jwt = require('jsonwebtoken');
CUBE_URL = '';
function create_cube_token() {
  const CUBE_API_SECRET = your_token; // Don't store it in the code!!!
  // Pass it as an environment variable at runtime or use the
  // secret management feature of your container orchestration system

  const cubejsToken = jwt.sign(
    { "models": allowedModels },
    CUBE_API_SECRET,
    { expiresIn: '30d' }
  );

  return cubejsToken;
}

We'll need two endpoints in our back-end service: one returning the chart data and one retrieving the names of the models and variants we can access.

We create a new Express application running in the node server and configure the /models endpoint:

const request = require('request');
const express = require('express')
const bodyParser = require('body-parser')
const port = 5000;
const app = express()

app.use(bodyParser.json())
app.get('/models', getAvailableModels);

app.listen(port, () => {
  console.log(`Server is running on port ${port}`)
})

In the getAvailableModels function, we query the Cube Cloud API to get the model names and variants. It will return only the models we are allowed to see because we have configured the Cube security context.

Our function returns a list of objects containing the modelName and modelVariant fields:

function getAvailableModels(req, res) {
  res.setHeader('Content-Type', 'application/json');
  request.post(CUBE_URL + '/load', {
    headers: {
      'Authorization': create_cube_token(),
      'Content-Type': 'application/json'
    },
    body: JSON.stringify({"query": {
      "dimensions": [
        "errorpercentiles.modelName",
        "errorpercentiles.modelVariant"
      ],
      "timeDimensions": [],
      "order": {
        "errorpercentiles.modelName": "asc"
      }
    }})
  }, (err, res_, body) => {
    if (err) {
      console.log(err);
    }
    body = JSON.parse(body);
    response = body.data.map(item => {
      return {
        modelName: item["errorpercentiles.modelName"],
        modelVariant: item["errorpercentiles.modelVariant"]
      }
    });
    res.send(JSON.stringify(response));
  });
};
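
For clarity, the list returned to the dashboard could look like this (the model and variant names are hypothetical):

[
  { "modelName": "model_a", "modelVariant": "variant_a" },
  { "modelName": "model_a", "modelVariant": "variant_b" }
]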

Let's retrieve the percentiles and percentile buckets. To simplify the example, we show only the query and the response parsing code. The rest of the code stays the same as in the previous endpoint.

The query specifies all the measures we want to retrieve and sets a filter to get the data belonging to a single model variant. We could retrieve all the data at once, but we query one variant at a time.

{
  "question": {
    "measures": [
      "errorpercentiles.perc_10",
      "errorpercentiles.perc_20",
      "errorpercentiles.perc_30",
      "errorpercentiles.perc_40",
      "errorpercentiles.perc_50",
      "errorpercentiles.perc_60",
      "errorpercentiles.perc_70",
      "errorpercentiles.perc_80",
      "errorpercentiles.perc_90",
      "errorpercentiles.perc_100",
      "errorpercentiles.perc_10_value",
      "errorpercentiles.perc_20_value",
      "errorpercentiles.perc_30_value",
      "errorpercentiles.perc_40_value",
      "errorpercentiles.perc_50_value",
      "errorpercentiles.perc_60_value",
      "errorpercentiles.perc_70_value",
      "errorpercentiles.perc_80_value",
      "errorpercentiles.perc_90_value",
      "errorpercentiles.perc_100_value"
    ],
    "dimensions": [
        "errorpercentiles.modelName",
        "errorpercentiles.modelVariant"
    ],
    "filters": [
      {
        "member": "errorpercentiles.modelName",
        "operator": "equals",
        "values": [
          req.query.model
        ]
      },
      {
        "member": "errorpercentiles.modelVariant",
        "operator": "equals",
        "values": [
          req.query.variant
        ]
      }
    ]
  }
}

The response parsing code extracts the number of values in each bucket and prepares the bucket labels:

response = body.data.map(item => {
  return {
    modelName: item["errorpercentiles.modelName"],
    modelVariant: item["errorpercentiles.modelVariant"],
    labels: [
      "<=" + item['errorpercentiles.perc_10_value'],
      item['errorpercentiles.perc_20_value'],
      item['errorpercentiles.perc_30_value'],
      item['errorpercentiles.perc_40_value'],
      item['errorpercentiles.perc_50_value'],
      item['errorpercentiles.perc_60_value'],
      item['errorpercentiles.perc_70_value'],
      item['errorpercentiles.perc_80_value'],
      item['errorpercentiles.perc_90_value'],
      ">=" + item['errorpercentiles.perc_100_value']
    ],
    values: [
      item['errorpercentiles.perc_10'],
      item['errorpercentiles.perc_20'],
      item['errorpercentiles.perc_30'],
      item['errorpercentiles.perc_40'],
      item['errorpercentiles.perc_50'],
      item['errorpercentiles.perc_60'],
      item['errorpercentiles.perc_70'],
      item['errorpercentiles.perc_80'],
      item['errorpercentiles.perc_90'],
      item['errorpercentiles.perc_100']
    ]
  }
})



Dashboard website

In the last step, we build the dashboard website using Vue.js.

If you are interested in copy-pasting working code, we have prepared the entire example in a CodeSandbox. Below, we explain the building blocks of our application.

We define the main Vue component encapsulating the entire website content. In the script section, we download the model and variant names. In the template, we iterate over the retrieved models and generate a chart for each of them.

We put the charts in the Suspense component to allow asynchronous loading.

To keep the example short, we will skip the CSS styling.

<script setup>
  import OwnerName from './components/OwnerName.vue'
  import ChartView from './components/ChartView.vue'
  import axios from 'axios'
  import { ref } from 'vue'
  const models = ref([]);
  axios.get(SERVER_URL + '/models').then(response => {
    models.value = response.data
  });
</script>

<template>
  <header>
    <div class="wrapper">
      <OwnerName title="Test Inc." />
    </div>
  </header>
  <main>
    <div v-for="model in models" v-bind:key="model.modelName">
      <Suspense>
        <ChartView v-bind:title="model.modelName" v-bind:variant="model.modelVariant" type="percentiles"/>
      </Suspense>
    </div>
  </main>
</template>

The OwnerName component displays our client's name. We'll skip its code since it's irrelevant to this example.

In the ChartView component, we use the vue-chartjs library to display the charts. Our setup script contains the required imports and registers the Chart.js components:

import { Bar } from 'vue-chartjs'
import { Chart as ChartJS, Title, Tooltip, Legend, BarElement, CategoryScale, LinearScale } from 'chart.js'
import { ref } from 'vue'
import axios from 'axios'
ChartJS.register(Title, Tooltip, Legend, BarElement, CategoryScale, LinearScale);

We have bound the title, variant, and chart type to the ChartView instance. Therefore, our component definition must contain these properties:

const props = defineProps({
  title: String,
  variant: String,
  type: String
})

Next, we retrieve the chart data and labels from the back-end service. We also prepare the variable containing the label text:

const response = await axios.get(SERVER_URL + '/' + props.type + '?model=' + props.title + '&variant=' + props.variant)
const data = response.data[0].values;
const labels = response.data[0].labels;
const label_text = "Number of prediction errors of a given value"

Finally, we prepare the chart configuration variables:

const chartData = ref({
  labels: labels,
  datasets: [
    {
      label: label_text,
      backgroundColor: '#f87979',
      data: data
    }
  ],
});

const chartOptions = {
  plugins: {
    title: {
      display: true,
      text: props.title + ' - ' + props.variant,
    },
  },
  legend: {
    display: false
  },
  tooltip: {
    enabled: false
  }
}

In the template section of the Vue component, we pass the configuration to the Bar instance:

<template>
  <Bar ref="chart" v-bind:chart-data="chartData" v-bind:chart-options="chartOptions" />
</template>

If we have done everything correctly, we should see a dashboard page with the error distributions.

Charts displaying the error distribution for different model variants



Wrapping up

Thanks for following this tutorial.

We encourage you to spend some time reading the Cube and Ahana documentation.

Please don't hesitate to like and bookmark this post, write a comment, give Cube a star on GitHub, join Cube's Slack community, and subscribe to the Ahana newsletter.
