SQL Basics (Zero to Hero)- Part 01

SQL – Structured Query Language

A table is a collection of related data entries and it consists of columns and rows.

Eg: SELECT * FROM Customers;

--> It retrieves all records (every column and row) from the Customers table.

A semicolon is the standard way to separate SQL statements in database systems that allow more than one statement to be executed in the same call to the server.

Most Important SQL Commands

SELECT - extracts data from a database
UPDATE - updates data in a database
DELETE - deletes data from a database
INSERT INTO - inserts new data into a database
CREATE DATABASE - creates a new database
ALTER DATABASE - modifies a database
CREATE TABLE - creates a new table
ALTER TABLE - modifies a table
DROP TABLE - deletes a table
CREATE INDEX - creates an index (search key)
DROP INDEX - deletes an index

SQL SELECT Statement
The SELECT statement is used to select data from a database.

--> SELECT column1, column2, ... FROM table_name;
--> SELECT * FROM table_name;
--> SELECT DISTINCT column1 FROM table_name;           -- returns only unique values
--> SELECT COUNT(DISTINCT Country) FROM Customers;     -- returns the number of distinct countries

SELECT column1, column2, ...
FROM table_name
WHERE condition;

SELECT * FROM Customers
WHERE Country='India' OR Country='USA';

SQL ORDER BY Keyword
ORDER BY keyword is used to sort the result-set in ascending or descending order.
The ORDER BY keyword sorts the records in ascending order by default. To sort the records in descending order, use the DESC keyword.

Syntax:
SELECT column1, column2, ...
FROM table_name
ORDER BY column1, column2, ... ASC|DESC;

Example:
SELECT * FROM Customers
ORDER BY Country DESC;

SQL INSERT INTO Statement
The INSERT INTO statement is used to insert new records in a table.

INSERT INTO table_name (column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);
Example:
INSERT INTO Customers (CustomerName, City, Country)
VALUES ('Suresh', 'Bengaluru', 'India');

A field with a NULL value is a field with no value.

If a field in a table is optional, it is possible to insert a new record or update a record without adding a value to this field. Then, the field will be saved with a NULL value.

SELECT column_names
FROM table_name
WHERE column_name IS NULL;

SQL UPDATE Statement
The UPDATE statement is used to modify the existing records in a table.

UPDATE table_name
SET column1 = value1, column2 = value2, ...
WHERE condition;
Here, the WHERE clause determines which records will be updated. If you omit the WHERE clause, every record in the table will be updated.
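
For example, using the Customers table from the INSERT example above, the following statement changes the city of a single customer; only rows matching the WHERE condition are touched:

UPDATE Customers
SET City = 'Mumbai'
WHERE CustomerName = 'Suresh';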

SQL DELETE Statement
The DELETE statement is used to delete existing records in a table.

DELETE FROM table_name WHERE condition;

DELETE FROM Customers WHERE CustomerName='Alfreds Futterkiste';

It is possible to delete all rows in a table without deleting the table. This means that the table structure, attributes, and indexes will be intact:

DELETE FROM table_name;

SQL MIN() and MAX() Functions
The MIN() function returns the smallest value of the selected column.

The MAX() function returns the largest value of the selected column.

SELECT MIN(column_name) FROM table_name WHERE condition;

SELECT MAX(column_name) FROM table_name WHERE condition;
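
For example, using the Products table from the BETWEEN example further below (which has a Price column), the lowest price can be found with:

SELECT MIN(Price) AS SmallestPrice FROM Products;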

SQL COUNT(), AVG() and SUM() Functions
The COUNT() function returns the number of rows that match a specified criterion.
The AVG() function returns the average value of a numeric column.
The SUM() function returns the total sum of a numeric column.

SELECT COUNT(column_name) FROM table_name WHERE condition;
SELECT AVG(column_name)   FROM table_name WHERE condition;
SELECT SUM(column_name)   FROM table_name WHERE condition;
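
For example, again assuming the Products table with a Price column:

SELECT COUNT(*) FROM Products;      -- number of products
SELECT AVG(Price) FROM Products;    -- average price
SELECT SUM(Price) FROM Products;    -- total of all prices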

SQL LIKE Operator
The LIKE operator is used in a WHERE clause to search for a specified pattern in a column.

There are two wildcards often used in conjunction with the LIKE operator:

The percent sign (%) represents zero, one, or multiple characters
The underscore sign (_) represents one, single character

SELECT column1, column2 FROM table_name
WHERE columnN LIKE pattern;

Example:
WHERE CustomerName LIKE 'a%'    --> finds any values that start with "a"

SQL IN Operator
The IN operator allows you to specify multiple values in a WHERE clause.

The IN operator is a shorthand for multiple OR conditions.

SELECT column_name(s)
FROM table_name
WHERE column_name IN (value1, value2, ...);
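
For example, the earlier query on the Customers table that used OR can be rewritten with IN:

SELECT * FROM Customers
WHERE Country IN ('India', 'USA');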

SQL BETWEEN Operator
The BETWEEN operator selects values within a given range. The values can be numbers, text, or dates.

The BETWEEN operator is inclusive: begin and end values are included.

SELECT column_name(s)
FROM table_name
WHERE column_name BETWEEN value1 AND value2;

Example:
SELECT * FROM Products WHERE Price NOT BETWEEN 10 AND 20;

SQL GROUP BY Statement
The GROUP BY statement groups rows that have the same values into summary rows, like “find the number of customers in each country”.

The GROUP BY statement is often used with aggregate functions (COUNT(), MAX(), MIN(), SUM(), AVG()) to group the result-set by one or more columns.

SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
ORDER BY column_name(s);
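
For example, to find the number of customers in each country, as mentioned above:

SELECT Country, COUNT(*) AS NumberOfCustomers
FROM Customers
GROUP BY Country
ORDER BY NumberOfCustomers DESC;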

SQL HAVING Clause
The HAVING clause was added to SQL because the WHERE keyword cannot be used with aggregate functions.

SELECT column_name(s)
FROM table_name
WHERE condition
GROUP BY column_name(s)
HAVING condition
ORDER BY column_name(s);
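
For example, building on the GROUP BY query above, HAVING filters the grouped results; here, only countries with more than 5 customers are listed:

SELECT Country, COUNT(*) AS NumberOfCustomers
FROM Customers
GROUP BY Country
HAVING COUNT(*) > 5;
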
**Comments**
Single line comments start with --.
Multi-line comments start with /* and end with */.
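
Example:

-- This is a single-line comment
SELECT * FROM Customers;   -- everything after the double dash is ignored
/* This is a
   multi-line comment */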

Source link

Using Materialize and Redpanda to Analyze Raspberry Pi Temperature Data



Introduction

This is a self-contained demo that uses Materialize to process IoT device data directly from a PostgreSQL server.

The demo builds on the How to build AdonisJS API to store your Raspberry Pi Temperature tutorial.

The data is generated by a Raspberry Pi temperature mock service simulating 50 devices reporting to an AdonisJS API mock service.

Finally, we will create a sink to let us stream the data out of Materialize to a Redpanda topic.

mz-raspberry-pi-temperature diagram



Prerequisites

Before you get started, you need to make sure that you have Docker and Docker Compose installed.

You can follow the steps here on how to install Docker:



Overview

In this demo, we’ll look at monitoring the temperature of a set of Raspberry Pi devices and extracting some insights from them, and streaming the data out to an external source.



Raspberry Pi Mock

The main source of data is a Raspberry Pi mock service that simulates 50 devices reporting their CPU temperature to a mock API service built with AdonisJS.

The mock service generates roughly 25 new requests to the mock API service every second.

For more information on how the mock service works along with the AdonisJS API, you can follow the How to build AdonisJS API to store your Raspberry Pi Temperature tutorial.



API Mock service and PostgreSQL

The API mock service receives the data from the 50 simulated Raspberry Pis and stores the data from each request in a PostgreSQL instance.

The data that is being received with each request is:

  • The name of the Raspberry Pi device.
  • The timestamp when the temperature was measured.
  • The temperature of the device, in celsius.

The Mock API will save all data in a table called sensors. The columns of the sensors table are:

  • name
  • timestamp
  • temperature



Materialize

Materialize presents an interface to ingest the temperature data from the PostgreSQL database.

In this demo, we are going to use Materialize to:

  • Create a PostgreSQL source
  • Materialize the PostgreSQL data, which will all be retained in memory.
  • Provide a SQL interface to query the temperature data. We will connect to Materialize through mzcli, which is our forked version of pgcli.
  • Explore the Materialize data via Metabase.



Running the demo

Clone the repository:

git clone https://github.com/bobbyiliev/mz-raspberry-pi-temperature.git

Access the directory:

cd mz-raspberry-pi-temperature

Build the Raspberry Pi Mock images:

docker-compose build

Start all of the services:

docker-compose up -d



Access Materialize

docker-compose run mzcli



Create Materialize Source:

To create a PostgreSQL Materialize Source execute the following statement:

CREATE MATERIALIZED SOURCE "mz_source" FROM POSTGRES
CONNECTION 'user=postgres port=5432 host=postgres dbname=postgres password=postgres'
PUBLICATION 'mz_source';

A quick rundown of the above statement:

  • MATERIALIZED: Materializes the PostgreSQL source’s data. All of the data is retained in memory and makes sources directly selectable.
  • mz_source: The name for the PostgreSQL source.
  • CONNECTION: The PostgreSQL connection parameters.
  • PUBLICATION: The PostgreSQL publication, containing the tables to be streamed to Materialize.



Create a view:

Once we've created the PostgreSQL source, in order to be able to query the PostgreSQL tables, we need to create views that represent the upstream publication's original tables. In our case, we only have one table called sensors, so the statement that we need to execute is:

CREATE VIEWS FROM SOURCE mz_source (sensors);

To see the available views execute the following statement:

SHOW FULL VIEWS;

Once that is done, you can query the new view directly:

SELECT * FROM sensors;

Next, let’s go ahead and create a few more views.



Creating more materialized views

If you wish, you can enable timing so we can see how long each statement takes to execute:

\timing
  • Example 1: Create a materialized view to show the total number of sensor readings:
CREATE MATERIALIZED VIEW mz_count AS SELECT count(*) FROM sensors;

Querying the mz_count view:

SELECT * FROM mz_count;

Output:

 count
-------
 34565
(1 row)

Time: 2.299 ms

  • Example 2: Create a materialized view to show the average temperature across all sensors:
CREATE MATERIALIZED VIEW mz_total_avg AS SELECT avg(temperature::float) FROM sensors;

Query the mz_total_avg:

SELECT * FROM mz_total_avg;

Output:

        avg
-------------------
 59.02989081226408
(1 row)

Time: 2.984 ms

  • Example 3: Create a materialized view to show the average temperature of each separate sensor:
CREATE MATERIALIZED VIEW average AS
    SELECT name::text, avg(temperature::float) AS temp 
    FROM sensors
    GROUP BY (name);

Let’s again query the average view:

SELECT * FROM average LIMIT 10;

Output:

     name     |        temp
--------------+--------------------
 raspberry-1  | 58.60756530123859
 raspberry-2  | 58.95694631912029
 raspberry-3  | 58.628198038515066
 raspberry-4  | 59.40673999174753
 raspberry-5  | 59.079367226960734
 raspberry-6  | 58.96244838239402
 raspberry-7  | 58.4658871719401
 raspberry-8  | 58.9830811196705
 raspberry-9  | 59.398486896836936
 raspberry-10 | 59.669463513068024
(10 rows)

Time: 2.353 ms

Feel free to experiment by creating more materialized views.



Creating a Sink

Sinks let you send data from Materialize to an external source.

For this demo, we will be using Redpanda.

Redpanda is Kafka API-compatible, so Materialize can process data from it just as it would process data from a Kafka source.

Let's create a materialized view that will hold all of the devices with an average temperature above 60 degrees Celsius:

CREATE MATERIALIZED VIEW mz_high_temperature AS
SELECT * FROM average WHERE temp > 60;

If you were to do a SELECT on this new materialized view, it would return only the devices with an average temperature above 60 degrees Celsius:

SELECT * FROM mz_high_temperature;

Let's create a sink that will send the data from the above materialized view to Redpanda:

CREATE SINK high_temperature_sink
FROM mz_high_temperature
INTO KAFKA BROKER 'redpanda:9092' TOPIC 'high-temperature-sink'
FORMAT AVRO USING
CONFLUENT SCHEMA REGISTRY 'http://redpanda:8081';

Now if you were to connect to the Redpanda container and use the rpk topic consume command, you would be able to read the records from the topic.

However, for the time being, we won't be able to preview the results with rpk because they are AVRO formatted. Redpanda will most likely implement this in the future, but for the moment, we can actually stream the topic back into Materialize to confirm the format.

First, get the name of the topic that has been automatically generated:

SELECT topic FROM mz_kafka_sinks;

Output:

                          topic
------------------------------------------------------------
 high-temperature-sink-u12-1637586945-13670686352905873426

For more information on how the topic names are generated check out the documentation here.

Then create a new Materialized Source from this Redpanda topic:

CREATE MATERIALIZED SOURCE high_temp_test
FROM KAFKA BROKER 'redpanda:9092' TOPIC 'high-temperature-sink-u12-1637586945-13670686352905873426'
FORMAT AVRO USING CONFLUENT SCHEMA REGISTRY 'http://redpanda:8081';

Make sure to change the topic name accordingly!

Finally, query this new materialized source:

SELECT * FROM high_temp_test LIMIT 2;

Now that you have the data in the topic, other services can connect to it, consume it, and then trigger emails or alerts, for example.



Metabase

In order to access the Metabase instance visit http://localhost:3030 if you are running the demo locally or http://your_server_ip:3030 if you are running the demo on a server. Then follow the steps to complete the Metabase setup.

To connect to your Materialize database, specify the following connection properties:


Field | Value
----------------- | ----------------
Database | PostgreSQL
Name | user_reviews
Host | materialized
Port | 6875
Database name | materialize
Database username | materialize
Database password | Leave empty

Once ready you will be able to visualize your data just as you would with a standard PostgreSQL database.

Metabase



Conclusion

This is a simple example of how to use the direct PostgreSQL connection with Materialize and stream data into a Kafka/Redpanda topic.

In most cases, you would not store your IoT device data in a PostgreSQL database, but in an S3 bucket or a Kafka topic. So the setup could be similar to the following:

  • S3 Bucket example:
    IoT devices data with Materialize and S3

  • Redpanda/Kafka example:
    IoT devices data with Materialize and Redpanda



Stopping the Demo

To stop all of the services run the following command:


docker-compose down





Source link

SQL vs NoSQL Database – A Complete Comparison

There are two main categories of databases in use in the development world today, commonly referred to as SQL and NoSQL. In this article, we will compare SQL and NoSQL databases based on their pros and cons.

SQL, or Structured Query Language, is the universally known query language for relational databases. SQL databases make it simpler to work with structured data in a database through CRUD operations. CRUD stands for create, retrieve (or read), update, and delete – the primary operations for manipulating data.

SQL databases are commonly referred to as relational database management systems (RDBMS). Traditional RDBMS uses SQL syntax as these systems utilize row-based database structures that connect related data objects between tables. Examples of RDBMS SQL databases include Backendless, Microsoft Access, MySQL, Microsoft SQL Server, SQLite, Oracle Database, IBM DB2, etc.

NoSQL databases, on the other hand, are databases without any structured tables fixed for holding data. Technically, all non-relational databases can be called NoSQL databases. Because a NoSQL database is not a relational database, it can be set up very quickly and with minimal pre-planning. Examples of NoSQL databases include MongoDB, DynamoDB, SimpleDB, CouchDB, CouchBase, Orient DB, Infinite Graph, Neo4j, FlockDB, Cassandra, HBase, etc.

As of May 2021, five of the top six database systems according to the DB-Engines ranking are relational databases, including the top four – Oracle, MySQL, Microsoft SQL Server, and PostgreSQL.

DB-Engine May 2021 Ranking

In this article, we will take a deep dive into the pros and cons of SQL and NoSQL databases.

Contents

1. SQL Pros

2. SQL Cons

3. NoSQL Pros

4. NoSQL Cons

5. Conclusion



SQL pros

Broadly speaking, SQL databases require more advance preparation and planning of the relational model, but the benefit is that your data will be consistent and clean. The relational model represents how data is stored in the database, such as how each table is structured and related to other tables.



Standardized schema

Although SQL databases with standardized schemas are typically more rigid and difficult to modify, they still have many benefits. Every data object added to the database must conform to the recognized schema of linked tables (comprising rows and columns). While some may find this restrictive, it is essential for data compliance, integrity, consistency, and security.



A large number of users

SQL is an established programming language that is very widely used. It has a large user community, including countless experts well-versed in established best practices. Developing a strong working knowledge of SQL gives application developers numerous opportunities to consult, collaborate, and sharpen their skills.



ACID compliance

SQL databases are ACID compliant (described in detail below) thanks to how relational database tables are precisely structured. This helps ensure that tables are in-sync and transactions are valid. It is the best choice when running applications with no room for error. An SQL database supports a high data integrity level.

SQL database ACID

ACID properties:

  • Atomicity: All data and transactional changes are completely executed as a single operation. No changes are performed if that is not possible.
  • Consistency: Data must be consistent and valid at the beginning and completion of a transaction.
  • Isolation: Transactions do not interfere with one another; they behave as though they are happening consecutively.
  • Durability: Once a transaction is complete, its connected data is permanent and cannot be altered.

Let's look at an inventory management system as an example. For such a system, it is important that items are removed from inventory as soon as they are purchased, to prevent overstock or understock issues. When an order is placed, the inventory is updated, a new shipment data object is created, payment information is updated, and the customer information is updated. All of these related tables must be updated in unison for the transaction to complete.
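
As a rough sketch of what such a transaction could look like in SQL (the table and column names below are hypothetical, not taken from any specific system):

BEGIN;
UPDATE inventory SET quantity = quantity - 1 WHERE item_id = 42;       -- hypothetical tables and columns
INSERT INTO shipments (order_id, status) VALUES (1001, 'scheduled');
UPDATE payments  SET status = 'captured'  WHERE order_id = 1001;
UPDATE customers SET last_order_id = 1001 WHERE customer_id = 7;
COMMIT;

If any statement fails, the whole transaction is rolled back, so the inventory, shipment, payment, and customer tables never get out of sync.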

Learn more about relational database management with Backendless using our Shipping and Tracking app blueprint.



Requires little to no code

SQL is a developer-friendly language. It uses plain English keywords, making it easy to learn to manage and query a relational database with little to no code.

Backendless Database queries, for example, can be written using SQL. Additionally, SQL terminology is used to craft precise API calls to access and modify data. Using Database Views, you can create these queries visually, making it even easier for those without a background in writing SQL queries.



SQL cons



Hardware

SQL databases have historically required that you scale up vertically. This meant you could only expand capacity by increasing capabilities, such as CPU, SSD, and RAM, on the existing server or by purchasing a larger, costlier one.

As your data continues to grow, you’ll invariably need to constantly increase hard drive space and require faster and more efficient machines to run newer and more advanced technologies. With this, hardware can quickly become obsolete.

Modern SQL databases may use a process called sharding. Sharding allows for horizontal scaling by separating, or partitioning, data among multiple tables (typically spread across multiple servers) with identical schemas. Rather than storing 100,000 objects in one table, for example, sharding creates two tables with identical schemas that each store 50,000 objects, with no duplication between them.

Of course, utilizing a serverless hosting service such as Backendless can alleviate the scaling concern. The Backendless system is designed to manage scaling automatically for you, so that you don’t have to worry about physical server management while achieving database efficiency at scale.



Rigidity

A traditional relational model, or schema, of a SQL database has to be defined before use. Once this is done, they become inflexible, and any adjustment can become resource-intensive and difficult. Due to this, significant time should be invested in planning before putting the database into production.

With Backendless, however, developers can always modify schema even after their app is launched. New tables and columns can be added, relations established, etc., providing greater flexibility than a traditional SQL database. This makes the Backendless system well suited for early product development as you are not locked into a schema at the beginning of the development process.



Data Normalization

Relational databases were developed with the goal of eliminating data duplication. Each table holds different information, and this information can be queried and connected using common values. But when SQL databases become large, the joins and lookups needed across several tables can slow things down considerably.

To put it more simply, relational databases commonly store related data in different tables. The more tables that hold data needed for a single query, the more processing power is needed to complete that query without the system slowing down significantly.



Traditionally resource-intensive upgrade and scaling

As previously mentioned, vertical scaling of SQL databases is done by expanding your hardware investment. This is costly and time-consuming to do on your own. Some organizations try to scale horizontally through partitioning, but this added complexity increases the resources and time expended; it will likely involve coding and require highly skilled, well-paid developers.

Systems like Backendless, however, are designed to manage the scaling process for you automatically. This is often referred to as infrastructure as a service, or IaaS, and is far less expensive than managing infrastructure yourself. IaaS providers handle the difficult tasks of server maintenance and resource allocation for you so that you can focus on building a great product without worrying about what will happen when your database grows.



NoSQL pros



Query speed

NoSQL data is denormalized. Therefore, with no fear of data duplication, all the information needed for a specific query is often stored together, which means that joins are not required. As a result, lookups are easier when dealing with large volumes of data, and NoSQL is very fast for simple queries.



Continuous availability

In a NoSQL database, data is distributed across different regions and multiple servers, meaning there is no single point of failure. This makes NoSQL databases more resilient and stable, with zero downtime and continuous availability.



Agility

This type of database gives developers the flexibility to improve their productivity and creativity. They are not bound by rows and columns, and their schemas do not have to be predefined. They are dynamic, so they can handle all data types, including polymorphic, semi-structured, structured, and unstructured data.

(Image: SQL vs. NoSQL database flexibility)

Application developers can come in and start building a database without needing to spend time and effort on upfront planning. This allows for quick modifications when requirements change or a new data type needs to be added. Such flexibility makes NoSQL a perfect fit for companies with varying data types and constantly changing features.



Low-cost scaling

It is cost-effective to expand capacity because a NoSQL database scales horizontally. Instead of upgrading costly hardware, you can expand cheaply by simply adding cloud instances or commodity servers. Also, many open-source NoSQL databases offer cheap options for many companies.



NoSQL cons



No standardized language

There is no fixed language for NoSQL queries; the syntax used to query data varies between NoSQL database types. Unlike SQL, where there is only one language to learn, NoSQL has a higher learning curve. Similarly, it can be more difficult to find experienced developers with knowledge of the particular NoSQL system you have implemented, so it is more likely that you will need to train new hires, increasing onboarding time and cost.



Inefficiency in conducting complex queries

Querying isn't very efficient due to the numerous data structures present in NoSQL databases, and there is no standard interface for complex queries. Even simple NoSQL queries might require programming skills because of the way your data is structured. As a result, costlier and more technical staff might be needed to perform the queries. This is one of the major NoSQL limitations, particularly for less technical (i.e. no-code) developers.



A smaller number of users

Developers are adopting NoSQL databases more and more, and the community is growing quickly. However, it is still not as mature as the SQL community, and with fewer experts and consultants, it can be harder to solve undocumented issues.



Inconsistency in data retrieval

Data is quickly available thanks to the distributed nature of the database. However, it can also be harder to ensure that the data is always consistent. Sometimes queries might not return updated data or accurate information, because the distributed approach makes it possible for the database to return different values on consecutive reads, depending on which server is queried.

This is a major reason why NoSQL is not ACID compliant: the "C", Consistency, requires that data be consistent and valid at the beginning and completion of a transaction. Instead, many NoSQL databases are BASE compliant, where the "E" signifies Eventual Consistency. NoSQL prioritizes availability and speed over consistency. Inconsistency in data retrieval is one of the major drawbacks of NoSQL databases.



Conclusion – Considering your options

Both SQL and NoSQL databases are used in meeting specific needs. Depending on the goals and data environment of an organization, their specific pros and cons could be amplified.

A common misconception is that it is bad to use both technologies together. As a matter of fact, you can use both together so that each database type plays to its strengths. Many companies use both databases within their cloud architecture, and some even use both within the same application.

In the end, it is all about weighing your options and going with the preferred choice that best suits your needs.

Thanks for reading, and Happy Codeless Coding!


Source link

How I created a mini ORM with python?



Introduction

When you code with the Django web framework, you don't work with the database directly: an ORM (object-relational mapping) layer interacts with the database through migrations and SQL operations. So, in this tutorial I want to show you how to implement an ORM manager from scratch.

Object–relational mapping (ORM) in computer science is a technique for converting data between incompatible type systems using object-oriented programming languages in order to create virtual database objects.

Python is a programming language that enables you to design freely anything that you want.

And honestly, I was able to implement a mini ORM manager in merely 3 hours.



Agenda



Project Design

Our Project is divided into 3 main components:

  • A connection manager to connect directly with a database using SQL commands (we used SQLite3 in our case).
  • A model manager or model file that contains all the definition of the models we need to migrate to a database.
  • A simple command manager that enables the user to input a command (in the command prompt).

The Project files are also 3 python files (check this Github repository ):

Python file        | Functionality
------------------ | -----------------
migrate_manager.py | Migration manager
base.py            | Model manager
db_manager.py      | Database manager



Database manager

We mentioned before that we will use an SQLite database. Thanks to Python's built-in sqlite3 library, we can connect to the database directly from a Python script. If you check the documentation of the sqlite3 library, you will notice that it is very easy to work with SQLite from Python.

Just a few lines of Python code will do the job. You need to instantiate an sqlite3 connection object, specifying the database file "example.db", then create a cursor to execute SQL commands, without forgetting to commit the changes and close the connection.

First import the sqlite3 library and instantiate the connection:

import sqlite3
con = sqlite3.connect('example.db')

Then open a cursor, execute the commands you want and finally close the connection:

cursor = con.cursor()
cursor.execute('''CREATE TABLE stocks (date text, trans text, symbol text, qty real, price real)''')
cursor.execute("INSERT INTO stocks VALUES ('2006-01-05','BUY','RHAT',100,35.14)")
con.commit()
con.close()

However, this kind of implementation won't be reliable in our project, because there would be too much redundancy in this block of code. Imagine repeating it for every SQL operation, including CREATE, DELETE, UPDATE, and RETRIEVE.

In fact, this problem has a solution: Python has something called a context manager. A context manager is a way to manage resources precisely, without extra cost in complexity or memory.

In order to make a context manager that can connect with a database, we make a class called ConnectionSqliteManager that opens an SQL connection and instantiates the cursor object.

In this class, we need to write the __enter__ and __exit__ magic methods, the necessary ingredients for database resource management.

So the structure of the context manager class is as follows :

  • The __enter__ method connects to the example.db file (setup operation) and returns the manager instance, which is bound to the variable Connection.
  • The __exit__ method takes care of closing the connection when exiting the with block (teardown operation).

Inside the with block, you work with the database without the bother of opening and closing the connection yourself.

import sqlite3 as sql3

class ConnectionSqliteManager:
    def __init__(self, filename):
        self.filename = filename

    def __enter__(self):
        print("Connection started ...")
        self.connection = sql3.connect(self.filename)
        self.cursor = self.connection.cursor()
        return self

    def __exit__(self, type, value, traceback):
        print("Connection ended ...")
        self.connection.close()

with ConnectionSqliteManager("example.db") as Connection:
    # Do what you want with the Connection instance
    Connection.cursor.execute("SELECT 1")



TIP:

You can write a decorator (inside the ConnectionSqliteManager class) to wrap each SQL operation method and commit the changes.

Never forget to add sqlite3 error handling around each command execution.

# inside the ConnectionSqliteManager class
def commit(operation):
    def wrapper(self, *args, **kwargs):
        operation(self, *args, **kwargs)
        self.connection.commit()
    return wrapper

@commit
def Sql_operation(self):
    # execute an SQL command here
    pass



Model manager

A model is a datastore entity that has a key and a set of properties. A model is also a Python class that inherits from the Model class. The model class defines a new kind of datastore entity and the properties that kind is expected to take.

I won't dive into too much implementation detail here. All you have to do is define a model class like the following:

class Model(baseModel):
    base_model = base
    tablename = "Model"
    fields = ("field_1", "field_2")

In this code, you need to specify the name of the table of the database and its fields.

In order to prepare your model for migration, you have to add it in the model_list list:

model_list = [Model,
              ]



Command manager

Now let's give the user a good interface in the command prompt. To implement it, we use the argparse standard library to let the user pass arguments on the command line and execute the migration.

All the details are in the migrate_manager.py file; there is no need to explain each line of code. I made it as simple as possible.

So all you have to do is to execute this command:

python migrate_manager.py migrate

The output is as follows:

Begin database Migration ...

Model Migration
Connection started ...
Model: created successfully!
2022-01-04 02:29:53.402991: Commit is successful!!
Connection ended ...



Conclusion

It's good to implement a technique like an ORM from scratch; it will help you understand the concepts behind it and learn the related technologies quickly.


Source link

DEMO: Relational Data And Custom Codeless APIs (App Blueprint)

Backendless utilizes relational data models to improve database performance and allow users to recreate real-world relationships. The Shipping and Tracking App Blueprint showcases these capabilities in action, with the help of custom Codeless APIs.

App Blueprints are more than just templates from which to build a specific type of application. We design all of our free App Blueprints to provide one or more key functions that a developer would want to add to their application, regardless of the app’s business use.

The free Shipping and Tracking App Blueprint showcases how to present data from multiple related tables using custom APIs. The entirety of this app blueprint was built without code using Backendless’ UI Builder.

The app heavily relies on relational data to connect table objects in a way that best represents the real-world. To access the data, the app utilizes several custom APIs to present this data to users and administrators.

In this article, we will explore the basics of this functionality in the sections below.

Additionally, all App Blueprints include thorough instructions to guide you through how to make your own modifications and customizations.

The web app, which functions identically to a website, can be purchased for free when creating a new app in Backendless Console. It features a beautiful, responsive design that will look great on any size screen.

Click here to view the live demo and follow along (opens in a new tab).



Utilizing data relations to connect a wide range of objects in different tables

Database visual schema model

At first glance, the database of this application can appear very complex. For example, when we look at the delivery table tree, each table has at least one relation connecting it to another table in the data model.

When we dig down into the structure, however, we can understand the logic behind this design. Let’s break down each step.

  • Each Delivery has a Dispatcher assigned and an order in the Delivery2Order table.
    Delivery table to Dispatcher and Delivery2Order tables
  • Delivery2Order contains the OrderShipment information.
    Delivery2Order table to OrderShipment table
  • OrderShipment information includes the User that created the order, the OrderSender and OrderReceiver, and the Rate paid for the shipment.
    OrderShipment table to Users, OrderSender, OrderReceiver, and Rate tables
  • The OrderSender and OrderReceiver both have a Country and may have a State related as well.
    OrderSender table to State and Country tables

Each table contains pieces of information that, when combined, form the whole of the shipment details.

In order to ensure data integrity across all shipments, relations are used. One reason is that the admin can change information in one table – for example, updating shipping rates – and automatically update all relevant orders. Similarly, individual elements of an order cannot be modified or deleted without impacting the entire record tree.

Finally, using relational data tables allows for easier indexing of data objects. This means that your database will query faster than a NoSQL-style database, such as Google's Firebase.



Building custom Codeless APIs to interact with related data tables

Custom Codeless APIs

The Shipping and Tracking app utilizes six custom API services and 24 custom functions in UI Builder. All of these are built without code.

The custom Codeless API services can be found by navigating to the Backend -> Business Logic -> Codeless, as pictured above. When using a custom API with Codeless, you can see all available API blocks by scrolling to the bottom (lower left-hand corner, pictured above).

In this app, the custom APIs and their related methods are designed to allow the developer to quickly access important data within the data model. Once built, these APIs can be easily inserted into other functions or even new APIs using automatically-generated Codeless blocks.



Displaying related data on the frontend using UI Builder

Once you understand the data model and how it is being utilized by the custom API services, you will be able to put these elements into action on the frontend.

Note: The name of the page you are viewing can be found in the URL you are on. For instance, if you are navigating through the demo of the app and want to dive deeper into a page in your Backendless Console, you can find the page name here:

https://nimblepunishment.backendless.app/app/index.html?page=adminNavRates&data=%7B%7D



Admin dashboard

Admin dashboard - orders

The admin dashboard provides three sections: Orders, Dispatches, and Rates. Each of these sections contains functionality for the administrator.

  • In the Orders section, the admin can view order details or select one or more items to schedule for delivery or pickup.
  • In the Dispatches section, the admin can mark items as completed, complete with date and time delivered, or add new dispatchers.
  • In the Rates section, the admin can add new rates, including origin and destination locations, and modify existing rates.

Admin dashboard - dispatches

By visiting the page in UI Builder, you can see the logic that is used to achieve each function. If you have additional questions about how a page or function works, feel free to ask on our support forum and we’ll be happy to help.



Shipment tracker

Shipment tracker

The shipment tracking page (TrackPage) is where all of your data is brought together to be displayed for the end user. This page shows how each shipment is compiled into a single data record and then populated on the screen using Dynamic List Behaviour.



User portal and shipment creator

User portal

The user portal serves two primary functions:

  1. Listing all shipments the user has made, which each link to the shipment tracker for the order
  2. Enabling the user to schedule a new shipment

For reference, the tracking ID for each shipment is the objectId found in the OrderShipment table.

User shipment creator

Note: This application does not include payment processing. However, you can easily add payment processing to the order page using our free Stripe plugin.



Closing

The Shipping and Tracking App is a beautiful example of logistics functionality in an ordering and shipping environment. More importantly, it provides Backendless users with a template to follow for a variety of key Backendless functions.

Using this template, you can see first-hand how to:

  • Utilize data relations to connect objects in different tables
  • Build custom Codeless APIs to interact with related data tables
  • Utilize custom functions to display related data on the frontend using UI Builder

Happy Codeless Coding!

This article was first published on backendless.com on December 7, 2021.


Source link

What Is Trino And Why Is It Great At Processing Big Data

Big data is touted as the solution to many problems. However, the truth is, big data has caused a lot of big problems.

Yes, big data can provide more context and insights we have never had before. It also makes queries slow and data expensive to manage, requires a lot of expensive specialists to process, and just continues to grow.

Overall, data, especially big data, has forced companies to develop better data management tools to ensure data scientists and analysts can work with all their data.

One of these companies was Facebook, which decided it needed to develop a new engine to process all of its petabytes effectively. That tool was called Presto, which recently spun off into another project called Trino.

In this article, we outline what Trino is, why people use it and some of the challenges people face when deploying it.



Presto Trino History

Before diving into what Trino is and why people use it, let's recap how Presto turned into Trino.



What Is Trino

Let's be clear: Trino is not a database. This is a common misconception. Just because you use Trino to run SQL against data doesn't mean it's a database.

Instead, Trino is a SQL engine. More specifically, Trino is an open-source distributed SQL query engine for ad hoc and batch ETL queries against multiple types of data sources. It can also handle a whole host of standard and semi-structured data types like JSON, arrays, and maps.

Another important point to discuss about Trino is the history of Presto. In 2012 Martin Traverso, David Phillips, Dain Sundstrom and Eric Hwang were working at Facebook and developed Presto to replace Apache Hive to better process the hundreds of petabytes Facebook was trying to analyze.

Due to the creators' desire to keep the project open and community-based, they open sourced it in November 2013.

But because Facebook wanted tighter control over the project, there was an eventual split.

The original creators of Presto decided that they wanted to keep Presto open-source and in turn pursued building the Presto Open Source Community full-time. They did this under the new name PrestoSQL.

Facebook decided to build a competing community using The Linux Foundation®. As a first action, Facebook applied for a trademark on Presto®. This eventually led to litigation and other challenges that forced the original group who developed Presto to rebrand.

Source

Starting in December 2020 PrestoSQL was rebranded as Trino. This has been a little confusing but now Trino supports a lot of end-users and has a large base of developers that commit to it regularly.



How does Trino work?

Trino is a distributed system that utilizes an architecture similar to massively parallel processing (MPP) databases. Like many other big data engines there is a form of a coordinator node that then manages multiple worker nodes to process all the work that needs to be done.

An analyst or general user runs their SQL, which gets pushed to the coordinator. The coordinator then parses, plans, and schedules a distributed query across the workers. Trino supports standard ANSI SQL and also lets users run more complex transformations, such as JSON and MAP parsing.
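
As a small illustration, here is a sketch of a query that mixes standard ANSI SQL with JSON and MAP handling; the events table and its columns are made up for this example:

SELECT
  json_extract_scalar(payload, '$.device.os') AS device_os,  -- pull one field out of a JSON string column
  element_at(properties, 'campaign') AS campaign,            -- read one key from a MAP(varchar, varchar) column
  count(*) AS events
FROM events                                                  -- hypothetical table
WHERE event_date = DATE '2021-11-01'
GROUP BY 1, 2;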



Why People Use Trino

Trino, being a Presto spin-off, has a lot of benefits that come from being developed by a large data company that needed to easily query across multiple data sources without spending too much time on ETL. In addition, it was developed to scale on cloud-like infrastructure (although most of Facebook's infrastructure is based on its internal cloud). Let's dig into why people are using Trino.



Agnostic Data Source Connections

There are plenty of options when it comes to how you can query your data. There are tools like Athena, Hive and Apache Drill.

So why use Trino to run SQL?

Trino provides many benefits for developers. The biggest advantage of Trino is that it is just a SQL engine, meaning it sits agnostically on top of various data sources like MySQL, HDFS, and SQL Server.

So if you want to run a query across these different data sources, you can.

This is a powerful feature that eliminates the need for users to understand connections and SQL dialects of underlying systems.
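
For example, a single federated query can join a table exposed through the Hive connector with one living in MySQL. The catalog, schema, and table names below are only placeholders:

SELECT o.order_id, o.total, c.country
FROM hive.sales.orders AS o        -- data sitting on HDFS, exposed via the Hive connector
JOIN mysql.crm.customers AS c      -- data living in a MySQL database
  ON o.customer_id = c.id
WHERE o.order_date >= DATE '2021-01-01';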



Cloud-focused

Presto's fundamental design of separating storage and compute makes it extremely convenient to operate in cloud environments. Since the Presto cluster doesn't store any data, it can be auto-scaled depending on the load without causing any data loss.



Use Cases For Trino

Adhoc queries And Reporting — Trino allows end-users to use SQL to run ad hoc queries where your data resides. More importantly, you can create queries and data sets for reporting and adhoc needs.

Data Lake Analytics — One of many of the common use cases for Trino is being able to directly query data on a data lake without the need for transformation. You can query data that is structured or semi-structured in various sources. This means you can create operational dashboards without massive transformations.

Batch ETLs — Trino is a great engine to run your ETL batch queries. That’s because it can process large volumes of data quickly as well as bring in data from multiple sources without always needing to extract data from sources such as MySQL.



Challenges For Trino Users

At this point, you might be assuming that everyone should use Trino. But it's not that simple. For one, using Trino requires a fair amount of setup. There are also a few other issues you might deal with once it is set up.



Federated Queries Can Be Slow

The one downside of federated queries is that there can be some trade-offs in speed. This can be caused by a lack of metadata stored and managed by Trino to better plan queries. In addition, Presto was initially developed at Facebook, which essentially has its own cloud; expanding and growing it as they need more speed isn't a huge problem for them. Other organizations, however, might need to spend a lot more money adding machines to their clusters to get the same level of performance. This can become very expensive, all to manage unindexed data.

One such example is Varada. Varada indexes data in Trino in a way that reduces the CPU time spent scanning data (via indexes) and frees the CPU for other tasks, like fetching data extremely fast or dealing with concurrency. This allows SQL users to run various queries, whether across dimensions, facts, or other types of joins and data lake analytics, on indexed data as federated data sources. SQL aggregations and grouping are accelerated using nanoblock indexes as well, resulting in highly effective SQL analytics.



Configuration And Set-Up

Setting up Trino isn't straightforward, especially when it comes to optimizing performance and management. In turn, many organizations will need dedicated system admins and IT teams to both set up and manage their instances of Trino.

One great example of this is an AWS article titled “Top 9 Performance Tuning Tips For PrestoDB“.



Lack Of Enterprise Features

One of the largest challenges faced by companies using Trino is that, out of the box, there aren't a lot of features geared toward enterprise use. That is to say, features around security, access control, and even expanded data source connectivity are limited. Many solutions are trying to fill this gap.

A great example of this is Starburst Data.

Starburst Enterprise has several features that help fill in Trino's missing security features. For example, Starburst makes it easy for your team to set up access control.

The access control systems all follow role-based access control mechanisms with users, groups, roles, privileges and objects.


This makes it easy for your security teams and data warehouse administrators to manage who has access to what data.

Starburst also offers other helpful security features such as auditing and encryption.

This enables companies to implement a centralized security framework without having to code their own modules for Trino.



Conclusion

Big data will continue to be a big problem for companies that don't start looking at tools like Trino to help them manage all their data. Trino's ability to act as an agnostic SQL engine that can query large data sets across multiple data sources makes it a great option for many of these companies. But as discussed, Trino is far from perfect. It isn't fully optimized in a way that lets enterprise companies take advantage of its full abilities. In addition, Trino's brute-force approach to speed sometimes comes at a cost: it becomes very expensive to get the beneficial speed without indexing.

This is where many new solutions are coming into the fold to make Trino more approachable. In the end, big data can be maintained and managed, you just need the right tools to help set yourself up for success.

If you enjoyed this article, then check out these videos and articles below.

Data Engineer Vs Analytics Engineer Vs Analyst

Why Migrate To The Modern Data Stack And Where To Start

5 Great Data Engineering Tools For 2021 — My Favorite Data Engineering Tools

4 SQL Tips For Data Scientists

What Are The Benefits Of Cloud Data Warehousing And Why You Should Migrate


Source link

SQL Joins – Inner Join vs Outer Join

When storing information in a database, sometimes it will be necessary to link two tables that store information that is related.

There are multiple types of joins in SQL, but the main two categories are INNER JOIN and OUTER JOIN.

INNER JOIN:
In the simplest terms, inner joins will only keep the information from the tables that is related. If you picture a Venn diagram, the inner join will only keep the information that is inside the overlapping part of the diagram.

The syntax for this is:

SELECT * FROM table1
JOIN table2
ON relation;
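
For example, assuming a hypothetical customers table and an orders table related through a customer id column, an inner join returns only customers that have at least one order (and only orders with a matching customer):

SELECT customers.name, orders.order_date
FROM customers
JOIN orders
ON customers.id = orders.customer_id;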

OUTER JOIN:
Outer joins will link the two tables, while also keeping unrelated data. There are three main types of outer joins in SQL: LEFT JOIN, RIGHT JOIN, and FULL JOIN.

LEFT JOIN will keep the unrelated data from the left (first) table. Picturing a Venn diagram, this join will keep the data that is in the overlapping part of the diagram, as well as the data in the outside left circle. This is the most commonly used outer join.
The syntax for a LEFT JOIN is:

SELECT columns
FROM table1
LEFT JOIN table2
ON relation;
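
Using the same hypothetical customers and orders tables from the inner join example, a LEFT JOIN keeps every customer, with NULLs in the order columns for customers that have no orders:

SELECT customers.name, orders.order_date
FROM customers
LEFT JOIN orders
ON customers.id = orders.customer_id;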

RIGHT JOIN will link the related data of the two tables, while keeping the unrelated data of table two (right table). The syntax for a RIGHT JOIN is the same as LEFT JOIN, but using the word ‘right’ instead.

A FULL JOIN will join the related information of the two tables, while keeping the unrelated information from both.

The syntax for a FULL JOIN is:

SELECT columns
FROM table1
FULL JOIN table2
ON relation;


Source link

A Guide to Data Analytics with DuckDB

The life of a data analyst revolves around loading data through SQL servers, analyzing it using various Python packages, and finally creating the technical report. Sometimes these SQL queries can take a long time to process because we are dealing with terabytes of data. In this fast-paced world that strategy fails miserably, and many analysts are moving away from traditional ways of doing data analytics. DuckDB addresses these issues: it is integrated with Python and R, works similarly to SQLite, and focuses on fast analytical queries.



Sample Code

import duckdb

res = duckdb.query("""SELECT sex, SUM(expected_recovery_amount) as Expected,
    SUM(actual_recovery_amount) as Actual
    FROM 'bank_data.csv'
    WHERE recovery_strategy LIKE 'Level 4 Recovery'
    GROUP BY sex
    HAVING sex = 'Female'""")
res.df()

code result
Read full guide on Analytics Vidhya


Source link

Data Access in F#

Hello there, this is the next entry in Simple Things F#.

Today we'll talk about database access. Databases are something we use very often; after all, they are where we store our data. The most common kinds of databases we use are:

  • SQL Databases
  • NoSQL Databases

I won't dive deep into the differences between them; rather, I will focus on how you can access these databases from F# code.

As in previous entries I’ll be using F# scripts which can be executed with the .NET CLI that comes in the .NET SDK which you can get from here: https://get.dot.net

Let’s review our options for today

  • Dapper.FSharp – Dapper Wrapper (i.e. anything that MSSQL, PostgreSQL, MySQL supports)
  • Donald – ADO.NET wrapper (i.e. MSSQL, PostgreSQL, SQLite, MySQL, and others)
  • Mondocks – MongoDB DSL for the MongoDB .NET Driver

We will not complicate things; we will work with a simple DB schema, and we will be using PostgreSQL since it's a pretty common database used around the world. But please keep in mind that these solutions (and others that I will share at the end) work with MSSQL and MySQL as well.

If you have Docker installed, spin up a PostgreSQL instance:

docker run -d \
  --name my-instance-name \
  -e POSTGRES_PASSWORD=Admin123 \
  -e POSTGRES_USER=admin \
  -p 5432:5432 \
  postgres:alpine

This will be our little schema, nothing fancy something just to get started with some F# code.

create table authors(
    id uuid primary key,
    name varchar(100),
    email varchar(100),
    twitter_handle varchar(100) null
);
create table posts(
    id uuid primary key,
    title varchar(140) not null,
    content text not null,
    authorId uuid references authors(id)
);

You can create the PostgreSQL database using any DB manager you already know. In case you don't have anything available, you can use DBeaver.

Once you have your database created and have the schema in place, let's begin with the cool stuff.



Dapper.FSharp

If you like ORMs, this is going to be the library for you: you can map records to tables, so using them is seamless, and Dapper.FSharp also adds a couple of F# types to make your life easier.

Let’s check what are our F# records going to be:

type Author =
    { id: Guid
      name: string
      email: string
      twitter_handle: string option }

type Post =
    { id: Guid
      title: string
      content: string
      authorId: Guid }

Here we just did a 1-to-1 record translation. More complex schemas may differ from what your application is using; you can use DTOs or anonymous records to work with these differences.

// From F# 5.0 + you can "require" NuGet packages in F# scripts
#r "nuget: Npgsql"
#r "nuget: Dapper.FSharp"

open System
open Npgsql
open Dapper.FSharp
open Dapper.FSharp.PostgreSQL
// register our optional F# types
OptionTypes.register ()

type Author =
    { id: Guid
      name: string
      email: string
      twitter_handle: string option }

// register our tables
let authorTable =
    // we can use this function to match tables
    // with different names to our record definitions
    table'<Author> "authors" |> inSchema "public"

let connstring =
    "Host=192.168.100.5;Username=admin;Password=Admin123;Database=simple_fsharp"
/// In normal circumstances you would write
/// `use conn = new NpgsqlConnection(connstring)`
/// but inside F# scripts we're not allowed top-level declarations like this,
/// so we use let instead
let conn = new NpgsqlConnection(connstring)

// Generate two different authors
// one with an optional handle to see how we can deal with null values
let authors =
    [ { id = Guid.NewGuid()
        name = "Angel D. Munoz"
        email = "some@email.com"
        twitter_handle = Some "angel_d_munoz" }
      { id = Guid.NewGuid()
        name = "Misterious Person"
        email = "mistery@email.com"
        twitter_handle = None } ]

// If you were to use ASP.NET core
// you would be running inside a task or async method
task {
    let! result =
        insert {
            into authorTable
            values authors
        }
        |> conn.InsertAsync

    /// If all goes well you should see
    /// `Rows Affected: 2` in your console
    printfn $"Rows Affected: %i{result}"
}
// we're inside a script hence why we need to run it synchronously
// most of the time you don't need this
|> Async.AwaitTask
|> Async.RunSynchronously

To Run this, copy this content into a file named script.fsx (or whatever name you prefer) and type:

  • dotnet fsi script.fsx

If you get a message like “warning FS3511: This state machine is not statically compilable.” don’t worry it is being tracked in https://github.com/dotnet/fsharp/issues/12038

Cool! So far we have inserted two authors into our database using our mapping; now let's bring those folks back.

#r "nuget: Dapper.FSharp"
#r "nuget: Npgsql"

open System
open Npgsql
open Dapper.FSharp
open Dapper.FSharp.PostgreSQL
// register our optional F# types
OptionTypes.register ()

type Author =
    { id: Guid
      name: string
      email: string
      twitter_handle: string option }

let authorTable =
    table'<Author> "authors" |> inSchema "public"

let connstring =
    "Host=192.168.100.5;Username=admin;Password=Admin123;Database=simple_fsharp"

let conn = new NpgsqlConnection(connstring)

task {
    let! allUsers =
        select {
            for author in authorTable do
                selectAll
        }
        |> conn.SelectAsync<Author>

    printfn "Names:"
    for author in allUsers do
        printfn $"  {author.name}"

    printfn "Twitter Handles:"
    for author in allUsers do
        author.twitter_handle
        |> Option.iter (fun handle -> printfn $"  {handle}")
}
// we're inside a script hence why we need to run it synchronously
// most of the time you don't need this
|> Async.AwaitTask
|> Async.RunSynchronously

To run this, copy this content into a file named script.fsx (or whatever name you prefer) and type:

  • dotnet fsi script.fsx

you should see something like this:

Names:
  Angel D. Munoz
  Misterious Person
Twitter Handles:
  angel_d_munoz

Let’s check the update code which, to be honest, is pretty similar. What do we update though? Our mysterious user doesn’t have a twitter handle, so let’s add one.

#r "nuget: Dapper.FSharp"
#r "nuget: Npgsql"

open System
open Npgsql
open Dapper.FSharp
open Dapper.FSharp.PostgreSQL
// register our optional F# types
OptionTypes.register ()

type Author =
    { id: Guid
      name: string
      email: string
      twitter_handle: string option }

// register our tables
let authorTable =
    table'<Author> "authors" |> inSchema "public"

let connstring =
    "Host=192.168.100.5;Username=admin;Password=Admin123;Database=simple_fsharp"

let conn = new NpgsqlConnection(connstring)

task {
    // look for authors that don't have a twitter handle yet
    let! authors =
        select {
            for author in authorTable do
                where (author.twitter_handle = None)
        }
        |> conn.SelectAsync<Author>

    match authors |> Seq.tryHead with
    // if there is one, let's update it
    | Some author ->
        let! updated =
            update {
                for a in authorTable do
                    setColumn a.twitter_handle (Some "mysterious_fsharper")
                    where (a.id = author.id)
            }
            |> conn.UpdateAsync

        printfn $"Users updated: %i{updated}"
    | None -> printfn "No Users Without handle were Found"
}
// we're inside a script hence why we need to run it synchronously
// most of the time you don't need this
|> Async.AwaitTask
|> Async.RunSynchronously

To run this, copy this content into a file named script.fsx (or whatever name you prefer) and type:

  • dotnet fsi script.fsx

After the script is run, we should see

Users updated: 1

And if we run it for a second time we’ll see

No Users Without handle were Found

And if we run the “select” script again we should see the mysterious_fsharper handle:

Twitter Handles:
  angel_d_munoz
  mysterious_fsharper

So far, things have been quite straightforward, but what if you don’t like the ORM style? If you like to write SQL like a real programmer (which, of course, is sarcasm) or you simply prefer writing your own SQL queries, let’s take a look at Donald.



Donald

Donald can help us have a 1-to-1 mapping with our models just like Dapper.FSharp, but it needs a little help from our side; it is quite flexible in some aspects and tedious in others. Let’s see how we can add these helpers.

For the Donald scripts we will modify our Author and Post records a little bit: we will add a static function called DataReader which takes an IDataReader and returns the corresponding record.

#r "nuget: Donald"

open System
open System.Data
open Donald

// Same Author model from before
type Author =
    { id: Guid
      name: string
      email: string
      twitter_handle: string option }

    // Add the DataReader
    static member DataReader(rd: IDataReader) : Author =
        // the reader has some functions that help us map
        // existing columns from the database and their
        // data type to our record, this can be really great
        // when you need to work on a schema you don't own
        { id = rd.ReadGuid "id"
          name = rd.ReadString "name"
          email = rd.ReadString "email"
          twitter_handle = rd.ReadStringOption "twitter_handle" }

// We do the same with the Post record
type Post =
    { id: Guid
      title: string
      content: string
      authorId: Guid }

    static member DataReader(rd: IDataReader) : Post =
        { id = rd.ReadGuid "id"
          title = rd.ReadString "title"
          content = rd.ReadString "content"
          authorId = rd.ReadGuid "authorId" }

To run this, copy this content into a file named script.fsx (or whatever name you prefer) and type:

  • dotnet fsi script.fsx

There are more patterns you can follow rather than attaching the static function directly to the record; you could have a module Author = ... which contains helper functions (like the data reader), but for simplicity we will attach it right there on the record.
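
If you prefer the module-based pattern, a minimal sketch would look like this (reusing the Author record, Donald's Read* helpers, and the opens from the script above):

// companion module instead of a static member on the record
module Author =
    let dataReader (rd: IDataReader) : Author =
        { id = rd.ReadGuid "id"
          name = rd.ReadString "name"
          email = rd.ReadString "email"
          twitter_handle = rd.ReadStringOption "twitter_handle" }

// usage stays the same, you just pass the module function instead:
// |> Db.query Author.dataReader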

Donald offers two syntax styles when it comes to creating and manipulating queries:

  • Fluent Style

The fluent style is an approach based on piping functions (i.e. using |>). It is similar to fluent APIs in other languages, like db.newCommand(cmd).setParams(params).execute(), which are common in C#; in F# we do something similar, but with individual functions and, in some cases, partial application.

  let authorsFluent =
      conn
      |> Db.newCommand "SELECT * FROM authors WHERE twitter_handle <> @handle"
      |> Db.setParams [ "handle", SqlType.Null ]
      |> Db.query Author.DataReader

  • Expression Style

The expression style uses what in F# we call computation expressions, which you already used with Dapper.FSharp! Here’s the same query as before in the expression style:

  let authorsExpression =
      dbCommand conn {
          cmdText "SELECT * FROM authors WHERE twitter_handle <> @handle"
          cmdParam [ "handle", SqlType.Null ]
      }
      |> Db.query Author.DataReader

They are slightly different, and depending on your background one might feel more comfortable than the other. Feel free to choose the one you like best; in my case I will continue the rest of the post with the expression-based style, given that we already have some expression-based code from Dapper.FSharp. Previously we added some authors, so let’s try to add posts for those authors with Donald.

#r "nuget: Npgsql"
#r "nuget: Donald"

open System
open Npgsql
open Donald
open System.Data

type Author =
    { id: Guid
      name: string
      email: string
      twitter_handle: string option }

    static member DataReader(rd: IDataReader) : Author =
        { id = rd.ReadGuid "id"
          name = rd.ReadString "name"
          email = rd.ReadString "email"
          twitter_handle = rd.ReadStringOption "twitter_handle" }

type Post =
    { id: Guid
      title: string
      content: string
      authorId: Guid }

    static member DataReader(rd: IDataReader) : Post =
        { id = rd.ReadGuid "id"
          title = rd.ReadString "title"
          content = rd.ReadString "content"
          authorId = rd.ReadGuid "authorId" }

let connstring =
    "Host=192.168.100.5;Username=admin;Password=Admin123;Database=simple_fsharp"
let conn = new NpgsqlConnection(connstring)

let authorsResult =
    // let's query all of the authors
    dbCommand conn { cmdText "SELECT * FROM authors" }
    |> Db.query Author.DataReader

let authors =
    // authorsResult is a DbResult<Author list>
    // that is a helper type
    // which helps us handle successful and failed database operations
    match authorsResult with
    // if the operation was successful return the authors
    | Ok authors -> authors
    // otherwise print to the console what failed
    // and return an empty list
    | Error err ->
        printfn "%O" err
        List.empty

let insertCommand =
  """INSERT INTO posts(id, title, content, authorId)
     VALUES(@id, @title, @content, @authorId)"""
for author in authors do
    let postId = Guid.NewGuid()
    let result =
        dbCommand conn {
            cmdText insertCommand

            cmdParam [ "id", SqlType.Guid postId
                       "title", SqlType.String $"RandomPost: {postId}"
                       "content", SqlType.String "This is an extremely Long Post!..."
                       "authorId", SqlType.Guid author.id ]
        }
        |> Db.exec

    match result with
    | Ok () -> printfn $"Inserted post with id: {postId}"
    | Error err -> printfn $"Failed to insert post with id: {postId}... {err}"

To run this, copy this content into a file named script.fsx (or whatever name you prefer) and type:

  • dotnet fsi script.fsx

At this point we should have one post for each author in our database. You can run the script a couple more times to insert other posts if you wish, but I think these scripts show how you can do database operations with these libraries.



Other Libraries

The F# ecosystem has several options that can appeal to developers of all kinds. Here are a few more that are worth looking at if you’re after more alternatives.

RepoDB is a .NET micro-ORM that focuses on performance and is compatible with many database adapters.

Zaid is an F# OSS beast; DustyTables is a simple functional wrapper on top of the SqlClient ADO.NET adapter.

Zaid once again shows us the F# OSS spirit, this time with Npgsql.FSharp, a PostgreSQL-focused wrapper which has a companion SQL analyzer that can type-check your queries against your database at compile time! (There is a short sketch of its style after this list.)

SqlHydra provides a CLI for generating records from an existing database schema, plus a SQL query builder similar to Dapper.FSharp; it is a more complete solution that works for Postgres, MSSQL and SQLite.
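
To give a taste of Npgsql.FSharp mentioned above, here is a minimal sketch (not taken from the library’s docs) assuming the same connection string and authors table we used earlier:

#r "nuget: Npgsql.FSharp"

open Npgsql.FSharp

let connstring =
    "Host=192.168.100.5;Username=admin;Password=Admin123;Database=simple_fsharp"

// read a couple of columns into an anonymous record per row
let authors =
    connstring
    |> Sql.connect
    |> Sql.query "SELECT name, twitter_handle FROM authors"
    |> Sql.execute (fun read ->
        {| name = read.text "name"
           // textOrNone maps a nullable column to an F# option
           twitterHandle = read.textOrNone "twitter_handle" |})

for author in authors do
    let handle = author.twitterHandle |> Option.defaultValue "(no handle)"
    printfn $"{author.name} - {handle}"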



The elephant in the Room…

EntityFramework has always been unfriendly to F# given how much it relies on inheritance and mutability. That isn’t bad in itself (it is the most used ORM in C#, after all), but it creates heavy friction with F#. Recently C# added records to the language, and even if they are not quite like F# records they can be immutable just like them, so EntityFramework might be in a better place now; I haven’t been able to try it, nor do I have the excitement to test it, to be honest.



Bonus! Mondocks

Time for some shameless plug (with some shame)

I know, I know… .NET is not the most friendly towards MongoDB databases given how schemaless it is, and with F# it is even worse! There are a lot of cases where you can have missing properties (which is the same as having a null, or sometimes even worse), but if you control the database, or if for some reason you need to interact with Mongo databases (perhaps because you’re migrating from Node.js or similar), I took some time to work on a DSL that is quite similar to Node’s MongoDB query language.

The approach with Mondocks is somewhat different: ideally you want to work with anonymous records to shape your data, and then, once you have the information, do the mapping to the corresponding record or DTO.
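
As a small hypothetical illustration of that flow (the raw shape and its field names are made up for the example), the mapping step can be a plain function from an anonymous record to your model:

open System

type Author =
    { id: Guid
      name: string
      email: string
      twitter_handle: string option }

// map whatever anonymous shape came back from the database to the model
let toAuthor (raw: {| _id: Guid; name: string; email: string; twitter_handle: string |}) : Author =
    { id = raw._id
      name = raw.name
      email = raw.email
      twitter_handle =
        if String.IsNullOrWhiteSpace raw.twitter_handle then None
        else Some raw.twitter_handle }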

#r "nuget: Mondocks.Net"
#r "nuget: MongoDB.Driver"

open System
open MongoDB
open MongoDB.Driver

open Mondocks.Queries
open Mondocks.Types

type Author =
    { id: Guid
      name: string
      email: string
      twitter_handle: string option }

type Post =
    { id: Guid
      title: string
      content: string
      authorId: Guid }

let insertCmd =
    insert "authors" {
        documents [ { id = Guid.NewGuid()
                      name = "Angel D. Munoz"
                      email = "some@email.com"
                      twitter_handle = Some "angel_d_munoz" }
                    { id = Guid.NewGuid()
                      name = "Misterious Person"
                      email = "mistery@email.com"
                      twitter_handle = None } ]
    }

let client = MongoClient("mongodb://192.168.100.5/")

let database = client.GetDatabase("simple_fsharp")

let result =
    database.RunCommand<InsertResult>(JsonCommand insertCmd)

printfn $"Inserted: %iresult.n"

To run this, copy this content into a file named script.fsx (or whatever name you prefer) and type:

  • dotnet fsi script.fsx

To do an update it’s a similar case: we will fetch the author first, then we will update it.

#r "nuget: Mondocks.Net"
#r "nuget: MongoDB.Driver"

open System
open MongoDB.Bson
open MongoDB.Driver

open Mondocks.Queries
open Mondocks.Types

type Author =
    { _id: Guid
      name: string
      email: string
      twitter_handle: string }

let client = MongoClient("mongodb://192.168.100.5/")
let database = client.GetDatabase("simple_fsharp")


let findCmd =
    find "authors" {
        // the original filter was lost to formatting; as a stand-in,
        // look up the author we inserted earlier without a twitter handle
        filter {| name = "Misterious Person" |}
        limit 1
    }

let result =
    database.RunCommand<FindResult<Author>>(JsonCommand findCmd)
// check on the database result set if we have an author
match result.cursor.firstBatch |> Seq.tryHead with
| Some author ->
    let updateCmd =
        update "authors" {
            // query by author _id
            updates [ { q = {| _id = author._id |}
                        u =
                          { author with
                              // set the updated handle
                              twitter_handle = "mysterious_fsharper" }
                        multi = Some false
                        upsert = Some false
                        collation = None
                        arrayFilters = None
                        hint = None } ]
        }

    let result =
        database.RunCommand<UpdateResult>(JsonCommand updateCmd)

    printfn $"Updated: %iresult.n"
| None -> printfn "No Author was found"

To run this, copy this content into a file named script.fsx (or whatever name you prefer) and type:

  • dotnet fsi script.fsx

You will also notice that you lose a lot of safety doing these kinds of queries; given the nature of MongoDB it’s hard to keep safety around it overall. However, if you come from a dynamic runtime this DSL might feel a little closer to what you’re used to. There are some rough corners, but I invite you to try it and log issues; if you’re looking for an F# OSS project to dip your toes into, it might be a great one 🙂

Also, you can use the usual MongoDB driver as well; you can use both side by side. To be honest, I made it in a way that doesn’t require you to jump out of the standard .NET driver experience.
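
For instance, here is a tiny sketch (not from the original scripts) that reuses the database value from above to look at the same collection through the plain driver:

open MongoDB.Bson
open MongoDB.Driver

// the same database handle works with the regular driver APIs
let rawAuthors = database.GetCollection<BsonDocument>("authors")
let count = rawAuthors.CountDocuments(Builders<BsonDocument>.Filter.Empty)
printfn $"Documents in the authors collection: %i{count}"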



Closing Thoughts…

When it comes to SQL, F# is a safe bet. Be it on the server, in scripts, or in other environments, F# can help you keep type safety between your database and your application. There are plenty of alternatives for you to try, and I’m pretty sure you’ll find what fits you best.

We’ll catch up next time!



Learn Materialize by running streaming SQL on your nginx logs



Introduction

In this tutorial, I will show you how Materialize works by using it to run SQL queries on continuously produced nginx logs. By the end of the tutorial, you will have a better idea of what Materialize is, how it’s different than other SQL engines, and how to use it.



Prerequisites

For the sake of simplicity, I will use a brand new Ubuntu 21.04 server where I will install nginx, Materialize and mzcli, a CLI tool similar to psql used to connect to Materialize and execute SQL on it.

If you want to follow along you could spin up a new Ubuntu 21.04 server on your favorite cloud provider.

If you prefer running Materialize on a different operating system, you can follow the steps on how to install Materialize here:



What is Materialize

Materialize is a streaming database for real-time analytics.

It is not a substitute for your transactional database; instead, it accepts input data from a variety of sources like:

  • Messages from streaming sources like Kafka
  • Archived data from object stores like S3
  • Change feeds from databases like PostgreSQL
  • Data in Files: CSV, JSON and even unstructured files like logs (what we’ll be using today.)

And it maintains the answers to your SQL queries over time, keeping them up-to-date as new data flows in (using materialized views), instead of running them against a static snapshot at a point in time.

Materialize Landing Page

If you want to learn more about Materialize, make sure to check out their official documentation here:

Materialize Documentation



Installing Materialize

Materialize runs as a single binary called materialized (d for daemon, following Unix conventions). Since we’re running on Linux, we’ll just install Materialize directly. To install it, run the following command:

sudo apt install materialized

Once it’s installed, start Materialize (with sudo so it has access to nginx logs):

sudo materialized

Now that we have materialized running, we need to open a new terminal to install and run a CLI tool that we will use to interact with our Materialize instance!

There are other ways to run Materialize, as described here. For a production-ready Materialize instance, I would recommend giving Materialize Cloud a try!



Installing mzcli

The mzcli tool lets us connect to Materialize similar to how we would use a SQL client to connect to any other database.

Materialize is wire-compatible with PostgreSQL, so if you have psql already installed you could use it instead of mzcli, but with mzcli you get nice syntax highlighting and autocomplete when writing your queries.

To learn the main differences between the two, make sure to check out the official documentation here: Materialize CLI Connections

The easiest way to install mzcli is via pipx, so first run:

sudo apt install pipx

and, once pipx is installed, install mzcli with:

pipx install mzcli

Now that we have mzcli we can connect to materialized with:

mzcli -U materialize -h localhost -p 6875 materialize

Materialize mzcli autocompletion

For this demo, let’s quickly install nginx and use Regex to parse the log and create Materialized Views.



Installing nginx

If you don’t already have nginx installed, install it with the following command:

sudo apt install nginx

Next, let’s populate the access log with some entries with a Bash loop:

for i in {1..200} ; do curl -s 'localhost/materialize' > /dev/null ; echo $i ; done

If you have an actual nginx access.log, you can skip the step above.

Now we’ll have some entries in the /var/log/nginx/access.log access log file that we will be able to feed into Materialize.



Adding a Materialize Source

By creating a Source you are essentially telling Materialize to connect to some external data source. As described in the introduction, you could connect a wide variety of sources to Materialize.

For the full list of source types make sure to check out the official documentation here:

Materialize source types

Let’s start by creating a text file source from our nginx access log.

First, access the Materialize instance with the mzcli command:

mzcli -U materialize -h localhost -p 6875 materialize

Then run the following statement to create the source:

CREATE SOURCE nginx_log
FROM FILE '/var/log/nginx/access.log'
WITH (tail = true)
FORMAT REGEX '(?P<ipaddress>[^ ]+) - - \[(?P<time>[^\]]+)\] "(?P<request>[^ ]+) (?P<url>[^ ]+)[^"]+" (?P<statuscode>\d{3})';

A quick rundown:

  • CREATE SOURCE: First we specify that we want to create a source
  • FROM FILE: Then we specify that this source will read from a local file, and we provide the path to that file
  • WITH (tail = true): Continually check the file for new content
  • FORMAT REGEX: as this is an unstructured file we need to specify regex as the format so that we can extract only the specific parts of the log that we need.

Let’s quickly review the Regex itself as well.

The Materialize-specific behavior to note here is that each named capture group (?P<NAME_HERE>...) extracts the matched text into a column named NAME_HERE.

To make this a bit more clear, a standard entry in your nginx access log file would look like this:

123.123.123.119 - - [13/Oct/2021:10:54:22 +0000] "GET / HTTP/1.1" 200 396 "-" "Mozilla/5.0 zgrab/0.x"

  • (?P<ipaddress>[^ ]+): With this pattern we match the IP address for each line of the nginx log, e.g. 123.123.123.119.
  • \[(?P<time>[^\]]+)\]: the timestamp string from inside the square brackets, e.g. [13/Oct/2021:10:54:22 +0000]
  • "(?P<request>[^ ]+): the type of request, like GET, POST etc.
  • (?P<url>[^ ]+): the relative URL, e.g. /favicon.ico
  • (?P<statuscode>\d{3}): the three-digit HTTP status code.

Once you execute the create source statement, you can confirm the source was created successfully by running the following:

mz> SHOW SOURCES;
// Output
+-----------+
| name      |
|-----------|
| nginx_log |
+-----------+
SELECT 1
Time: 0.021s

Now that we have our source in place, let’s go ahead and create a view!



Creating a Materialized View

You may be familiar with materialized views from the world of traditional databases like PostgreSQL, where they are essentially cached queries. The unique feature here is that the materialized view we are about to create is automatically kept up-to-date.

In order to create a materialized view, we will use the following statement:

CREATE MATERIALIZED VIEW aggregated_logs AS
  SELECT
    ipaddress,
    request,
    url,
    statuscode::int,
    COUNT(*) as count
  FROM nginx_log GROUP BY 1,2,3,4;

The important things to note are:

  • Materialize will keep the results of the embedded query in memory, so you’ll always get a fast and up-to-date answer
  • The results are incrementally updated as new log events arrive

Under the hood, Materialize compiles your SQL query into a dataflow and then takes care of all the heavy lifting for you. This is incredibly powerful, as it allows you to process data in real-time using just SQL.

A quick rundown of the statement itself:

  • First we start with the CREATE MATERIALIZED VIEW aggregated_logs which identifies that we want to create a new Materialized view named aggregated_logs.
  • Then we specify the SELECT statement that we are interested in keeping track of over time. In this case we are aggregating the data in our log file by ipaddress, request, url and statuscode, and we are counting the total instances of each combo with a COUNT(*)

When creating a Materialized View, it could be based on multiple sources like a stream from Kafka, a raw data file that you have on an S3 bucket, or your PostgreSQL database. This single statement will give you the power to analyze your data in real-time.

We specified a simple SELECT for the view to be based on, but this could include complex operations like JOINs; for the sake of this tutorial we are keeping things simple.

For more information about Materialized Views check out the official documentation here:

Creating Materialized views

Now you could use this new view and interact with the data from the nginx log with pure SQL!



Reading from the view

If we do a SELECT on this Materialized view, we get a nice aggregated summary of stats:

SELECT * FROM aggregated_logs;

   ipaddress    | request |           url            | statuscode | count
----------------+---------+--------------------------+------------+-------
 127.0.0.1      | GET     | /materialize             |        404 |   200

As more requests come in to the nginx server, the aggregated stats in the view are kept up-to-date.

We could also write queries that do further aggregation and filtering on top of the materialized view, for example, counting requests by route only:

SELECT url, SUM(count) as total FROM aggregated_logs GROUP BY 1 ORDER BY 2 DESC;

If we re-run the query over and over again, we can see the numbers change as soon as new data hits the log, since Materialize processes each line of the log and keeps listening for new lines:

+--------------------------+-------+
| url                      | total |
|--------------------------+-------|
| /materialize/demo-page-2 | 1255  |
| /materialize/demo-page   | 1957  |
| /materialize             | 400   |
+--------------------------+-------+

As another example, let’s use psql together with the watch command to see this in action.

If you don’t have psql already installed you can install it with the following command:

sudo apt install postgresql-client

After that, let’s run the SELECT * FROM aggregated_logs statement every second using the watch command:

watch -n1 "psql -c 'select * from aggregated_logs' -U materialize -h localhost -p 6875 materialize"

In another terminal window, you could run another for loop to generate some new nginx logs and see how the results change:

for i in {1..2000} ; do curl -s 'localhost/materialize/demo-page-2' > /dev/null ; echo $i ; done

The output of the watch command would look like this:

Materialize nginx logs demo

Feel free to experiment with more complex queries and analyze your nginx access log for suspicious activity using pure SQL and keep track of the results in real time!



Conclusion

By now, hopefully you have a hands-on understanding of how incrementally maintained materialized views work in Materialize. If you like the project, make sure to star it on GitHub:

https://github.com/MaterializeInc/materialize

If you are totally new to SQL, make sure to check out this free eBook here:

Free introduction to SQL basics eBook

