U-SQL is a Microsoft query language that combines a declarative SQL-like syntax with C# programming, enabling it to be used to process both structured and unstructured data in big data environments. Introduced in 2015, U-SQL is part of Microsoft's Azure Data Lake Analytics cloud service, but it lets users run queries against multiple data stores in the Azure cloud.
SQL is the standard language for querying relational databases, while C# (pronounced "C-sharp") is a programming language developed by Microsoft. The company integrated them to create U-SQL in response to the emergence of big data systems, which often store unstructured data that can pose different challenges for SQL and procedural programming languages. U-SQL was designed to unify the two approaches by building native extensibility for user-written C# code into a SQL implementation.
U-SQL has its roots in a declarative and extensible scripting language called SCOPE that Microsoft uses internally. In a 2008 research paper, the company said it developed SCOPE -- short for Structured Computations Optimized for Parallel Execution -- to enable data analysts familiar with SQL to query search logs, internet clickstreams and other massive data sets that increasingly were being stored in distributed platforms like Hadoop instead of SQL-based relational databases.
The integration between SQL and C# in U-SQL is based on SCOPE, as is U-SQL's query execution and optimization framework. U-SQL's metadata system, SQL syntax and language semantics are modeled on standard ANSI SQL and Transact-SQL (T-SQL), Microsoft's implementation of the query language for its SQL Server database. However, U-SQL doesn't fully adhere to ANSI SQL; for example, commands such as SELECT must be written in uppercase letters, and C# syntax is used for expressions inside of commands.
Microsoft says U-SQL allows users to process any type of data, and at any scale. The language automatically scales queries to employ available system resources, letting users focus on querying data instead of attending to infrastructure needs or writing what Microsoft calls "plumbing code."
As a component of Microsoft's Azure Data Lake platform, U-SQL is the built-in language for analyzing data sets in the Azure Data Lake Store, which is paired with the analytics service to provide a data lake environment in the cloud. U-SQL can also be used to run queries against relational data stores that include Azure SQL Database, SQL Server's cloud-based cousin, as well as Azure SQL Data Warehouse, Azure Blob Storage and SQL Server instances that are set up in Azure virtual machines.
How to use U-SQL
U-SQL code is written as scripts that include a sequence of statements to initiate processing actions. For now, the Azure Data Lake Analytics service only supports batch processing jobs via U-SQL. As a result, queries can't return results directly; according to Microsoft, U-SQL scripts typically are structured to retrieve data from source systems in a rowset format, transform it as needed and then output the transformed data to a file or a U-SQL table for analysis.
To process sets of unstructured data in a data lake, U-SQL users can apply schema on read -- a common approach in big data systems that don't require conformance to a rigid schema, as relational databases do. Custom processing logic and a variety of user-defined functions, types, aggregates and objects -- the latter including extractors, processors, outputters and more -- can also be inserted into scripts.
The image below shows a simple U-SQL script example from Microsoft. In this case, the script doesn't include any data transformations -- it extracts data from a search log source file, applies a schema and writes the resulting rowset into a CSV file:
U-SQL scripts can be written and submitted to the Azure Data Lake Analytics service for execution in Visual Studio 2017, using a plug-in called Azure Data Lake and Stream Analytics Tools. A U-SQL extension is also available for Visual Studio Code, Microsoft's lightweight code editor; it's known simply as Azure Data Lake Tools. In addition, users can run U-SQL jobs against Azure Data Lake Analytics via the Azure portal, the Azure CLI command-line interface or Azure PowerShell.
The other data stores that can be queried with U-SQL are viewed as external data sources. Users need to run federated queries against them, a process that involves creating data source objects in U-SQL to abstract the details of how to connect to a particular data store and hand off queries to its own query engine for execution.
Microsoft says the ability to incorporate custom C# code into U-SQL scripts allows users to express complex business algorithms as part of queries. The use of C# types as a default coding style also makes it easier for users to visualize how data will be processed while writing queries, according to the company.