Skip to content

Introduce DbDataSource abstraction to System.Data #64812

@roji

Description

@roji

Summary

This proposes introducing a new ADO.NET abstraction called DbDataSource, which represents a data source from which (1) connections can be retrieved, and against which (2) commands can be executed directly. tl;dr:

// Provider-specific data source construction:
var dataSourceBuilder = new NpgsqlDataSourceBuilder
{
    ConnectionString = "<CONNECTION_STRING>",
    // Set authentication callbacks, auth tokens, ILoggerFactory, and any other options
};
using var dataSource = dataSourceBuilder.Build();

// Use the data source as a connection factory:
await using (var connection = await dataSource.OpenConnectionAsync())
{
    // Use the connection as usual
}

// Provider-agnostic data source construction:
using var dataSource = dbProviderFactory.CreateDataSource("<CONNECTION_STRING>");

// Get a command that executes directly against the data source, without explicitly referring to the connection
await using (var command = dataSource.CreateCommand("SELECT * FROM Blogs"))
await using (var reader = command.ExecuteReaderAsync())
{
    // Process results as usual
}

Summary of benefits

  • DbDataSource encapsulates all information and configuration needed to provide an open database connection (auth information, logging setup, type mapping config, etc.), ready for executing commands.
  • This makes it suitable for passing around and registering in DI as a factory for connections, without needing to pass around any additional information.
  • It's intended for DbDataSources to correspond to connection pools managed internally in the database driver.
    • With the current API, drivers need to look up the internal pool each time a connection is opened, by the connection string. Since DbDataSource allows user code to directly reference an (abstracted) internal pool, it eliminates that lookup and helps perf.
    • DbDataSource isn't an abstraction for a connection pool (see #24856); that would require introducing some pooling-specific APIs, is complicated, and might not be a good idea. However, it could evolve into one in the future, with external pooling implementations which can be used across ADO.NET providers.
  • It's also possible to get and execute a DbCommand directly on DbDataSource, without needing to deal with DbConnection at all. In scenarios where no connection state is required (e.g. transactions), it's not necessary to burden users with managing the connection, so this can be done under the hood. In addition, this opens up command execution models which aren't tied to a specific connection (e.g. multiplexing on Npgsql).
  • The proposal includes a shim to make the new abstraction immediately available on all ADO.NET drivers, without them needing to implement anything (unless they wish to customize behavior).

Connection instantiation and dependency injection

In ADO.NET, DbConnection implementations are typically instantiated directly, passing the connection string to the constructor:

using var connection = new NpgsqlConnection("<CONNECTION_STRING>");

This requires knowing the connection string at every site where a connection is instantiated. In addition, further authentication details are frequently needed beyond the connection string (SSL callbacks, auth tokens...), which also need to be set on the connection before it's opened. In practice, this means that user applications typically need to create some sort of factory for creating connections, to centralize this information.

Using ADO.NET with dependency injection

For example, when using ADO.NET (or Dapper) with DI, it's possible to configure a scoped connection as follows:

builder.Services.AddScoped(_ => new NpgsqlConnection(
    builder.Configuration.GetConnectionString("BloggingContext")));

However, if multiple connections are needed for any reason, then this becomes more complicated, and some sort of factory is necessary. Rather than forcing every application to build this, DbDataSource would allow the following:

builder.Services.AddSingleton(sp => NpgsqlDataSource.Create(
    builder.Configuration.GetConnectionString("BloggingContext")));

Provider-specific sugar APIs could make this nicer:

builder.Services.AddNpgsqlDataSource(
    builder.Configuration.GetConnectionString("BloggingContext"), o => o
      .UseEncryption(...)
      .SomeOtherConfig(...));

Applications can then get injected with the DbDataSource, and open connections through it. Alternatively, they can configure a scoped connection from the DbDataSource to be injected directly.

API point for connection configuration

The fact that connections are currently directly instantiated makes it difficult to add any sort of configuration APIs; the main way to tweak ADO.NET behavior is currently via connection strings, but that's not appropriate in many cases.

For example, let's suppose an ADO.NET provider wishes to add support Microsoft.Extensions.Logging. It's not possible to pass ILoggerFactory via the connection string, and while it's possible to add a DbConnection constructor overload which accepts ILoggerFactory, that would mean re-instantiating all the required loggers each time a connection is created - which isn't acceptable for perf reasons. With DbDataSource, the logging factory could be assigned directly on the provider's DbDataSource implementation, and all connection retrieved from that DbDataSource would use it.

In a similar vein, Npgsql has an extensible plugin system for modifying type mappings. This allows users to read PostgreSQL date/time values as NodaTime types, instead of as built-in .NET types such as DateTime. Similarly, database spatial types can be read as types from the NetTopologySuite library. At this point, Npgsql plugins can be enabled either globally (affecting all connections in the program), or for a specific NpgsqlConnection instance (generally inappropriate for perf reasons). It would be more appropriate to enable such plugins at the DbDataSource level, allowing multiple type mapping schemes to exist in the same applications without any perf overhead, etc.

Other possible features enabled by DbDataSource could be centrally specifying SQL statements to be prepared on all physical connections, SQL to execute on every physical connection when it's first opened (e.g. to set some state variable), etc.

It's suggested that ADO.NET drivers expose builders - e.g. NpgsqlDataSourceBuilder - which can configure all provider-specific aspects of the data source. Once a data source has been built, it should generally be immutable with respect to configuration options since connections may already have been handed out. An exception to this would probably be the rotation of authentication tokens.

DbProviderFactory

ADO.NET already contains an abstraction called DbProviderFactory, which can be used to create connections and other ADO.NET objects. However, this abstraction only serves for writing database-agnostic code that isn't specific to a certain driver:

using var connection = dbProviderFactory.CreateConnection();

The resulting DbConnection has no connection string or any other information on it; DbProviderFactory is nothing more than an abstraction for instantiating the provider-specific DbConnection implementation. It is generally a singleton type which represents the driver, as opposed to a database, a connection pool or similar.

We could theoretically retrofit DbProviderFactory with the above new APIs instead of introducing a new DbDataSource type. However, mixing the two concerns could be confusing, aswe'd then have two types of DbProviderFactory instances:

  1. Those which were created with connection information/configuration (new kind), correspond to a database and can produce open, ready-to-use connections
  2. Those which weren't (existing kind), and correspond to the driver as a whole. They can only produce blank connections which require configuration before they can be opened

As a result, I'm proposing to keep DbProviderFactory and DbDataSource separate, and to add a new CreateDataSource method on the DbProviderFactory abstraction, to allow DbDataSource to be created and used in database-agnostic code.

Executing commands directly on DbDataSource

Beyond being just a factory for connections, we can allow executing commands directly on DbDataSource:

await using var command = dataSource.CreateCommand("SELECT ...");
command.Parameters.Add(...);
var result = await command.ExecuteScalarAsync();

The main advantage here is API usability: in many (most?) scenarios, there's no need for users to concern themselves with connections - they simply want to execute a query against a database. Any scenarios which involve connection state (e.g. transaction) do require explicitly handling connections, but in the common case where no connection state is involved, we can simplify things. In typical DI scenarios, it's expected that a DbDataSource get injected, and that most commands would be executed directly against it.

Note that this isn't a Dapper-like improvement of ADO.NET: users still deal with the standard DbCommand API, add parameters and read results as usual. In fact, Dapper could add extension methods for better usability over DbDataSource just as it does over DbConnection.

A second advantage is to open up execution models which bypass the concept of a connection - this is what Npgsql does with multiplexing. In this model, commands are provided to the driver for execution on any arbitrary connection; the driver is free to coalesce unrelated commands together for better efficiency, and to continue sending commands on connections even when they're already busy, allowing better utilization of pooled connections. Executing a command directly on a DbDataSource is the right API for this kind of execution model.

To make this available on all ADO.NET providers without requiring changes, the default implementation of DbDataSource.CreateCommand would return a DbCommand wrapper over the native DbCommand returned by the driver's DbProviderFactory.CreateCommand. DbCommand.ExecuteReader (and other execution methods) would be overridden to open the connection (retrieved from DbDataSource.GetConnection) before execution, and to close it afterwards (e.g. with CommandBehavior.CloseConnection). Since the connection is owned by the command in this scenario, accessing DbCommand.Connection and Transaction should raise an exception.

Additional notes

  • When instantiating a DbConnection directly and when pooling is enabled, the ADO.NET driver must locate the internal connection pool based on the connection string. This typically involves an internal dictionary lookup on the connection string, but if any additional information can affect pooling, the situation can become more complicated. DbDataSource eliminates this perf overhead by doing any setup once at startup, and from that point the user directly references the internal pool through the abstraction.
  • DbDataSource is disposable, so that disposing may close all connections and resources associated with them, similar to the semi-standard ClearPool API that exists in SqlClient, Npgsql, Microsoft.Data.SQLite, etc.
    • A disposed DbDataSource is expected to throw ObjectDisposedException from GetConnection. However, connections which have already been handed out should continue working, and can themselves be disposed.
  • Since the provider's regular DbConnection instances are handed out by DbDataSource, users may call Close and Open on them. While this is not the recommended usage pattern, this would work and have the same effect as if the connection were instantiated directly. That is, if Close is called on a DbConnection that was acquired from a DbDataSource that represents a connection pool, the connection would be returned to the pool as usual.
  • When users instantiate two pooling DbDataSource instances with the same connection strings, those DbDataSources would typically (but not necessarily) represent two distinct pools. This isn't the case when using the shim implementation, where all connections with the same connection string are managed via the same pool.
  • Naming-wise:
    • There's a DataSource property on DbConnection, which contains an arbitrary provider-specific string representing the database of the connection. Introducing DbDataSource seems reasonably consistent with that.
    • There's also DbDataSourceEnumerator which returns a DataTable with various information on visible databases. Logically it's the same concept of a DataSource, though, so that makes sense.
  • Unfortunately we can't use covariant return types on DbDataSource, since we have async APIs returning ValueTask<T> which isn't covariant. So we use the old ADO.NET "covariance" pattern (protected virtual plus public non-virtual that gets replaced in implementations).

API proposal

public abstract class DbDataSource : IDisposable, IAsyncDisposable
{
    public abstract string ConnectionString { get; }

    protected abstract DbConnection GetDbConnection();

    protected virtual DbConnection OpenDbConnection()
    {
        var connection = GetDbConnection();
        connection.Open();

        return connection;
    }

    protected virtual async ValueTask<DbConnection> OpenDbConnectionAsync(CancellationToken cancellationToken = default)
    {
        var connection = GetDbConnection();
        await connection.OpenAsync(cancellationToken);

        return connection;
    }

    protected virtual DbCommand CreateDbCommand(string? commandText = null)
    {
        var connection = GetDbConnection();
        var command = connection.CreateCommand();

        // Return a DbCommand wrapper over connection and command, which takes care of opening/closing the connection around executions.
    }

    protected virtual DbBatch CreateDbBatch()
    {
        // Similar to CreateDbCommand...
    }

    public DbConnection GetConnection() => GetDbConnection();
    public DbConnection OpenConnection() => OpenDbConnection();
    public ValueTask<DbConnection> OpenConnectionAsync() => OpenDbConnectionAsync();
    public DbCommand CreateCommand(string? commandText = null) => CreateDbCommand();
    public DbBatch CreateBatch() => CreateDbBatch();

    public virtual void Dispose()
    {
    }

    public virtual ValueTask DisposeAsync()
        => default;
}

For provider-agnostic code, we'd add the following on System.Data.Common.DbProviderFactory:

public class DbProviderFactory
{
    public virtual DbDataSource CreateDataSource(string connectionString)
        => new DefaultDataSource(this, connectionString);
}

To make this work on existing providers, the default implementation of CreateDataSource would return an internal "shim" implementation implemented which simply gets connections from the DbProviderFactory that created it:

internal sealed class DefaultDataSource : DbDataSource
{
    private readonly DbProviderFactory _dbProviderFactory;
    private readonly string _connectionString;

    internal DefaultDataSource(DbProviderFactory dbProviderFactory, string connectionString)
        => (_dbProviderFactory, _connectionString) = (dbProviderFactory, connectionString);

    public override string ConnectionString => _connectionString;

    protected override DbConnection GetDbConnection()
    {
        var connection = _dbProviderFactory.CreateConnection();
        if (connection is null)
        {
            throw new InvalidOperationException("DbProviderFactory returned a null connection");
        }

        connection.ConnectionString = _connectionString;

        return connection;
    }

    // TODO: Command execution
}

Edit history

Date Modification
2022-02-04 Initial proposal
2022-02-07 Add note on multiple DbDatasources, connection strings and pools. Clarified that DbDataSource.ConnectionString should be non-nullable,
2022-05-05 Made DbDataSource.Dispose (and async) virtual no-op instead of abstract. Added missing public non-virtual DbDataSource.CreateCommand() and CreateBatch(). Added optional commandText parameter to CreateCommand for usability.

Metadata

Metadata

Assignees

Labels

Type

No type

Projects

No projects

Milestone

Relationships

None yet

Development

No branches or pull requests

Issue actions