A Deep Dive Into Entity Framework Performance When Using “Contains”

Entity Framework is a great tool, but in some cases its performance is slow. One such case arises when complex queries use “Contains.”

Join Toptal .NET Developer Anton Shkuratov in exploring how these performance issues can be mitigated and in finding the best approach to this type of complex query.


During my everyday work, I use Entity Framework. It is very convenient, but in some cases its performance is slow. There are lots of good articles on EF performance improvements, and they offer very useful advice (e.g., avoid complex queries, use parameters in Skip and Take, use views, select only the fields you need). However, not much can really be done when you need to use a complex Contains on two or more fields; in other words, when you join data to an in-memory list.

Problem

Let’s check the following example:

var localData = GetDataFromApiOrUser();
var query = from p in context.Prices
            join s in context.Securities on 
              p.SecurityId equals s.SecurityId
            join t in localData  on 
              new { s.Ticker, p.TradedOn, p.PriceSourceId } equals
              new { t.Ticker, t.TradedOn, t.PriceSourceId }
            select p;
var result = query.ToList();

The code above doesn’t work in EF 6 at all, and while it does work in EF Core, the join is actually performed locally: since I have ten million records in my database, all of them get downloaded and all the memory gets consumed. This is not a bug in EF; it is expected behavior. However, wouldn’t it be great if there were a way around it? In this article, I’m going to experiment with different approaches to working around this performance bottleneck.

Solution

I’m going to try different ways to solve this, starting from the simplest and moving to more advanced ones. At each step, I will provide code and metrics, such as time taken and memory usage. Note that I will interrupt a benchmark run if it takes longer than ten minutes.

The code for the benchmarking program is located in the following repository. It uses C#, .NET Core, EF Core, and PostgreSQL. I used a machine with an Intel Core i5, 8 GB of RAM, and an SSD.
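
To give an idea of how each measurement is taken, here is a minimal sketch of a per-run measurement. The helper below is illustrative only and is not the exact harness from the repository; the RunOption-style delegate and the use of Stopwatch and GC.GetTotalMemory are my assumptions.

using System;
using System.Diagnostics;

public static class BenchmarkSketch
{
  // Runs one approach and returns elapsed time plus the change in managed memory.
  public static (TimeSpan Elapsed, long MemoryBytes) Measure(Action runOption)
  {
    var memoryBefore = GC.GetTotalMemory(forceFullCollection: true);

    var stopwatch = Stopwatch.StartNew();
    runOption(); // e.g., () => RunNaiveOption(testData)
    stopwatch.Stop();

    var memoryAfter = GC.GetTotalMemory(forceFullCollection: false);
    return (stopwatch.Elapsed, memoryAfter - memoryBefore);
  }
}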

The DB schema for testing looks like this:

Tables in the database: prices, securities and price sources

Just three tables: prices, securities and price sources. The prices table has tens of millions of records.
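
The snippets below work with Price and Security entities and an in-memory TestData list. For reference, here is a minimal sketch of what such a model could look like; the shapes are inferred from the queries in this article, and details such as the PriceSource.Name property and the decimal price types are my assumptions rather than the repository’s exact classes.

using System;

public class Security
{
  public int SecurityId { get; set; }
  public string Ticker { get; set; }
}

public class PriceSource
{
  public int PriceSourceId { get; set; }
  public string Name { get; set; } // assumed
}

public class Price
{
  public int PriceId { get; set; }
  public int SecurityId { get; set; }
  public Security Security { get; set; }
  public int PriceSourceId { get; set; }
  public DateTime TradedOn { get; set; }
  public decimal OpenPrice { get; set; }
  public decimal ClosePrice { get; set; }
}

// Element type of the in-memory TestData list used throughout the examples.
public class TestData
{
  public string Ticker { get; set; }
  public DateTime TradedOn { get; set; }
  public int PriceSourceId { get; set; }
  public Price PriceObject { get; set; } // used by the predicate-builder option
}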

Option 1. Simple and Naive

Let’s try something simple, just to get started.

var result = new List<Price>();
using (var context = CreateContext())
{
  foreach (var testElement in TestData)
  {
    result.AddRange(context.Prices.Where(
      x => x.Security.Ticker == testElement.Ticker &&
           x.TradedOn == testElement.TradedOn &&
           x.PriceSourceId == testElement.PriceSourceId));
  }
}

The algorithm is simple: For each element in the test data, find the matching element in the database and add it to the result collection. This code’s main advantage is that it is very easy to implement, and it’s readable and maintainable. Its obvious drawback is that it is the slowest option: even though all three columns are indexed, each iteration is a separate round trip, so the overhead of network communication creates a performance bottleneck. Here are the metrics:

Results of the first experiment

So, for a large volume, it takes approximately one minute. Memory consumption seems to be reasonable.

Option 2. Naive with Parallel

Now let’s try to add parallelism to the code. The core idea here is that hitting the database in parallel threads can improve the overall performance.

var result = new ConcurrentBag<Price>();
var partitioner = Partitioner.Create(0, TestData.Count);

Parallel.ForEach(partitioner, range =>
{
  var subList = TestData.Skip(range.Item1)
                        .Take(range.Item2 - range.Item1)
                        .ToList();
  using (var context = CreateContext())
  {
    foreach (var testElement in subList)
    {
      var query = context.Prices.Where(
            x => x.Security.Ticker == testElement.Ticker &&
                 x.TradedOn == testElement.TradedOn &&
                 x.PriceSourceId == testElement.PriceSourceId);
      foreach (var el in query)
      {
        result.Add(el);
      }
    }
  }
});

Interestingly, for smaller test data sets this approach is slower than the first solution, but for bigger samples it is faster (approximately two times in this instance). Memory consumption changes a bit, but not significantly.

Results of the second experiment

Option 3. Multiple Contains

Let’s try another approach:

  • Prepare three collections of unique values of Ticker, PriceSourceId, and TradedOn.
  • Perform the query in one run, filtering by using three Contains.
  • Recheck the results locally (see below).

var result = new List<Price>();
using (var context = CreateContext())
{
  var tickers = TestData.Select(x => x.Ticker).Distinct().ToList();
  var dates = TestData.Select(x => x.TradedOn).Distinct().ToList();
  var ps = TestData.Select(x => x.PriceSourceId)
                   .Distinct().ToList();

  var data = context.Prices
               .Where(x => tickers.Contains(x.Security.Ticker) &&
                           dates.Contains(x.TradedOn) &&
                           ps.Contains(x.PriceSourceId))
               .Select(x => new { 
                           x.PriceSourceId,
                           Price = x, 
                           Ticker = x.Security.Ticker, 
                      })
                .ToList();

  var lookup = data.ToLookup(x => 
     $"{x.Ticker}, {x.Price.TradedOn}, {x.PriceSourceId}");

  foreach (var el in TestData)
  {
    var key = $"{el.Ticker}, {el.TradedOn}, {el.PriceSourceId}";
    result.AddRange(lookup[key].Select(x => x.Price));
  }
}

This approach is problematic. Execution time is very data dependent. It may retrieve just the required records (in which case it will be very fast), but it might return many more (maybe even 100 times more).

Let’s consider the following test data:

Response data

Here I want the prices for Ticker1 traded on 2018-01-01 and for Ticker2 traded on 2018-01-02: just two records. However, four records will actually be returned.

The unique values for Ticker are Ticker1 and Ticker2, and the unique values for TradedOn are 2018-01-01 and 2018-01-02. The three Contains conditions above match every combination of these values.

So, four records match this expression instead of the two I asked for.

That’s why a local recheck is needed and why this approach is dangerous. The metrics are as follows:

Results of the third experiment

Awful memory consumption! Tests with big volumes failed because they exceeded the ten-minute timeout.

Option 4. Predicate Builder

Let’s change the paradigm: Let’s build a good old Expression for each test data element.

var result = new List<Price>();
using (var context = CreateContext())
{
  var baseQuery = from p in context.Prices
                  join s in context.Securities on 
                    p.SecurityId equals s.SecurityId
                  select new TestData()
                  {
                    Ticker = s.Ticker,
                    TradedOn = p.TradedOn,
                    PriceSourceId = p.PriceSourceId,
                    PriceObject = p
                  };

  var tradedOnProperty = typeof(TestData).GetProperty("TradedOn");
  var priceSourceIdProperty =
    typeof(TestData).GetProperty("PriceSourceId");
  var tickerProperty = typeof(TestData).GetProperty("Ticker");

  var paramExpression = Expression.Parameter(typeof(TestData));
  Expression wholeClause = null;
  foreach (var td in TestData)
  {
    var elementClause = 
      Expression.AndAlso(
        Expression.Equal(
          Expression.MakeMemberAccess(
            paramExpression, tradedOnProperty),
          Expression.Constant(td.TradedOn)
        ),
        Expression.AndAlso(
          Expression.Equal(
            Expression.MakeMemberAccess(
              paramExpression, priceSourceIdProperty),
            Expression.Constant(td.PriceSourceId)
          ),
          Expression.Equal(
            Expression.MakeMemberAccess(
              paramExpression, tickerProperty), 
              Expression.Constant(td.Ticker))
          ));

    if (wholeClause == null)
      wholeClause = elementClause;
    else
      wholeClause = Expression.OrElse(wholeClause, elementClause);
  }

  var query = baseQuery.Where(
  (Expression<Func<TestData, bool>>)Expression.Lambda(
     wholeClause, paramExpression)).Select(x => x.PriceObject);

  result.AddRange(query);
}

The resulting code is pretty complex. Building expressions is not the easiest thing and involves reflection (which, itself, is not that fast). But it lets us build a single query with a long chain of (.. AND .. AND ..) OR (.. AND .. AND ..) OR (.. AND .. AND ..) conditions. These are the results:

Results of the fourth experiment

The results are even worse than any of the previous approaches.

Option 5. Shared Query Data Table

Let’s try one more approach:

I added a new table to the database that will hold query data. For each query I can now:

  • Start a transaction (if one isn’t already started)
  • Upload the query data to that table (temporarily)
  • Perform the query
  • Roll back the transaction to delete the uploaded data

var result = new List<Price>();
using (var context = CreateContext())
{
  context.Database.BeginTransaction();

  var reducedData = TestData.Select(x => new SharedQueryModel()
  {
    PriceSourceId = x.PriceSourceId,
    Ticker = x.Ticker,
    TradedOn = x.TradedOn
  }).ToList();
  
  // Here query data is stored to shared table
  context.QueryDataShared.AddRange(reducedData);
  context.SaveChanges();

  var query = from p in context.Prices
         join s in context.Securities on 
           p.SecurityId equals s.SecurityId
         join t in context.QueryDataShared on 
           new { s.Ticker, p.TradedOn, p.PriceSourceId } equals
           new { t.Ticker, t.TradedOn, t.PriceSourceId }
         select p;
  result.AddRange(query);

  context.Database.RollbackTransaction();
}

Metrics first:

Results of the fifth experiment

The result is very good. Very fast. Memory consumption is also good. But the drawbacks are:

  • You have to create an extra table in the database to support just one type of query,
  • You have to start a transaction (which consumes DBMS resources anyway), and
  • You have to write something to the database in what is logically a read operation, which basically will not work if you use something like a read replica.

But apart from that, this approach is nice: fast and readable. And the query plan is cached in this case!

Option 6. MemoryJoin Extension

Here I’m going to use a NuGet package called EntityFrameworkCore.MemoryJoin. Despite having the word Core in its name, it also supports EF 6. It is called MemoryJoin, but in fact it sends the specified query data to the server as a VALUES construct, and all the work is done on the SQL server.

Let’s check the code.

var result = new List<Price>();
using (var context = CreateContext())
{
  // better to select needed properties only, for better performance
  var reducedData = TestData.Select(x => new { 
    x.Ticker, 
    x.TradedOn, 
    x.PriceSourceId 
  }).ToList();

  var queryable = context.FromLocalList(reducedData);
  var query = from p in context.Prices
              join s in context.Securities on 
                p.SecurityId equals s.SecurityId
              join t in queryable on 
                new { s.Ticker, p.TradedOn, p.PriceSourceId } equals
                new { t.Ticker, t.TradedOn, t.PriceSourceId }
              select p;

  result.AddRange(query);
}

Metrics:

Results of the final experiment

This looks awesome. It is three times faster than the previous approach, which makes it the fastest yet: 3.5 seconds for 64K records! The code is simple and understandable, and it works with read-only replicas. Let’s check the query generated for three elements:

SELECT "p"."PriceId",
       "p"."ClosePrice",
       "p"."OpenPrice",
       "p"."PriceSourceId",
       "p"."SecurityId",
       "p"."TradedOn",
       "t"."Ticker",
       "t"."TradedOn",
       "t"."PriceSourceId"
FROM "Price" AS "p"
INNER JOIN "Security" AS "s" ON "p"."SecurityId" = "s"."SecurityId"
INNER JOIN
  ( SELECT "x"."string1" AS "Ticker",
           "x"."date1" AS "TradedOn",
           CAST("x"."long1" AS int4) AS "PriceSourceId"
   FROM
     ( SELECT *
      FROM (
            VALUES (1, @__gen_q_p0, @__gen_q_p1, @__gen_q_p2), 
                   (2, @__gen_q_p3, @__gen_q_p4, @__gen_q_p5), 
                   (3, @__gen_q_p6, @__gen_q_p7, @__gen_q_p8)
           ) AS __gen_query_data__ (id, string1, date1, long1)
       ) AS "x"
   ) AS "t" ON (("s"."Ticker" = "t"."Ticker")
AND ("p"."PriceSourceId" = "t"."PriceSourceId")

As you can see, this time actual values are passed from memory to the SQL server in the VALUES construction. And this does the trick: The SQL server managed to perform a fast join operation and use the indexes correctly.

However, there are some drawbacks (you may read more on my blog):

  • You need to add an extra DbSet to your model (though there is no need to create a corresponding table in the database); see the sketch after this list.
  • The extension supports model classes with only a limited set of properties: three string properties, three date properties, three GUID properties, three float/double properties, and three int/byte/long/decimal properties. This is more than enough in 90% of cases, I guess. However, if it’s not, you can create a custom class and use that. So, a hint: pass only the values you actually need in a query; otherwise resources are wasted.
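
For the first drawback, the extra DbSet registration might look roughly like the sketch below. This is based on my reading of the package’s documentation, so treat the QueryModelClass type, its namespace, and the property name as assumptions to verify against the package’s README rather than as confirmed code.

using Microsoft.EntityFrameworkCore;
using EntityFrameworkCore.MemoryJoin; // assumed namespace, matching the package name

public class PricesContext : DbContext
{
  // Regular entity sets backed by real tables
  public DbSet<Price> Prices { get; set; }
  public DbSet<Security> Securities { get; set; }

  // Extra set used by MemoryJoin to map in-memory values into a query;
  // no corresponding table needs to be created in the database.
  public DbSet<QueryModelClass> QueryData { get; set; }
}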

Conclusion

Among the approaches I’ve tested here, I’d definitely go for MemoryJoin. Someone might object that the drawbacks are insurmountable, and since not all of them can be resolved at the moment, we should abstain from using the extension. Well, to me that’s like saying you should not use a knife because you could cut yourself. Optimization is not a task for junior developers, but for someone who understands how EF works; in those hands, this tool can improve performance dramatically. Who knows? Maybe one day, someone at Microsoft will add core support for dynamic VALUES.

Finally, here are a few more diagrams comparing the results.

Below is a diagram of the time taken to perform the operation. MemoryJoin is the only approach that does the job in a reasonable time across all volumes. Only four approaches can process the big volumes at all: the two naive implementations, the shared table, and MemoryJoin.

Time taken in various cases for each experiment

The next diagram is for memory consumption. All approaches demonstrate more or less the same numbers except the one with multiple Contains. This phenomenon was described above.

Memory consumption in various cases  for each experiment

Understanding the basics

  • What is DbSet in Entity Framework?

    DbSet is an abstraction that represents a collection of entities (usually lazily loaded) stored in a database table. Operations performed on a DbSet are translated into SQL queries and executed against the actual database records.

  • What does Entity Framework do?

    Entity Framework is an object-relational mapping framework that provides a standard interface for accessing data stored in relational databases (from different vendors).

  • What is the code-first approach in Entity Framework?

    With the code-first approach, the developer creates the model classes first, and the actual database is created from them. One of its biggest advantages is that the database model can be kept in a source control system.
