Wednesday, February 25, 2015

The LINQ Join Method: Deciphering the Parameters

In the last few articles, I've been answering questions that have come up during my presentations of lambda expressions and LINQ, including whether we should use the methods with or without a predicate and how the OfType method actually works. To continue, we'll take a look at the "Join" method.

I'm going to spend a bit of time with "Join" because it's come up twice in the last few weeks -- once in an email and once during my presentation at the Las Vegas Code Camp last weekend. When I do my presentations, I use the fluent syntax of LINQ (also known as the method syntax where we "dot" methods together). This leads to the following question:
How do you use "Join" using the fluent syntax?
The answer is that we just need to parse the parameters of the method. When we do this, we may find ourselves using 3 different lambda expressions as parameters. So let's take things step by step.

The completed code for this can be found in the "BONUSJoinSyntax" branch of the lambdas-and-linq repository on GitHub: BONUSJoinSyntax branch. The sample is in the "JoinSyntax" project.

[Update 4/24/2015: If you'd like to see a walkthrough of this on video, check out Deciphering the Join Method]

The Sample Data
When we use "Join", we combine 2 separate but related pieces of data. For this, we'll use a collection of "Person" objects and a collection of "Order" objects. Let's take a look at the objects:

From People.cs
From Orders.cs

This shows our objects. Notice that the "Order" class has a "CustomerId" property. This matches up with the "Id" property of the "Person" class. So these are the properties we will use to join our data.

The data itself comes from the "People" and "Orders" classes. Notice that each of these has a single static method that returns an IEnumerable. This is just a hard-coded list of objects that we can use in our application.
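As a rough sketch of the shape being described, the classes might look something like the following. Only the names mentioned in this article ("Id", "FirstName", "LastName", "CustomerId", "OrderDate", "GetPeople", "GetOrders") come from the sample. The data values are made up for illustration, the real classes in the repository may have more properties, and the quick check at the top assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Quick check: every order should point at an existing person.
var knownIds = People.GetPeople().Select(p => p.Id).ToList();
bool allMatched = Orders.GetOrders().All(o => knownIds.Contains(o.CustomerId));
Console.WriteLine(allMatched);

public class Person
{
    public int Id { get; set; }
    public string FirstName { get; set; } = "";
    public string LastName { get; set; } = "";
}

public class Order
{
    public int CustomerId { get; set; }      // matches Person.Id
    public DateTime OrderDate { get; set; }
}

public static class People
{
    // Hard-coded, made-up data standing in for the sample's list.
    public static IEnumerable<Person> GetPeople()
    {
        return new List<Person>
        {
            new Person { Id = 1, FirstName = "Ada", LastName = "Lovelace" },
            new Person { Id = 2, FirstName = "Alan", LastName = "Turing" },
        };
    }
}

public static class Orders
{
    // Deliberately not sorted by customer or by date.
    public static IEnumerable<Order> GetOrders()
    {
        return new List<Order>
        {
            new Order { CustomerId = 2, OrderDate = new DateTime(2014, 12, 15) },
            new Order { CustomerId = 1, OrderDate = new DateTime(2014, 12, 3) },
            new Order { CustomerId = 1, OrderDate = new DateTime(2015, 1, 8) },
        };
    }
}
```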

Using "Join" With the Query Syntax
We'll start by looking at how we use "Join" with the query syntax -- it's a bit easier to understand. This code is in the "OrderReports.cs" file. The idea is that we have some methods to generate reports based on our data. Then we have a console application where we display the data.

Here's a simple LINQ query that joins data from our People and Orders:
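A minimal sketch of such a query, assuming C# 9 or later so it can be written as a local function in a single file. The method name "OrderDatesByCustomer1" is the one used later in this article; the two arrays are made-up stand-ins for the results of "GetPeople" and "GetOrders".

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

IEnumerable<dynamic> OrderDatesByCustomer1()
{
    // Made-up stand-ins for the GetPeople() and GetOrders() calls.
    var people = new[]
    {
        new { Id = 1, FirstName = "Ada", LastName = "Lovelace" },
        new { Id = 2, FirstName = "Alan", LastName = "Turing" },
    };
    var orders = new[]
    {
        new { CustomerId = 2, OrderDate = new DateTime(2014, 12, 15) },
        new { CustomerId = 1, OrderDate = new DateTime(2014, 12, 3) },
        new { CustomerId = 1, OrderDate = new DateTime(2015, 1, 8) },
    };

    // Join people to orders on Id == CustomerId and project an anonymous type.
    var orderDates = from p in people
                     join o in orders on p.Id equals o.CustomerId
                     select new { p.LastName, p.FirstName, o.OrderDate };

    return orderDates;
}

Console.WriteLine(OrderDatesByCustomer1().Count());   // prints 3
```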


Notice that this method returns an "IEnumerable<dynamic>". We'll come back to why we're using this in a bit. Before we get to that, let's look at the body of the method.

First we call "GetPeople" and "GetOrders" to get some data to work with.

Next we create a LINQ query using the query syntax. "from p in people" says that we want to get data from the "people" collection and give it the alias "p". Then we have the join: "join o in orders" says that we want to add data from the "orders" collection, and we'll give it an alias of "o". Then we have the "on" statement. This specifies how the data is related. In this case, we say that the "Id" property of a "Person" is the same as the "CustomerId" property of an "Order". (And this is exactly what we noted above).

Finally, we have a "select" statement. For this we are using an anonymous type, which is why we have the "new" keyword without a specific type name. Inside the curly braces, we specify that we want our anonymous type to include the "LastName", "FirstName", and "OrderDate" properties.

The "var" keyword was created just for this purpose: so that we could have types without names. This is still strongly-typed, but the type only has an internal name. We can see this by putting our mouse over the "var" next to "orderDates".


This shows that "orderDates" is an "IEnumerable<T>". Right in the middle we see that "T is 'a". "'a" is the internal name that the compiler gives to our anonymous type. And at the very bottom, we can see the "'a" consists of 3 properties: LastName, FirstName, and OrderDate. (Also notice that the types for the properties are filled in as well.)

Back to the return type for the method: "IEnumerable<dynamic>". The reason I used "dynamic" is so that we can use the anonymous type in our output. Using "dynamic" is not ideal (as we'll see in just a bit), and I'd probably make this a named type in a reporting library -- but we won't worry about this today.

Running the Report
Now that we have a query, let's use it to output some data. For this, we have a simple console application in our project: "Program.cs".

Here's the code:


This calls the "OrderDatesByCustomer1" method that we just created. Then it iterates through each item and outputs it to the console.

If we put the mouse over the "var" with "item", we'll see that this is a "dynamic" object:


This means that we get absolutely zero IntelliSense on this. So inside the "foreach" loop, when we type "item." we don't get any help from Visual Studio. We have to type in "OrderDate", "FirstName", and "LastName" ourselves (and spelling and capitalization count). This is the reason why I would not use "dynamic" in a production reporting library -- it is non-discoverable, meaning we have to have intimate knowledge of the data that is coming back. (I'll update this code in a future article, but let's not let it distract us from looking at the LINQ queries.)
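The loop and the pitfall described above can be sketched roughly as follows. The local function is a made-up stand-in for the report method (it simply yields two anonymous objects), and the output format is an assumption. The second loop shows what "spelling and capitalization count" means in practice: a misspelled member compiles, then fails when the line runs.

```csharp
using System;
using System.Collections.Generic;
using Microsoft.CSharp.RuntimeBinder;

// Made-up stand-in for the report method: anonymous objects exposed as dynamic.
IEnumerable<dynamic> OrderDatesByCustomer1()
{
    yield return new { LastName = "Lovelace", FirstName = "Ada", OrderDate = new DateTime(2014, 12, 3) };
    yield return new { LastName = "Turing", FirstName = "Alan", OrderDate = new DateTime(2014, 12, 15) };
}

foreach (var item in OrderDatesByCustomer1())
{
    // Member names are resolved at run time, so they must be typed exactly.
    Console.WriteLine("{0:d} {1} {2}", item.OrderDate, item.FirstName, item.LastName);
}

// A misspelled member still compiles; it only fails when the line executes.
bool failedAtRunTime = false;
try
{
    foreach (var item in OrderDatesByCustomer1())
    {
        Console.WriteLine(item.Orderdate);   // wrong capitalization
    }
}
catch (RuntimeBinderException)
{
    failedAtRunTime = true;
}
Console.WriteLine(failedAtRunTime);   // prints True
```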

Here's the output when we run the console application:


We have about 20 records, and the first thing I notice is that they would benefit from being sorted. If you're curious about where this order came from, the records are in the order of the people first (which is the "outer" part of our join) and then in date order secondarily (which is the "inner" part of our join). This is simply the order in which the data comes out of the methods (it isn't sorted alphabetically or by date).

Let's do a few things to make this data easier to look at.

Adding Sorting and Filtering
Let's start by putting our results into order by date. That makes most sense based on the data we get back from this report. For this, we'll add an "orderby" statement to our query:
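A sketch of the query with the "orderby" added might look like this. The two arrays are made-up stand-ins for the sample data, and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Linq;

// Made-up sample data.
var people = new[]
{
    new { Id = 1, FirstName = "Ada", LastName = "Lovelace" },
    new { Id = 2, FirstName = "Alan", LastName = "Turing" },
};
var orders = new[]
{
    new { CustomerId = 2, OrderDate = new DateTime(2014, 12, 15) },
    new { CustomerId = 1, OrderDate = new DateTime(2015, 1, 8) },
    new { CustomerId = 1, OrderDate = new DateTime(2014, 12, 3) },
};

// Same join as before, now sorted by the order date.
var orderDates = from p in people
                 join o in orders on p.Id equals o.CustomerId
                 orderby o.OrderDate
                 select new { p.LastName, p.FirstName, o.OrderDate };

foreach (var item in orderDates)
{
    Console.WriteLine($"{item.OrderDate:yyyy-MM-dd} {item.FirstName} {item.LastName}");
}
```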


This brings some order to our results:


Next we'll add a date range filter. For this, we'll add a "where" statement:


Notice that we added "startDate" and "endDate" parameters to our method. We use these in the query with a new "where" statement to filter our data.
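Here is a sketch of the method with both parameters in place, followed by a call for December 2014. The sample data is made up, and treating both ends of the range as inclusive is an assumption on my part.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

IEnumerable<dynamic> OrderDatesByCustomer1(DateTime startDate, DateTime endDate)
{
    // Made-up stand-ins for the GetPeople() and GetOrders() calls.
    var people = new[]
    {
        new { Id = 1, FirstName = "Ada", LastName = "Lovelace" },
        new { Id = 2, FirstName = "Alan", LastName = "Turing" },
    };
    var orders = new[]
    {
        new { CustomerId = 2, OrderDate = new DateTime(2014, 12, 15) },
        new { CustomerId = 1, OrderDate = new DateTime(2015, 1, 8) },
        new { CustomerId = 1, OrderDate = new DateTime(2014, 12, 3) },
    };

    var orderDates = from p in people
                     join o in orders on p.Id equals o.CustomerId
                     where o.OrderDate >= startDate && o.OrderDate <= endDate   // inclusive range (assumed)
                     orderby o.OrderDate
                     select new { p.LastName, p.FirstName, o.OrderDate };

    return orderDates;
}

// Orders placed in December 2014.
var december = OrderDatesByCustomer1(new DateTime(2014, 12, 1), new DateTime(2014, 12, 31));
Console.WriteLine(december.Count());   // prints 2
```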

We need to modify our console application a bit to add the parameters. In this case, we're looking for items in the month of December 2014:


And here's the output:


Now we have a smaller set of records to work with. And this gives us a result set for us to try to match by using the fluent syntax.

Using "Join" With the Fluent Syntax
Let's try to do this same thing using the fluent LINQ syntax. In my presentation for "Learn to Love Lambdas (and LINQ, Too)", I show how to read the method signatures for several LINQ methods, including "Where", "OrderBy", and "SingleOrDefault". I won't repeat all of that here because you can see that in the downloadable materials (and in a video series that I'll be publishing in the next few weeks).

But things get a bit confusing when we look at the method signature for "Join":


This has a lot of parameters. But they are related. The first 2 parameters describe the data collections that we are joining, the next 2 parameters describe how those collections are related, and the last parameter lets us transform the data into our resulting records.
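For reference, the comment at the top of this sketch reproduces the signature, and the call underneath spells out every generic type and parameter name so the mapping is visible. Calling "Join" as a plain static method is only for illustration (normally we use it as an extension method), and the "Person" and "Order" records are compact, made-up stand-ins for the sample classes (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// The overload discussed here, as declared on System.Linq.Enumerable:
//
//   public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(
//       this IEnumerable<TOuter> outer,
//       IEnumerable<TInner> inner,
//       Func<TOuter, TKey> outerKeySelector,
//       Func<TInner, TKey> innerKeySelector,
//       Func<TOuter, TInner, TResult> resultSelector)

var people = new[] { new Person(1, "Ada", "Lovelace"), new Person(2, "Alan", "Turing") };
var orders = new[]
{
    new Order(2, new DateTime(2014, 12, 15)),
    new Order(1, new DateTime(2014, 12, 3)),
};

// TOuter = Person, TInner = Order, TKey = int, TResult = string
IEnumerable<string> lines = Enumerable.Join<Person, Order, int, string>(
    outer: people,
    inner: orders,
    outerKeySelector: p => p.Id,
    innerKeySelector: o => o.CustomerId,
    resultSelector: (p, o) => p.LastName + " " + o.OrderDate.Year);

foreach (var line in lines)
{
    Console.WriteLine(line);
}

// Hypothetical stand-ins for the sample's Person and Order classes.
record Person(int Id, string FirstName, string LastName);
record Order(int CustomerId, DateTime OrderDate);
```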

Let's step through them one at a time.

IEnumerable<TOuter> outer
This is one of our collections of data. In our case, this will be our collection of "Person" objects. So wherever we see the generic type "TOuter", we can replace it with "Person". In addition, since this is an extension method, this will actually be the type that we extend. (For more information on Extension Methods, see "Quick Byte: Extension Methods".)

Note: I'm using the word "collection" in the general sense. These are technically enumerations which can also represent calculated sequences.

IEnumerable<TInner> inner
This is the other collection that we are joining to. In our case, this will be our collection of "Order" objects. So wherever we see the generic type "TInner", we can replace it with "Order".

Key Selectors
Next, we have the key selectors. These are one of the confusing parts. But this just gives us a way to specify which properties are related. So for our "Person" (remember, this is "TOuter"), we want to use the "Id" property, and for our "Order" (the "TInner"), we want to use the "CustomerId" property.

Func<TOuter, TKey> outerKeySelector
The syntax for this is a bit strange: "Func<Person, TKey>". From working with Func, we know that this is a method that should take a "Person" as a parameter and return a "TKey". What does this mean? It means that we just need to provide a property that we can use in "equals" comparisons.

Note: If you aren't familiar with Func<T>, then you can take a look at "Get Func<>-y: Delegates in .NET".

In technical terms, "Join" compares the keys using the default equality comparer for "TKey". That comparer uses the "IEquatable<T>" interface when the key type implements it and otherwise falls back to "Equals" and "GetHashCode". If we're using standard .NET types (such as int, string, DateTime, etc.), we don't have to do anything special; this is all built in. But if we're using a custom type, then we may need to implement the interface (or override "Equals" and "GetHashCode") or provide a separate equality comparer. But this is usually beyond what we need to do.
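For the rare case where the built-in equality isn't what we want, "Join" has a second overload that takes an "IEqualityComparer<TKey>" as a final parameter. The sketch below uses hypothetical data (customer codes that differ only in letter case, which is not part of the sample project) to show the difference.

```csharp
using System;
using System.Linq;

// Hypothetical data: the customer codes differ only in letter case.
var customers = new[] { new { Code = "abc", Name = "Ada" } };
var orders = new[] { new { CustomerCode = "ABC", Total = 10m } };

// Default comparer: "abc" and "ABC" are different strings, so nothing matches.
var strict = customers.Join(orders,
    c => c.Code,
    o => o.CustomerCode,
    (c, o) => new { c.Name, o.Total });

// Overload with an explicit IEqualityComparer<string> as the last argument.
var relaxed = customers.Join(orders,
    c => c.Code,
    o => o.CustomerCode,
    (c, o) => new { c.Name, o.Total },
    StringComparer.OrdinalIgnoreCase);

Console.WriteLine(strict.Count());    // prints 0
Console.WriteLine(relaxed.Count());   // prints 1
```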

Since our "Person" class is keyed off of the "Id" property (which is an integer), the actual method signature will be "Func<Person, int>" -- a method that takes a "Person" as a parameter and returns an "integer". We'll see this in a bit.

Func<TInner, TKey> innerKeySelector
To associate our "Order" class with our "Person" class, we use the "CustomerId" property (an integer) of the Order. This means that our actual method signature for this is "Func<Order, int>" -- a method that takes an "Order" as a parameter and returns an "integer".

One thing to notice is that the generic type parameter for both of our key selectors is "TKey". This means that we must use the same type (in our case, it is an integer). This makes sense since we need to compare these based on equality. If they were different types, then they couldn't be equal.

Func<TOuter, TInner, TResult> resultSelector
This last parameter is also a bit confusing. Its purpose is to specify what we want our output to look like, just like the "select" statement in the LINQ query. And we'll use an anonymous type for our output just like we did above.

If we fill in our generic types, this means that the method signature will be "Func<Person, Order, 'a>" -- a method that takes a "Person" and an "Order" as parameters and returns an anonymous type.
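To make the three "Func" types concrete, here is a sketch that stores each lambda in an explicitly typed variable before passing it to "Join". An anonymous type can't be written out as a type argument, so the hypothetical "OrderDateRow" record takes its place; "Person" and "Order" are likewise compact, made-up stand-ins for the sample classes (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

Func<Person, int> outerKeySelector = p => p.Id;
Func<Order, int> innerKeySelector = o => o.CustomerId;
Func<Person, Order, OrderDateRow> resultSelector =
    (p, o) => new OrderDateRow(p.LastName, p.FirstName, o.OrderDate);

// Each delegate can be invoked on its own, like any other method.
var ada = new Person(1, "Ada", "Lovelace");
var order = new Order(1, new DateTime(2014, 12, 3));
Console.WriteLine(outerKeySelector(ada) == innerKeySelector(order));   // prints True
Console.WriteLine(resultSelector(ada, order).LastName);                // prints Lovelace

// Passing the same delegates to Join.
IEnumerable<OrderDateRow> rows =
    new[] { ada }.Join(new[] { order }, outerKeySelector, innerKeySelector, resultSelector);
Console.WriteLine(rows.Count());                                       // prints 1

// Hypothetical types for illustration only.
record Person(int Id, string FirstName, string LastName);
record Order(int CustomerId, DateTime OrderDate);
record OrderDateRow(string LastName, string FirstName, DateTime OrderDate);
```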

Putting This All Together
Here's our initial LINQ statement using the fluent syntax:
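Based on the parameters walked through in the rest of this section, the statement looks roughly like this. The two arrays are made-up stand-ins for the sample data, and the code assumes C# 9 or later (top-level statements).

```csharp
using System;
using System.Linq;

// Made-up sample data.
var people = new[]
{
    new { Id = 1, FirstName = "Ada", LastName = "Lovelace" },
    new { Id = 2, FirstName = "Alan", LastName = "Turing" },
};
var orders = new[]
{
    new { CustomerId = 2, OrderDate = new DateTime(2014, 12, 15) },
    new { CustomerId = 1, OrderDate = new DateTime(2014, 12, 3) },
    new { CustomerId = 1, OrderDate = new DateTime(2015, 1, 8) },
};

var orderDates = people.Join(orders,                                 // outer.Join(inner, ...
    p => p.Id,                                                       // outer key selector
    o => o.CustomerId,                                               // inner key selector
    (p, o) => new { p.LastName, p.FirstName, o.OrderDate });         // result selector

foreach (var item in orderDates)
{
    Console.WriteLine($"{item.OrderDate:yyyy-MM-dd} {item.FirstName} {item.LastName}");
}
```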


Let's go through these parameters one at a time.

people
The first parameter is the "people" object. This is the "IEnumerable<Person>" (a.k.a. "IEnumerable<TOuter>"). Since "Join" is an extension method, our first parameter has been moved to the front.

orders
The next parameter (first inside the parentheses) is the "orders" object. This is the "IEnumerable<Order>" (a.k.a. "IEnumerable<TInner>").

p => p.Id
Next we have our first lambda expression. Remember, this is a "Func<Person, int>" in our case. So we have a lambda expression that takes a "Person" as a parameter (our "p") and returns an integer (the "Id" property).

With lambda expressions, we can name our parameters whatever we like; however, it is very common to use single character parameter names to keep lambdas short. I use "p" to remind me that this is a "Person" object.

This syntax is very similar to what we see when we use the "OrderBy" method to specify a property for sorting. (See "Learn to Love Lambdas" for details on this.)

o => o.CustomerId
Our next lambda expression is similar to the first lambda. This one is a "Func<Order, int>". So we have a lambda expression that takes an "Order" as a parameter (our "o") and returns an integer (the "CustomerId" property).

By returning these two properties ("Id" and "CustomerId"), we let our "Join" method know how to match up the records in the different collections.

(p, o) => new { p.LastName, p.FirstName, o.OrderDate }
Our last lambda expression is what we want our resulting data to look like. As noted above, this needs to match the signature "Func<Person, Order, 'a>" in this case. So we have 2 parameters: Person and Order, and I've named these "p" and "o" respectively.

The body of the lambda is the type that we want to return. And just like when we used the "select" statement above, we use the "new" keyword to generate an anonymous type. And this type specifies that we get the "LastName" and "FirstName" properties from the "Person", and the "OrderDate" property from the "Order".

A Lot of Pieces
So, it seems that we have a lot of pieces, but if we compare this to the query syntax, we see that we have the same information, just in a bit of a different format.

Query Syntax

Fluent Syntax

Once you get comfortable with lambda expressions, the fluent syntax becomes very readable. This is one reason why I teach people about lambda expressions and encourage them to use them so that they become familiar.

Using the Method
Next we'll update our console application to use the new method:


This is much the same as we have with the other method. And our output is the same as our initial output:


Let's fix this up a bit like we did with our query syntax example.

Adding Sorting and Filtering
We'll add sorting to get things into date order. One nice thing about using the fluent syntax is that we simply "dot" our methods together.

So to add sorting, we just add an "OrderBy" method call:
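A sketch of the call with "OrderBy" chained onto the end, using made-up sample data. The comment at the top reproduces the "OrderBy" signature for comparison with the key selectors in "Join".

```csharp
using System;
using System.Linq;

// OrderBy as declared on System.Linq.Enumerable:
//
//   public static IOrderedEnumerable<TSource> OrderBy<TSource, TKey>(
//       this IEnumerable<TSource> source,
//       Func<TSource, TKey> keySelector)

// Made-up sample data.
var people = new[]
{
    new { Id = 1, FirstName = "Ada", LastName = "Lovelace" },
    new { Id = 2, FirstName = "Alan", LastName = "Turing" },
};
var orders = new[]
{
    new { CustomerId = 2, OrderDate = new DateTime(2014, 12, 15) },
    new { CustomerId = 1, OrderDate = new DateTime(2015, 1, 8) },
    new { CustomerId = 1, OrderDate = new DateTime(2014, 12, 3) },
};

var orderDates = people.Join(orders,
        p => p.Id,
        o => o.CustomerId,
        (p, o) => new { p.LastName, p.FirstName, o.OrderDate })
    .OrderBy(r => r.OrderDate);   // "r" is the anonymous type produced by Join

foreach (var item in orderDates)
{
    Console.WriteLine($"{item.OrderDate:yyyy-MM-dd} {item.FirstName} {item.LastName}");
}
```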


The "OrderBy" method needs to return a property that we can sort on. In this case, we use the "OrderDate" property. Notice that the syntax for this lambda expression is similar to what we saw with the key selectors.

This makes sense when we look at the signature for "OrderBy":


This has a "keySelector" parameter which is a "Func<TSource, TKey>" -- just like the key selectors in the "Join" parameters. The difference is that this key does a comparison (testing whether something is "greater", "lesser", or "equal") instead of an equality comparison like we have with "Join".

As a side note, I used the parameter name "r" to stand for "result" -- this is the anonymous type that is returned from the "Join" method.

When we add the "OrderBy", our results are sorted:


Finally, we'll add a filter. For this, we'll use the "Where" method. This takes a predicate like we saw when we were looking at various LINQ methods and predicates earlier.

So we'll add some parameters to our method and use them in our "Where":


The predicate for "Where" is a "Func<TSource, bool>", meaning that we take a particular type as a parameter and return a true/false value. In our case, the "TSource" is the anonymous type that is returned from our "Join".

Again, I use "r" here to stand for "result" (which is our anonymous type), and the lambda expression body returns true or false depending on whether the "OrderDate" property falls within the specified date range.
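Putting the filter in place gives something like the sketch below. To keep it short, two local variables stand in for the method's "startDate" and "endDate" parameters; the sample data is made up, and the inclusive date range is an assumption.

```csharp
using System;
using System.Linq;

// Stand-ins for the method parameters: December 2014.
var startDate = new DateTime(2014, 12, 1);
var endDate = new DateTime(2014, 12, 31);

// Made-up sample data.
var people = new[]
{
    new { Id = 1, FirstName = "Ada", LastName = "Lovelace" },
    new { Id = 2, FirstName = "Alan", LastName = "Turing" },
};
var orders = new[]
{
    new { CustomerId = 2, OrderDate = new DateTime(2014, 12, 15) },
    new { CustomerId = 1, OrderDate = new DateTime(2015, 1, 8) },
    new { CustomerId = 1, OrderDate = new DateTime(2014, 12, 3) },
};

var orderDates = people.Join(orders,
        p => p.Id,
        o => o.CustomerId,
        (p, o) => new { p.LastName, p.FirstName, o.OrderDate })
    .Where(r => r.OrderDate >= startDate && r.OrderDate <= endDate)   // predicate: Func<'a, bool>
    .OrderBy(r => r.OrderDate);

Console.WriteLine(orderDates.Count());   // prints 2
```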

The Fluent Syntax
Once we get several of these methods together, we start to see the fluent syntax. Our call ends up as "Join(...).Where(...).OrderBy(...)".

We end up "piping" the results from one method to another. "Join" returns an IEnumerable<T>, which we use as the input for "Where". "Where" returns an IEnumerable<T>, which we use as the input for "OrderBy". And "OrderBy" returns an IOrderedEnumerable<T> which we use as our result.
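One way to see the "piping" is to break the chain apart and give each intermediate result an explicit type. This is only a sketch: the hypothetical "OrderDateRow" record replaces the anonymous type so that the types can be written out, and the sample data is made up (C# 9 or later).

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

var startDate = new DateTime(2014, 12, 1);
var endDate = new DateTime(2014, 12, 31);

var people = new[] { new Person(1, "Ada", "Lovelace"), new Person(2, "Alan", "Turing") };
var orders = new[]
{
    new Order(2, new DateTime(2014, 12, 15)),
    new Order(1, new DateTime(2015, 1, 8)),
    new Order(1, new DateTime(2014, 12, 3)),
};

// Join returns an IEnumerable<T>...
IEnumerable<OrderDateRow> joined = people.Join(orders,
    p => p.Id,
    o => o.CustomerId,
    (p, o) => new OrderDateRow(p.LastName, p.FirstName, o.OrderDate));

// ...which is the input for Where, which also returns an IEnumerable<T>...
IEnumerable<OrderDateRow> filtered =
    joined.Where(r => r.OrderDate >= startDate && r.OrderDate <= endDate);

// ...which is the input for OrderBy, which returns an IOrderedEnumerable<T>.
IOrderedEnumerable<OrderDateRow> sorted = filtered.OrderBy(r => r.OrderDate);

Console.WriteLine(sorted.Count());   // prints 2

// Hypothetical types for illustration only.
record Person(int Id, string FirstName, string LastName);
record Order(int CustomerId, DateTime OrderDate);
record OrderDateRow(string LastName, string FirstName, DateTime OrderDate);
```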

As we need more functionality (such as grouping or aggregation), we simply "dot" more methods together in the order that we need.

Updated Console Application
If we update our console application to pass in the date parameters, we can run our new method. The final console application runs both of our methods: the query syntax and the fluent syntax. This lets us see the results side-by-side (sort of):


And we can see that our results are the same.
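We can also let the code do the comparison. In this sketch (made-up data, with local variables standing in for the method parameters), both versions produce the same anonymous type, and anonymous types compare by property values, so "SequenceEqual" can check the two results row by row.

```csharp
using System;
using System.Linq;

var startDate = new DateTime(2014, 12, 1);
var endDate = new DateTime(2014, 12, 31);

// Made-up sample data.
var people = new[]
{
    new { Id = 1, FirstName = "Ada", LastName = "Lovelace" },
    new { Id = 2, FirstName = "Alan", LastName = "Turing" },
};
var orders = new[]
{
    new { CustomerId = 2, OrderDate = new DateTime(2014, 12, 15) },
    new { CustomerId = 1, OrderDate = new DateTime(2015, 1, 8) },
    new { CustomerId = 1, OrderDate = new DateTime(2014, 12, 3) },
};

var querySyntax = from p in people
                  join o in orders on p.Id equals o.CustomerId
                  where o.OrderDate >= startDate && o.OrderDate <= endDate
                  orderby o.OrderDate
                  select new { p.LastName, p.FirstName, o.OrderDate };

var fluentSyntax = people.Join(orders,
        p => p.Id,
        o => o.CustomerId,
        (p, o) => new { p.LastName, p.FirstName, o.OrderDate })
    .Where(r => r.OrderDate >= startDate && r.OrderDate <= endDate)
    .OrderBy(r => r.OrderDate);

Console.WriteLine(querySyntax.SequenceEqual(fluentSyntax));   // prints True
```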

As a reminder, the code is available on GitHub: BONUSJoinSyntax branch of jeremybytes/lambdas-and-linq.

Query Syntax or Fluent Syntax?
So the question you probably have is whether you should use the query syntax or the fluent syntax. The answer is that it is entirely up to you. As mentioned previously, it's most important that you are consistent.

My personal preference has been to use the fluent syntax. This is primarily because there are many, many LINQ methods to choose from that are not available in the query syntax. This means that we need to mix query syntax with fluent syntax to get full advantage of LINQ. Because of this, I have a tendency to use the fluent syntax. This lets me stay in the mindset of "Func" and lambda expressions. And since I really love lambdas, this isn't very hard for me.

Wrap Up
The good news is that "Join" is one of the more complex LINQ methods out there. That means if you can understand how to parse the parameters for "Join", you're probably all set to parse the parameters for all of the other LINQ methods as well.

"Join" does have a lot of parameters. But as we've seen, if we take things one step at a time (and don't get frightened off), it's not very hard to implement. Although there are 5 parameters, the first 2 are the collections that we are joining, the next 2 are lambda expressions to tell how the collections are related, and the last parameter is the data transformation to the output that we want.

With this under our belt, we're well on our way to using LINQ methods effectively in our code.

Happy Coding!

4 comments:

  1. What a nice article. The examples are uncomplicated, something that seems to be impossible for most authors of this kind of stuff.

  2. Great articles and YouTube videos on this and Await/Async and Delegates topics.
    I enjoyed them tremendously.

    I'm still however a little puzzled with the following IEnumerable extension method signature.
    How come there is only 1 TKey on line #1 and yet TKey appeared on both lines #4 and #5?
    What is the relationship between those on line #1 and those on lines #2-#6, if there is a relationship?
    (I thought they should be 1:1)

    Line #1 public static IEnumerable Join(
    Line #2 this IEnumerable outer,
    Line #3 IEnumerable inner,
    Line #4 Func outerKeySelector,
    Line #5 Func innerKeySelector,
    Line #6 Func resultSelector
    Line #7 )

  3. sorry, my post above should have the sample method as below:
    (the < and > didn't come through properly the first try)

    Line #1 public static IEnumerable<TResult> Join<TOuter, TInner, TKey, TResult>(
    Line #2 this IEnumerable<TOuter> outer,
    Line #3 IEnumerable<TInner> inner,
    Line #4 Func<TOuter,TKey> outerKeySelector,
    Line #5 Func<TInner,TKey> innerKeySelector,
    Line #6 Func<TOuter,TInner,TResult> resultSelector
    Line #7 )

    Replies
    1. The generic types on Line #1 (TOuter, TInner, TKey, TResult) represent the types that are used in the rest of the method signature. Each represents a different type. In the example, TKey is replaced with the integer type, so Line #4 and #5 use integer. It doesn't mean they use the same value; it means that they use the same type. For more information on Generics, you can take a look at my YouTube series: https://www.youtube.com/playlist?list=PLdbkZkVDyKZURWIWQOw2KubVRIsxdkflI
