Getting Too Cute with C# Yield Return

I ran across a method that returned an IEnumerable<T> recently, and I implicitly typed its return value. During the course of a series of method extractions, code movement, and general refactoring, I wound up with some code that passed the various unit tests in place but failed curiously at runtime. After peering at it for a few minutes and going through once in the debugger, I traced it to a problem that you don’t see every day, and one that probably would have had me tearing my hair out if I didn’t have a good working understanding of what the “yield” keyword in C# does. So today, I’ll present the essence of this problem in the hopes that, if you weren’t aware of it, you are now.


Here is an entire class that contains a nested type and a couple of methods, for illustration purposes. At the bottom is a unit test that will, if you copy this into your scratchpad, fail.

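using System.Collections.Generic;
using System.Linq;
using Microsoft.VisualStudio.TestTools.UnitTesting;

[TestClass]
public class MiscTest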
{
    public class Point
    {
        public int X { get; set; }
        public int Y { get; set; }
    }

    private IEnumerable<Point> GetPoints()
    {
        for (int index = 1; index < 20; index++)
            yield return new Point() { X = index, Y = index * 2 };
    }

    private void DoubleXValue(IEnumerable<Point> points)
    {
        foreach (var point in points)
            point.X *= 2;
    }

    [TestMethod, Owner("ebd"), TestCategory("Proven"), TestCategory("Unit")]
    public void Asdf()
    {
        var points = GetPoints();
        DoubleXValue(points);
            
        Assert.AreEqual<int>(2, points.ElementAt(0).X);
    }
}

It seems pretty straightforward. You have some method that returns a bunch of points, and then you take those points and pass them to a method that iterates through them, performing an operation on each one. So what gives? Why does this fail? Everything looks pretty simple (unlike my situation, where the problem was buried under a few layers of indirection), and yet we get back 1 when we’re expecting 2.

To understand this, it’s important to understand what yield actually does. At its core, the yield keyword is syntactic sugar that tells the compiler to generate a state machine under the hood. Let that sink in for a moment, because it’s actually kind of a wild concept. You’re used to methods that return references to object instances or primitives or collections, but this is something fundamentally different. A method that returns an IEnumerable and does so using yield return isn’t defining a return value–it’s defining a protocol for interacting with client code.

Consider the code example above. The obvious (and, as it turns out, wrong) way to understand the GetPoints() method is, “it generates a collection of points from (1, 2) to (19, 38) and returns it.” But GetPoints() doesn’t return any such thing. In fact, it doesn’t return anything but a promise–a promise to generate points later if asked. So when we say “var points = GetPoints();” what we’re actually saying is, “the points variable references some kind of points machine that will generate points when I ask for them.”
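To make the “points machine” a little more concrete, here is a rough, hand-written approximation of the kind of class the compiler builds from GetPoints(). PointsMachine and its nested Enumerator are hypothetical names chosen for illustration, and the real generated code differs in its details, so treat this as a sketch of the idea rather than the actual compiler output.

```csharp
using System;
using System.Collections;
using System.Collections.Generic;

public class Point
{
    public int X { get; set; }
    public int Y { get; set; }
}

// Hypothetical stand-in for the compiler-generated class behind GetPoints().
public class PointsMachine : IEnumerable<Point>
{
    // Every call hands out a fresh enumerator, i.e. a fresh run of the loop.
    public IEnumerator<Point> GetEnumerator() => new Enumerator();
    IEnumerator IEnumerable.GetEnumerator() => GetEnumerator();

    private sealed class Enumerator : IEnumerator<Point>
    {
        private int _index = 0; // the loop variable, hoisted into a field

        public Point Current { get; private set; }
        object IEnumerator.Current => Current;

        // Advances the loop one step and builds the next point on demand.
        public bool MoveNext()
        {
            _index++;
            if (_index >= 20) return false;
            Current = new Point { X = _index, Y = _index * 2 };
            return true;
        }

        public void Reset() => throw new NotSupportedException();
        public void Dispose() { }
    }
}

public static class Program
{
    public static void Main()
    {
        var machine = new PointsMachine();

        // Prints (1, 2) through (19, 38), one point per MoveNext() call.
        foreach (var point in machine)
            Console.WriteLine($"({point.X}, {point.Y})");
    }
}
```

Nothing in PointsMachine stores a point once it has been handed out. Each foreach asks for a new enumerator, and each MoveNext() call constructs a new Point.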

If we think of it this way, we start to get to the bottom of what’s going wrong here. On the next line of the test, we pass this points machine into the DoubleXValue() method. The DoubleXValue() method iterates through all of the states of the points (state) machine, retrieving points as per the promise. Once it retrieves a point, it doubles the X coordinate and then promptly discards the point. Why? Because nothing else refers to it. When you change one of the points that the points machine spits out, you’re not changing anything about the points machine–you’re not feeding it some kind of new mechanism for point generation. You could think of this as being similar to a method that takes a class factory, requests a bunch of instances from it, modifies them, and then returns. Nothing about the factory is different, and you wouldn’t expect the factory to behave differently if the caller subsequently passed it to another method.
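The factory-like behavior can be observed directly. The following is a minimal sketch that reuses the Point and GetPoints() shapes from the test class. FreshInstancesDemo is a hypothetical name, and the methods are made public and static purely so the example can run on its own.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Point
{
    public int X { get; set; }
    public int Y { get; set; }
}

public static class FreshInstancesDemo
{
    public static IEnumerable<Point> GetPoints()
    {
        for (int index = 1; index < 20; index++)
            yield return new Point { X = index, Y = index * 2 };
    }

    public static void Main()
    {
        var points = GetPoints();

        // Mutates objects that nothing else holds a reference to.
        foreach (var point in points)
            point.X *= 2;

        // Each ElementAt(0) call starts a new enumeration from scratch.
        Point a = points.ElementAt(0);
        Point b = points.ElementAt(0);

        Console.WriteLine(a.X);                          // 1
        Console.WriteLine(object.ReferenceEquals(a, b)); // False
    }
}
```

Two requests for “the first point” produce two different objects, neither of which is the one the loop modified.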

So once the DoubleXValue() method gets done doing, well, nothing of significance, the ElementAt(0) call inside the assertion requests the first sequential element–the first state–from the points machine. That request starts a brand-new run through GetPoints(), so the points machine dutifully builds and spits out a new first state, (1, 2), and the unit test fails. So how do we get it to pass? Well, here’s one way:

[TestMethod, Owner("ebd"), TestCategory("Proven"), TestCategory("Unit")]
public void Asdf()
{
    var points = GetPoints().ToList();
    DoubleXValue(points);
            
    Assert.AreEqual<int>(2, points.ElementAt(0).X);
}

Notice the added ToList() call. This is very important because it means that we’re no longer storing a reference to some kind of points machine but rather to a list of points. This line now says, “Go get me a points machine, iterate through all the states of it, and store those states locally in a list.” Now, the rest of the code behaves in a way that you’re used to because you’re storing an actual, tangible collection instead of a promise to generate a sequence.
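As a small sketch of what “an actual, tangible collection” buys you, the example below checks object identity as well as the value. SnapshotDemo is a hypothetical name, and GetPoints() is made public and static only so the sketch is self-contained.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public class Point
{
    public int X { get; set; }
    public int Y { get; set; }
}

public static class SnapshotDemo
{
    public static IEnumerable<Point> GetPoints()
    {
        for (int index = 1; index < 20; index++)
            yield return new Point { X = index, Y = index * 2 };
    }

    public static void Main()
    {
        // ToList() walks the points machine once and keeps every point it produced.
        List<Point> points = GetPoints().ToList();

        foreach (var point in points)
            point.X *= 2;

        Console.WriteLine(points[0].X); // 2

        // Asking again returns the stored object, not a regenerated one.
        Console.WriteLine(object.ReferenceEquals(points[0], points.ElementAt(0))); // True
    }
}
```

The trade-off is that all nineteen points now live in memory at once, which is trivial here but worth keeping in mind for large or unbounded sequences.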

There is no shortage of posts, documents, and articles explaining the yield return state machine concept or the idea of deferred execution. I encourage you to read those to get a better understanding of the inner mechanics and usage scenarios, respectively. But hopefully this gives you a bit of practical insight that’s easy to wrap your head around into (1) why the code behaves this way and (2) why you have to be careful of providing and consuming IEnumerables. It can be tempting to get too cute with how you provide IEnumerables or too careless with how you consume them, particularly when usage and implementation are separated by inversion of control. So be aware when using IEnumerables that you may not have a list/collection, and be aware when providing them that you’re leaving it up to your clients to decide when to get and store sequence members.
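To illustrate the point about clients deciding when the work happens, here is a minimal sketch that counts how often an iterator body starts running. DeferredDemo, GetNumbers(), and BodyRuns are hypothetical names invented for this example.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class DeferredDemo
{
    // Counts how many times the iterator body has started executing.
    public static int BodyRuns = 0;

    public static IEnumerable<int> GetNumbers()
    {
        BodyRuns++; // runs only once a caller begins enumerating
        for (int index = 1; index < 4; index++)
            yield return index;
    }

    public static void Main()
    {
        var numbers = GetNumbers();
        Console.WriteLine(BodyRuns); // 0: calling the method ran none of its body

        Console.WriteLine(numbers.Sum()); // 6
        Console.WriteLine(BodyRuns);      // 1: Sum() enumerated the sequence

        Console.WriteLine(numbers.First()); // 1
        Console.WriteLine(BodyRuns);        // 2: First() started a second, independent run
    }
}
```

The provider wrote the loop, but the consumer determined both when it ran and how many times.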

  • Steve Gilham

    It’s not just yield return that does this — anything out of a LINQ expression is similarly lazily evaluated. And if your enumeration had been something stateful, like reading bytes from a stream, or a random number generator, the second evaluation would not give the same results as the first.

    In general, data qualified as just IEnumerable, regardless of source, should be regarded as a read-once data structure — so transform it through LINQ to your heart’s content, but reify it as an array or a list before handing it on.

  • http://www.daedtech.com/blog Erik Dietrich

    Good point about the broader based applicability and yield returning something beyond the creation scope of the method. The inspiration for this particular post started out as “this is something specific that happened and here’s why,” but there are certainly more far-reaching complexities with the deferred evaluation paradigm.

  • Timothy Boyce

    Deferred execution can certainly cause some problems if you aren’t careful. ReSharper is great at warning you about most cases where there could be a problem. When I pasted in your code, it warned me about possible multiple enumerations of an IEnumerable.

  • James Curran

    The ToList() is merely a band-aid. The problem is with DoubleXValue(), which modifies the values and then throws them away. The “correct” solution would be:

    var points = GetPoints();
    points = DoubleXValue(points);
    // :
    // :

    private IEnumerable<Point> DoubleXValue(IEnumerable<Point> points)
    {
        foreach (var point in points)
        {
            point.X *= 2;
            yield return point;
        }
    }

    Alternately:

    private IEnumerable<Point> DoubleXValue(IEnumerable<Point> points)
    {
        return points.Select(p => new Point { X = p.X * 2, Y = p.Y });
    }

    or we could componentize it:

    private Point DoubleXValue(Point p) { return new Point { X = p.X * 2, Y = p.Y }; }
    // :
    // :
    var points = GetPoints().Select(DoubleXValue);

  • http://www.daedtech.com/blog Erik Dietrich

    That’s really cool. Another piece of feature envy that I have for R#. Fingers crossed that it makes the Code Rush issues list in an upcoming release.

  • http://www.daedtech.com/blog Erik Dietrich

    The ToList() call was purely instructional — to highlight the difference between storing a deferred execution enumerable as a local and storing the list resulting from walking the enumeration (I thought that would be the best way to contrast them). I definitely like your solution with the return enumeration that also uses yield return — that’s what I wound up doing in the actual code that inspired this post :)

  • http://twitter.com/jcdickinson Jonathan C Dickinson

    This does have quite a bit to do with yield, agreed – but I think it’s also about understanding pointers correctly (pointers in C# you exclaim? Yes guys, reference types are pointers).

  • http://www.tonicodes.net/blog/ Toni Petrina

    R# pointed out immediately that the enumeration is enumerated multiple times, a general no-no :)

  • http://gettingsharper.de/ Carsten König

    Well, this is what you get if you mix “side effects” with stuff from functional programming … you see: just don’t mess with this stuff (use immutable data and pure functions) and you would not run into trouble …

  • Michael Paterson

    What is the Code Rush issue?

  • James Curran

    Reference types are IMPLEMENTED AS pointers (but as is the case with all of OO design — Implementation Is Irrelevant)

  • James Curran

    It’s the “issue list” (bug reports and feature requests) for Code Rush (Developers’ Express’s alternative to Resharper)

  • http://twitter.com/trawk Justin

    Part of the problem is use of the ‘var’ keyword masking types. We are so comfortable with ‘Lists are IEnumerables’ and treating them interchangeably as such, but if you actually had to write IEnumerable as the declared type of a variable, that should immediately give you pause to think very carefully about what you’re doing.

  • http://www.daedtech.com/blog Erik Dietrich

    I can’t speak for anyone else, but I’m not sure if the act of typing the type (as opposed to using CodeRush to flip between explicit/implicit or hovering the mouse over var) would really have an effect on my thinking. Typing the first “Foo” in “Foo foo = GetFoo()” doesn’t really engage my brain to think of the ramifications of the type — it’s just noise. That said, if I’m reading someone else’s code (or leaving this code for someone else I suppose), I see your point — you have a better piece of self-documenting code for someone who understands enumerations to say “careful how you use this.”

  • http://www.daedtech.com/blog Erik Dietrich

    Agreed. That’s the approach I take and prefer to take in reality here, myself. Unfortunately, we don’t always have complete control over the APIs and libraries that we use…. :(

  • http://twitter.com/jcdickinson Jonathan C Dickinson

    Actually implementation is not irrelevant, hence the reason for this blog post. A developer needs to understand that passing reference values around is passing the same piece of memory around. Making a toy OO system in plain ol’ C is a must for any developer (even if it lands up being bad, leaky and whatnot). You need to **understand** the systems that lie underneath your abstraction level, so that you don’t get bitten by issues like this one (and potentially waste time with them).

  • http://www.michielstaessen.be/ Michiel Staessen

    Working with IEnumerable and yield return can be tricky and one should indeed understand the mechanics of deferred execution.

    I experienced this yesterday. I started with .NET only a couple of months ago. I come from Java, so for me, yield return is quite “magical” in the awesome kind of way. I started playing around with it and used it in a performance test where I need to do a nested iteration of 15M and 80 entities. Running the test took very, very long (I started with a smaller number of entities) and I had no clue what was going on.

    Apparently, .NET does not cache the instances that are returned on yield return. Hence, for every iteration in the enumeration, the elements were created again and again in the nested enumeration. Memory consumption was very low (this is the advantage of the state machine), but computational power was wasted.

    The solution was very simple. Just force ToList() on your collection to point to a list instead of to a state machine as explained in this post. This increases the memory usage (because you need to keep the list in memory instead of generating the elements you need on the fly) but avoids unnecessary computation in my case.

    My advice: use yield return with caution and make sure not to do heavy computational stuff inside it (just like you should not do it in properties) because callers generally expect it to return almost instantaneously. If you are doing complex computational stuff, force ToList() on your yield return collection and consider getting more memory if you have to deal with large lists :).

  • http://www.daedtech.com/blog Erik Dietrich

    Hi Michiel,

    Thanks for reading. Like you, I came to C# from Java (and C/C++ before that), but some years ago now, back when the current version of C# was 2.0. My personal impression over these years has been to fall in love with C# since it seems to be identical to Java but time-warped about 2 years in the future. I believe Java only gets lambdas with Java 8, whereas C# has had these built in since 3.0 a few years ago, IIRC.

    Your story does seem to serve as a good cautionary tale for transplants from other languages, since this concept of IEnumerable and deferred execution doesn’t exist out of the box in any other language I’ve worked with. In other words, you are completely correct that C# (or .NET) does not ‘cache’ the object instances that an IEnumerable iterator produces. But then, that’s not the point, which it seems you understand.

    I might suggest that you consider to whom the burden of a performance promise should fall. What I mean is that if you’re doing yield return, all you’re doing is giving a client of your code an assurance that you can provide them with correct objects. You’re making no promises as to how you’re going to do that and how long it’s going to take. A lot of times, you legitimately don’t know (such as retrieving items from a database using IQueryable). So rather than going from deferred to up-front execution “behind the scenes,” I’d think about just returning a list or collection type if you want to make an implied performance promise.

    To put it more concretely, if you were interested in a few random New Yorker phone numbers, I could either return you a URL for a web service that would provide you with the phone numbers one by one, in alphabetical order, or I could just send over a giant list of names and phone numbers. The former is a method that returns IEnumerable of contact info, and the latter is a list of contact info. I’d probably not opt to “force” the former scheme to be the latter. In other words, I wouldn’t tell you that I was delivering you a list of phone numbers but actually just call the web service for every New Yorker and send over the finished product, pretending my IEnumerable return value was really an IList.

    I’d advocate being true to return type interfaces. If I call a method that returns IEnumerable, I should be well versed enough to realize that there’s a good chance I’m soliciting deferred execution. If I want a collection or a list, I should find a method with that return type.

  • http://www.michielstaessen.be/ Michiel Staessen

    Using other return types than IEnumerable is indeed the best solution. It is also a more specific contract for your code. In Java, I would have never used the Collection interface (Java’s equivalent for IEnumerable) as a return type but rather used the List or Set interface. Seems like I should correct myself and start using IList and ISet instead… :)
