Bucketizing – A simple approach for solving hidden memory issues

By Ori Pearl

A developer through and through, Ori has been coding since age nine. Prior to Tipalti, he was with the Israel Defense Forces for nearly six years. He began as a software engineer, was promoted to Technical Lead, and eventually became the Head of R&D as an engineering group manager. He joined Tipalti in 2013 as a software developer and has since taken on increasing responsibilities as team lead, Director of R&D, Vice President of Engineering, and now Senior Vice President of Engineering reporting to the CEO. Ori holds a B.Sc. degree in Computer Science from the Technion, Israel Institute of Technology.

Updated December 20, 2024

Sometimes, seemingly simple loops may hide memory consumption bugs. Let’s look at the following C# code snippet that’s responsible for doing maintenance on a list of users.

    long[] userIds = GetUserIdsForMaintenance();
    using (DbContext dbContext = new DbContext())
    {
        foreach (long id in userIds)
        {
            User user = dbContext.GetUser(id);
            // ... Do maintenance on user ...
        }
    }

As implied, each dbContext.GetUser(id) call issues a DB query that fetches a User. Many popular O/R mapping frameworks, such as Entity Framework or NHibernate, cache the entities they fetch from the DB, so in our example every fetched User might be kept by the framework in its first-level cache (more about first-level caching: Entity Framework, NHibernate).

When our userIds list is very long, this cache can quickly fill up to a point where we run out of memory and receive an OutOfMemoryException.

How Bucketizing can help memory issues

One way to avoid these memory issues without turning off the caching feature is to periodically clear the cache before it fills up.

An easy way to do that would be to split our userIds into buckets and for each bucket to initialize a new DbContext instance:

    IEnumerable<long> userIds = GetUserIdsForMaintenance();
    foreach (IEnumerable<long> idBucket in userIds.Bucketize(5000))
    {
        // A fresh DbContext per bucket starts with an empty first-level cache.
        using (DbContext dbContext = new DbContext())
        {
            foreach (long id in idBucket)
            {
                User user = dbContext.GetUser(id);
                // … Do maintenance on user …
            }
        }
    }

What we see here is a new extension method called Bucketize that splits the long userIds list into buckets, each containing up to 5,000 IDs (the last bucket may be smaller).

For each bucket, we create a new instance of DbContext. Once the using block ends, the old instance and its cache are no longer referenced, so the garbage collector can collect the entire object and free all of its memory. This effectively clears the cache after every bucket.

What does the Bucketize code look like?

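    // Requires `using System.Collections.Generic;` and must be declared inside a static class.
    public static IEnumerable<IEnumerable<T>> Bucketize<T>(this IEnumerable<T> vals, int bucketSize)
    {
        var currentList = new List<T>();
        foreach (var element in vals)
        {
            if (currentList.Count == bucketSize)
            {
                // The current bucket is full: hand it to the caller and start a new one.
                yield return currentList;
                currentList = new List<T>();
            }
            currentList.Add(element);
        }
        if (currentList.Count == 0)
        {
            // Only happens when the source was empty, so there is no final bucket.
            yield break;
        }
        yield return currentList;
    }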

As you can see, Bucketize is an extension method for IEnumerable<T> that uses the yield keyword to build the next bucket only when the caller asks for it, rather than iterating over the entire collection up front.
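
To make the lazy behavior concrete, here is a minimal sketch that is not part of the original code base. It repeats the Bucketize method inside a hypothetical static class named BucketizeDemo, feeds it a source sequence that records every element it hands out, and prints how many source elements had been read when each bucket arrived. The class name, the LoggedRange helper, and the sample sizes (7 elements, buckets of 3) are all assumptions chosen for illustration.

```csharp
using System;
using System.Collections.Generic;

public static class BucketizeDemo
{
    // Same logic as the Bucketize method above, repeated so this sketch compiles on its own.
    public static IEnumerable<IEnumerable<T>> Bucketize<T>(this IEnumerable<T> vals, int bucketSize)
    {
        var currentList = new List<T>();
        foreach (var element in vals)
        {
            if (currentList.Count == bucketSize)
            {
                yield return currentList;
                currentList = new List<T>();
            }
            currentList.Add(element);
        }
        if (currentList.Count > 0)
        {
            yield return currentList;
        }
    }

    // Hypothetical helper: produces 1..count and records each element that is actually pulled.
    public static IEnumerable<long> LoggedRange(int count, List<long> pulled)
    {
        for (long i = 1; i <= count; i++)
        {
            pulled.Add(i);
            yield return i;
        }
    }

    public static void Main()
    {
        var pulled = new List<long>();
        foreach (IEnumerable<long> bucket in LoggedRange(7, pulled).Bucketize(3))
        {
            // pulled.Count shows how far into the source we had to read to get this bucket.
            Console.WriteLine($"bucket [{string.Join(", ", bucket)}] after pulling {pulled.Count} source elements");
        }
    }
}
```

In this sketch the first bucket arrives after only four of the seven source elements have been read (the method looks one element ahead before handing over a full bucket), not after the whole sequence has been consumed. On .NET 6 and later, the built-in Enumerable.Chunk method provides similar batching and may be worth considering instead of a hand-written helper.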

“Bucketizing” large data collections can help us overcome memory issues that are sometimes hidden behind seemingly simple loops.
