Efficiently Reading Millions of Rows with Django QuerySets

How do you fetch millions of rows from a database with a Django QuerySet in a memory-efficient way?

Below is a solution that chunks the QuerySet so that only a small subset of rows is kept in memory at a time. While this is somewhat heavier on the database (it issues multiple queries), it seriously reduces memory usage.

Example

import gc


def lazy_fetch_iterator(table, start_pk=0, chunk_size=1000, *args, **kwargs):
    """
    Yield every row of a table by iterating over Django QuerySets ordered by the primary key.

    :param table: the Django model class to iterate over
    :param int start_pk: primary key to start after (exclusive)
    :param int chunk_size: number of rows to fetch per query
    """
    queryset = table.objects
    end_pk = queryset.filter(*args, **kwargs).order_by('-pk').values_list('pk', flat=True).first()
    if end_pk is not None:
        while start_pk < end_pk:
            objs = queryset.filter(*args, pk__gt=start_pk, **kwargs).order_by('pk')[:chunk_size]
            for obj in objs:
                start_pk = obj.pk
                yield obj
            # gc.collect()  # uncomment to force garbage collection after each chunk
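To see how the generator walks the table in chunks, here is a small self-contained harness. `Row`, `FakeQuerySet`, and `FakeModel` are stand-ins invented for this sketch; they mimic just enough of the QuerySet API to drive the function and are not Django classes.

```python
class Row:
    def __init__(self, pk):
        self.pk = pk


class FakeQuerySet:
    # Invented stand-in: NOT a Django class. It supports only the calls
    # that lazy_fetch_iterator actually makes.
    def __init__(self, rows):
        self._rows = list(rows)

    def filter(self, *args, pk__gt=None, **kwargs):
        # Only the pk__gt lookup used by the function is mimicked.
        return FakeQuerySet(r for r in self._rows
                            if pk__gt is None or r.pk > pk__gt)

    def order_by(self, field):
        return FakeQuerySet(sorted(self._rows, key=lambda r: r.pk,
                                   reverse=field.startswith('-')))

    def values_list(self, field, flat=True):
        return FakeQuerySet(getattr(r, field) for r in self._rows)

    def first(self):
        return self._rows[0] if self._rows else None

    def __getitem__(self, index):  # supports the [:chunk_size] slice
        return FakeQuerySet(self._rows[index])

    def __iter__(self):
        return iter(self._rows)


class FakeModel:
    objects = FakeQuerySet(Row(pk) for pk in (1, 3, 7, 8, 12))


def lazy_fetch_iterator(table, start_pk=0, chunk_size=1000, *args, **kwargs):
    # Same chunked walk as the function defined above.
    queryset = table.objects
    end_pk = queryset.filter(*args, **kwargs).order_by('-pk').values_list('pk', flat=True).first()
    if end_pk is not None:
        while start_pk < end_pk:
            objs = queryset.filter(*args, pk__gt=start_pk, **kwargs).order_by('pk')[:chunk_size]
            for obj in objs:
                start_pk = obj.pk
                yield obj


pks = [row.pk for row in lazy_fetch_iterator(FakeModel, chunk_size=2)]
print(pks)  # [1, 3, 7, 8, 12]: every row, in pk order, fetched 2 at a time
```

Note that the pk values need not be contiguous; the walk simply resumes after the last pk seen, so gaps from deleted rows are harmless.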

Why not use queryset.all() to fetch millions of rows from a table?

Django normally caches the whole result set in memory when you iterate over a huge QuerySet. And even when it is not Django holding the objects in memory, the database client library often is.

For example, when running cron jobs that scan large tables this way, those jobs can take up far too much memory.

Why not use iterator()?

When you iterate over a QuerySet with more than a million rows, adding .iterator() to your QuerySet helps somewhat, because Django stops caching the model instances, but the database client can still load the entire raw result set into memory.

Note that the chunked implementation above does not support arbitrarily ordered QuerySets; it always iterates in primary-key order.

Why do we filter on the pk?

We filter on the pk because it is faster than slicing with [offset:limit] directly: a query with a large offset on a big table is very slow in MySQL. For LIMIT 1 OFFSET 1000000, the MySQL server fetches 1,000,001 rows and discards 1,000,000 of them.
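The difference between the two pagination styles can be sketched in plain SQL. The snippet below uses sqlite3 so it runs self-contained (the article's point is about MySQL, where the cost gap on large offsets is far more pronounced); the `item` table is invented for the demo.

```python
import sqlite3

# Build a small table with pks 1..10000.
conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE item (pk INTEGER PRIMARY KEY, val TEXT)')
conn.executemany('INSERT INTO item (pk, val) VALUES (?, ?)',
                 [(i, 'v%d' % i) for i in range(1, 10001)])

# OFFSET pagination: the server must walk past every skipped row.
offset_row = conn.execute(
    'SELECT pk FROM item ORDER BY pk LIMIT 1 OFFSET 5000').fetchone()

# Keyset pagination: jump straight to the position via the pk index.
keyset_row = conn.execute(
    'SELECT pk FROM item WHERE pk > ? ORDER BY pk LIMIT 1', (5000,)).fetchone()

print(offset_row, keyset_row)  # (5001,) (5001,)
```

Both queries return the same row, but the keyset form does a single index seek, while the offset form scans and discards 5,000 rows first.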


This article was written or compiled by 络壳; please credit the source when reposting.