Skip to content

Conversation

@nicelhc13
Copy link
Contributor

This PR invokes an LRU-based GC for each task execution. Cholesky on various setting worked correctly. This PR is not optimized well performance yet, but aims to implement a baseline. This PR includes possible bug fixes on the runtime.

@nicelhc13 nicelhc13 requested review from wlruys and yinengy January 19, 2023 19:05
self._device_launched_datamove_task_counts = {
dev: 0 for dev in self._available_resources.get_resources()}

self._lrum = LRUManager()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just looking through this bc we'll be looking at it again soon and been thinking about memory counting.
It should be a separate LRUManager for each device object (memory space), shouldn't it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on the design. My original idea was to have a single GC (not necessarily now but I thouhgt a daemon like a scheduler thread) and manage all memory regardless of its location. It might be simpler than a GC per a device? Could you please let me know your idea and its pros?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The reason we would call into the EvictionPolicy is when a specific device is full and we want to bring more data onto it. This would look at only data on that device to start evicting (which means it needs a device specific hash-list structure for LRU). I am not sure how this would be handled by a global one for all devices without evicting data from other devices unnecessarily.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also (just terminology), specifically I'm using "GC" for cleaning up data that will not be used in the future and "Eviction" for kicking out any data (possibly data that will be used in the future) due to space limitations on a device.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am on board with you. But I think there is a different way to implement what you said. What I meant was the global memory manager can still manage the whole devices. For example, as you said, we maintain a list of arrays for each device and the global one can traverse and evicts/dellocates subset of them to get more memory. Or we can have a dedicated memory manager thread for each device and let each of them manage the owned device parrays. I think this is implementation detail?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants