[cudagraph] fix verbose graph logging #126694

Closed
wants to merge 4 commits into from

youkaichao
Collaborator

According to the CUDA Runtime API documentation:

enum cudaGraphDebugDotFlags
CUDA Graph debug write options

Values
cudaGraphDebugDotFlagsVerbose = 1<<0
Output all debug data as if every debug flag is enabled
cudaGraphDebugDotFlagsKernelNodeParams = 1<<2
Adds cudaKernelNodeParams to output
cudaGraphDebugDotFlagsMemcpyNodeParams = 1<<3
Adds cudaMemcpy3DParms to output
cudaGraphDebugDotFlagsMemsetNodeParams = 1<<4
Adds cudaMemsetParams to output
cudaGraphDebugDotFlagsHostNodeParams = 1<<5
Adds cudaHostNodeParams to output
cudaGraphDebugDotFlagsEventNodeParams = 1<<6
Adds cudaEvent_t handle from record and wait nodes to output
cudaGraphDebugDotFlagsExtSemasSignalNodeParams = 1<<7
Adds cudaExternalSemaphoreSignalNodeParams values to output
cudaGraphDebugDotFlagsExtSemasWaitNodeParams = 1<<8
Adds cudaExternalSemaphoreWaitNodeParams to output
cudaGraphDebugDotFlagsKernelNodeAttributes = 1<<9
Adds cudaKernelNodeAttrID values to output
cudaGraphDebugDotFlagsHandles = 1<<10
Adds node handles and every kernel function handle to output
cudaGraphDebugDotFlagsConditionalNodeParams = 1<<15
Adds cudaConditionalNodeParams to output

1 << 10 is not the most verbose flag; it only adds node handles and every kernel function handle to the output. The most verbose flag is 1 << 0, named cudaGraphDebugDotFlagsVerbose.

Here is an example graph, dumped with 1 << 10:
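
To make the bit values concrete, here is a minimal Python sketch that mirrors the documented enum. The class name `GraphDebugDotFlags` and its member names are hypothetical, chosen only for illustration; the numeric values are copied from the list above. It is not part of PyTorch or CUDA.

```python
from enum import IntFlag


class GraphDebugDotFlags(IntFlag):
    """Illustrative mirror of cudaGraphDebugDotFlags (names are hypothetical)."""

    VERBOSE = 1 << 0
    KERNEL_NODE_PARAMS = 1 << 2
    MEMCPY_NODE_PARAMS = 1 << 3
    MEMSET_NODE_PARAMS = 1 << 4
    HOST_NODE_PARAMS = 1 << 5
    EVENT_NODE_PARAMS = 1 << 6
    EXT_SEMAS_SIGNAL_NODE_PARAMS = 1 << 7
    EXT_SEMAS_WAIT_NODE_PARAMS = 1 << 8
    KERNEL_NODE_ATTRIBUTES = 1 << 9
    HANDLES = 1 << 10
    CONDITIONAL_NODE_PARAMS = 1 << 15


# 1 << 10 selects only the "handles" flag.
print(GraphDebugDotFlags(1 << 10).name)  # HANDLES

# 1 << 0 is the flag documented as "verbose".
print(GraphDebugDotFlags(1 << 0).name)  # VERBOSE

# The two flags are distinct bits: passing HANDLES does not set VERBOSE.
print(bool(GraphDebugDotFlags.HANDLES & GraphDebugDotFlags.VERBOSE))  # False
```

Note that the verbose flag is its own bit rather than the bitwise OR of the others; per the documentation, the driver treats it as if every debug flag were enabled.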

digraph dot {
subgraph cluster_1 {
label="graph_1" graph[style="dashed"];
"graph_1_node_0"[style="solid" shape="rectangle" label="0
MEM_ALLOC
node handle: 0x000055D2889750F0
"];

"graph_1_node_1"[style="bold" shape="octagon" label="1
_Z3addPhS_S_m
node handle: 0x000055D288979A20
func handle: 0x000055D288978D40
"];

"graph_1_node_2"[style="solid" shape="trapezium"label="2
MEMCPY
node handle: 0x000055D28897A130
(DtoH,1024)
"];

"graph_1_node_3"[style="solid" shape="rectangle" label="3
MEM_FREE
node handle: 0x000055D2889890C0
"];

"graph_1_node_0" -> "graph_1_node_1";
"graph_1_node_1" -> "graph_1_node_2";
"graph_1_node_2" -> "graph_1_node_3";
}
}

The same graph dumped with 1 << 0:

digraph dot {
subgraph cluster_1 {
label="graph_1" graph[style="dashed"];
"graph_1_node_0"[style="solid" shape="record" label="{
MEM_ALLOC
| {{ID | node handle} | {0 (topoId: 3) | 0x000055D2889750F0}}
| {{{poolProps | {allocType | handleTypes | {location | {type | id}}} | {PINNED | NONE | DEVICE | 0}}}}
| {{bytesize | dptr} | {1024 | 0x0000000A02000000}}
}"];

"graph_1_node_1"[style="bold" shape="record" label="{KERNEL
| {ID | 1 (topoId: 2) | _Z3addPhS_S_m\<\<\<4,256,0\>\>\>}
| {{node handle | func handle} | {0x000055D288979A20 | 0x000055D288978D40}}
| {accessPolicyWindow | {base_ptr | num_bytes | hitRatio | hitProp | missProp} | {0x0000000000000000 | 0 | 0.000000 | N | N}}
| {cooperative | 0}
| {priority | 0}
}"];

"graph_1_node_2"[style="solid" shape="record" label="{
MEMCPY
| {{ID | node handle} | {2 (topoId: 1) | 0x000055D28897A130}}
| {kind | DtoH (DEVICE to HOST PAGEABLE)}
| {{srcPtr | dstPtr} | {pitch | ptr | xsize | ysize | pitch | ptr | xsize | ysize} | {0 | 0x0000000A02000000 | 0 | 0 | 0 | 0x000055D287CA6DB0 | 0 | 0}}
| {{srcPos | {{x | 0} | {y | 0} | {z | 0}}} | {dstPos | {{x | 0} | {y | 0} | {z | 0}}} | {Extent | {{Width | 1024} | {Height | 1} | {Depth | 1}}}}
}"];

"graph_1_node_3"[style="solid" shape="record" label="{
MEM_FREE
| {{ID | node handle} | {3 (topoId: 0) | 0x000055D2889890C0}}
| {{dptr} | {0x0000000A02000000}}
}"];

"graph_1_node_0" -> "graph_1_node_1" [headlabel=0];
"graph_1_node_1" -> "graph_1_node_2" [headlabel=0];
"graph_1_node_2" -> "graph_1_node_3" [headlabel=0];
}
}
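
For anyone who wants to check which kind of dump they have, here is a rough Python sketch that summarizes a DOT dump using only the standard library. The helper name `summarize_dump` is hypothetical, and the "looks verbose" check is a heuristic based solely on the two dumps above (the verbose one uses `shape="record"`); it is an assumption, not documented CUDA behavior. The embedded text is a shortened excerpt of the 1 << 10 dump.

```python
import re

# Shortened excerpt of the 1 << 10 dump quoted above (two nodes, one edge).
HANDLES_DUMP = '''digraph dot {
subgraph cluster_1 {
label="graph_1" graph[style="dashed"];
"graph_1_node_0"[style="solid" shape="rectangle" label="0
MEM_ALLOC
node handle: 0x000055D2889750F0
"];

"graph_1_node_1"[style="bold" shape="octagon" label="1
_Z3addPhS_S_m
node handle: 0x000055D288979A20
func handle: 0x000055D288978D40
"];

"graph_1_node_0" -> "graph_1_node_1";
}
}
'''

# Node definitions start with a quoted name followed directly by '['.
NODE_RE = re.compile(r'^"(graph_\d+_node_\d+)"\[', re.MULTILINE)
# Edges look like: "graph_1_node_0" -> "graph_1_node_1"
EDGE_RE = re.compile(
    r'^"(graph_\d+_node_\d+)" -> "(graph_\d+_node_\d+)"', re.MULTILINE
)


def summarize_dump(dot_text):
    """Return node names, edges, and a heuristic guess at verbosity."""
    return {
        "nodes": NODE_RE.findall(dot_text),
        "edges": EDGE_RE.findall(dot_text),
        # Assumption: only the verbose dump above uses record-shaped nodes.
        "looks_verbose": 'shape="record"' in dot_text,
    }


summary = summarize_dump(HANDLES_DUMP)
print(summary["nodes"])
print(summary["edges"])
print(summary["looks_verbose"])
```

Running the same function on the 1 << 0 dump should report `looks_verbose` as true, since every node there is emitted with `shape="record"`.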

@youkaichao requested a review from eqy as a code owner on May 20, 2024 17:03

pytorch-bot bot commented May 20, 2024

🔗 Helpful Links

🧪 See artifacts and rendered test results at hud.pytorch.org/pr/126694

Note: Links to docs will display an error until the docs builds have been completed.

✅ No Failures

As of commit ab1b4ec with merge base 8c38d0c:
💚 Looks good so far! There are no failures yet. 💚

This comment was automatically generated by Dr. CI and updates every 15 minutes.

@youkaichao
Collaborator Author

@pytorchbot merge

@pytorch-bot added the ciflow/trunk label (Trigger trunk jobs on your pull request) on May 20, 2024
@pytorchmergebot
Collaborator

Merge failed

Reason: Approvers from one of the following sets are needed:

  • superuser (pytorch/metamates)
  • Core Reviewers (mruberry, lezcano, Skylion007, ngimel, peterbell10)
  • Core Maintainers (soumith, gchanan, ezyang, dzhulgakov, malfet)

Failing merge rule: Core Maintainers

@youkaichao
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge failed

Reason: This PR needs a release notes: label
If your changes are user facing and intended to be a part of release notes, please use a label starting with release notes:.

If not, please add the topic: not user facing label.

To add a label, you can comment to pytorchbot, for example
@pytorchbot label "topic: not user facing"

For more information, see
https://github.com/pytorch/pytorch/wiki/PyTorch-AutoLabel-Bot#why-categorize-for-release-notes-and-how-does-it-work.


@youkaichao added the "topic: not user facing" and "release notes: cuda" labels and removed the "topic: not user facing" label on May 21, 2024
@youkaichao
Collaborator Author

@pytorchbot merge

@pytorchmergebot
Collaborator

Merge started

Your change will be merged once all checks pass (ETA 0-4 Hours).

Learn more about merging in the wiki.

Questions? Feedback? Please reach out to the PyTorch DevX Team

Advanced Debugging
Check the merge workflow status
here

@youkaichao deleted the youkaichao-patch-1 branch on May 21, 2024 00:57
@youkaichao
Collaborator Author

Since this fix might take quite a long time (at least several months) to reach a public release, I have a workaround for existing PyTorch versions that predate this PR. Please check the gist if you are interested.

Labels
ciflow/trunk Trigger trunk jobs on your pull request Merged open source release notes: cuda release notes category

5 participants