Commit ad6f513

Update how_to_debug_python.md
@Rperry2174 The flow of information was very good. I was able to walk through this guide to better understand the flame graph for my Go application. Fixed some wording, punctuation, and grammar. Biggest change was "Flamegraph" to "flame graph" (confirmed spelling and capitalization on Gregg's website). GH doesn't have spell check so I would go through one more time and look for those minor spelling errors. I might add something about how your flame graph model is inverted and how that increases readability since the functions with the highest CPU usage are right at the top instead of at the bottom. LMK if you have any questions or comments about the edits. -KP
1 parent 0ffb1ee commit ad6f513

1 file changed: +23 -22 lines changed

examples/how_to_debug_python.md

Lines changed: 23 additions & 22 deletions
@@ -1,44 +1,45 @@
11
# How to Debug Performance Issues in Python
2-
#### Using Flamgraphs to get to the root of the problem
2+
#### Using flame graphs to get to the root of the problem
33

4-
I know from personal experience that debugging performance issues on Python servers can be incredibly hard. Usually, there was some event like increased traffic or a transient bug that causes end users to report that somethings wrong.
4+
I know from personal experience that debugging performance issues on Python servers can be incredibly frustrating. Usually, increased traffic or a transient bug would cause end users to report that something was wrong.
55

6-
More often than not, its _impossible_ to exactly replicate the conditions under which the bug occured and so I was stuck trying to figure out which part of our code/infrastructure is responsible for this performance issue on our server.
6+
More often than not, it's _impossible_ to exactly replicate the conditions under which the bug occurred, and so I was stuck trying to figure out which part of our code/infrastructure was responsible for the performance issue on our server.
77

8-
This article explains how to use Flamegraphs to continuously monitor you code and show you exactly which lines are responsible for these performance issues.
8+
This article explains how to use flame graphs to continuously profile your code and reveal exactly which lines are responsible for those pesky performance issues.
99

1010
## Why You should care about CPU performance
11-
CPU performance is one of the main indicators used by pretty much every company that runs their software in the cloud (i.e. on AWS, Google Cloud, etc).
11+
CPU utilization is a metric of application performance commonly used by companies that run their software in the cloud (e.g. on AWS, Google Cloud, etc.).
1212

13-
In fact, Netflix performance architect, Brendan Gregg, mentioned that decreasing CPU usage even just 1% is seen as an enormous improvement because of the resource savings that occur at that scale. However, smaller companies also see similar benefits when improving performance, because regardless of size, CPU is often directly correlated with two very important facets of a software business:
14-
1. How much money you're spending on servers - The more CPU resources you need the more it costs to run servers
15-
2. End-user experience - The more load that is placed on your servers CPUs, the slower your website or server becomes
13+
In fact, Netflix performance architect Brendan Gregg mentioned that decreasing CPU usage by even 1% is seen as an enormous improvement because of the resource savings that occur at that scale. However, smaller companies can see similar benefits when improving performance because regardless of size, CPU is often directly correlated with two very important facets of running software:
14+
1. How much money you're spending on servers - The more CPU resources you need, the more it costs to run servers
15+
2. End-user experience - The more load placed on your server's CPUs, the slower your website or server becomes
1616

17-
So, when you see a graph that looks like this, which is what you typically see in a tool like AWS that shows high-level metrics:
17+
So when you see a graph of CPU utilization that looks like this:
1818
![image](https://user-images.githubusercontent.com/23323466/105274459-1a341980-5b52-11eb-9807-cf91351d9bf2.png)
1919

20-
You can, assume that during this period of 100% CPU utilization:
21-
- your end-users are likely having a diminished experience (i.e. App / Website is loading slow)
22-
- your server costs are going to increase because you need to provision new servers to handle the increased load
20+
During the period of 100% CPU utilization, you can assume:
21+
- End-users are having a frustrating experience (e.g. App / Website is loading slowly)
22+
- Server costs will increase after you provision new servers to handle the additional load
2323

24-
But, the main problem is that you don't know _why_ these things or happening. **Which part of the code is responsible?** That's where Flamegraphs come in.
24+
The question is: **which part of the code is responsible for the increase in CPU utilization?** That's where flame graphs come in!
2525

26-
## How to use Flame graphs to debug performance issues and save money
27-
Let's say that this Flamegraph represents the timespan that corresponds with the "incident" where CPU usage spiked in the picture above. What that would indicate is that during this spike, you're servers CPUs were spending:
26+
## How to use flame graphs to debug performance issues and save money
27+
Let's say the flame graph below represents the timespan that corresponds with the "incident" in the picture above where CPU usage spiked. During this spike, the server's CPUs were spending:
2828
- 75% of time in `foo()`
2929
- 25% of time in `bar()`
3030

3131
![pyro_python_blog_example_00-01](https://user-images.githubusercontent.com/23323466/105620812-75197b00-5db5-11eb-92af-33e356d9bb42.png)
3232

33-
You can think of a Flamegraph like a super detailed pie chart, where the biggest nodes are taking up most of the CPU resources.
34-
- The width represents 100% of the time range
33+
You can think of a flame graph like a super detailed pie chart, where:
34+
- The width of the flame graph represents 100% of the time range
3535
- Each node represents a function
36+
- The biggest nodes are taking up most of the CPU resources
3637
- Each node is called by the node above it
3738

38-
In this case, `foo()` is taking up the bulk of the time (75%), so we can look at it improving `foo()` and the functions it calls in order to decrease our CPU usage (and $$).
39+
In this case, `foo()` is taking up 75% of the total time range, so we can improve `foo()` and the functions it calls in order to decrease our CPU usage (and save $$).
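
The percentages above can be reproduced from raw profiler samples. The following is a minimal sketch, assuming samples arrive as semicolon-separated call stacks followed by a count (the "folded" format used by Brendan Gregg's flame graph tools). The sample data and the `node_shares` helper are hypothetical and not part of Pyroscope; the helper aggregates by function name, which matches per-node widths only when each function appears in a single call path.

```python
from collections import Counter


def node_shares(folded_samples):
    """Return each function's share of all samples (its width in the flame graph)."""
    totals = Counter()
    total_samples = 0
    for line in folded_samples:
        # Each line looks like "main;foo 75": a call stack, then a sample count.
        stack, count = line.rsplit(" ", 1)
        count = int(count)
        total_samples += count
        # Every function on the stack was active during these samples.
        # Using a set avoids double-counting recursive calls.
        for frame in set(stack.split(";")):
            totals[frame] += count
    return {frame: n / total_samples for frame, n in totals.items()}


# Hypothetical samples matching the example: foo() 75%, bar() 25%.
samples = ["main;foo 75", "main;bar 25"]
shares = node_shares(samples)
for frame, share in sorted(shares.items(), key=lambda kv: -kv[1]):
    print(f"{frame}: {share:.0%}")
```

The root (`main` here) always spans 100% of the width, and each child's width is its share of the sampled time.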
3940

40-
## Creating a Flamegraph and Table with Pyroscope
41-
To create this example in actual code we'll use Pyroscope - an open-source continuous profiler that was built specifically for the use case of debugging performance issues. To simulate the server doing work, I've created a `work(duration)` function that simply simulates doing work for the duration passed in. This way, we can replicate `foo()` taking 75% of time and `bar()` taking 25% of the time by producing this flamegraph from the code beneath it.
41+
## Creating a flame graph and Table with Pyroscope
42+
To recreate this example with actual code, we'll use Pyroscope - an open-source continuous profiler that was built specifically for debugging performance issues. To simulate the server doing work, I've created a `work(duration)` function that simulates doing work for the duration passed in. This way, we can replicate `foo()` taking 75% of time and `bar()` taking 25% of the time by producing this flame graph from the code below:
4243

4344
![image](https://user-images.githubusercontent.com/23323466/105621021-f96cfd80-5db7-11eb-8ceb-055ffd4bbcd1.png)
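
The body of `work()` falls outside the hunks shown in this diff. A minimal sketch of the simulation, assuming `work()` is a plain busy loop whose iteration count stands in for CPU time (the loop body and the `simulated_share` helper are assumptions, and the Pyroscope setup is omitted):

```python
def work(n):
    """Burn CPU for n loop iterations and return the iteration count."""
    i = 0
    while i < n:
        i += 1
    return i


def foo():
    return work(75000)


def bar():
    return work(25000)


def simulated_share():
    """Share of the total simulated work done by each function."""
    foo_units, bar_units = foo(), bar()
    total = foo_units + bar_units
    return {"foo": foo_units / total, "bar": bar_units / total}


print(simulated_share())
```

Because both functions run the same loop, the ratio of their iteration counts approximates the ratio of CPU time a profiler would attribute to them.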
4445

@@ -58,7 +59,7 @@ def foo():
5859
def bar():
5960
    work(25000)
6061
```
61-
Then, let's say you optimize your code to decrease `foo()` time from 75000 to 8000, but left all other portions of the code the same. The new code and flamegraph would look like:
62+
Then, let's say you optimize your code to decrease `foo()` time from 75000 to 8000 but leave all other portions of the code the same. The new code and flame graph would look like:
6263

6364
![image](https://user-images.githubusercontent.com/23323466/105621075-a9db0180-5db8-11eb-9716-a9b643b9ff5e.png)
6465

@@ -79,6 +80,6 @@ def b():
7980
    work(25000)
8081
```
8182

82-
What this means is that your total cpu utilization decreased 66%. If you were paying $100,000 dollars for your servers, you could now manage the same load for $66,000.
83+
This means your total CPU utilization decreased 67% (from 100,000 units of work to 33,000). If you were paying $100,000 for your servers, you could now manage the same load for just $33,000.
8384

8485
![image](https://user-images.githubusercontent.com/23323466/105621350-659d3080-5dbb-11eb-8a25-bf358458e5ac.png)
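
The savings can be worked out from the two `work()` budgets. This is a minimal sketch, assuming server cost scales linearly with total simulated work (an assumption for illustration, not something Pyroscope reports); the `cpu_savings` helper is hypothetical.

```python
def cpu_savings(before, after, current_cost):
    """Return (fractional reduction in work, cost of handling the same load afterwards)."""
    total_before = sum(before.values())
    total_after = sum(after.values())
    reduction = (total_before - total_after) / total_before
    # Assumption: cost is proportional to total CPU work.
    new_cost = current_cost * total_after / total_before
    return reduction, new_cost


before = {"foo": 75000, "bar": 25000}  # original work() budgets
after = {"foo": 8000, "bar": 25000}    # foo() optimized, bar() unchanged

reduction, new_cost = cpu_savings(before, after, current_cost=100_000)
print(f"reduction: {reduction:.0%}, new cost: ${new_cost:,.0f}")
```

Under that linear-cost assumption, total work drops from 100,000 units to 33,000, a 67% reduction, so a $100,000 server bill would fall to about $33,000.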
