My Miscellaneous Notes and Memos

DevOps/SRE

Monitoring Methodologies

peanutwalnut 2024. 9. 8. 18:50

USE method

https://www.brendangregg.com/usemethod.html

For every resource, check utilization, saturation, and errors.

Resources

CPU: sockets, cores, hardware threads (virtual CPUs)
Memory: capacity
Network interfaces
Storage devices: I/O, capacity
Controllers: storage, network cards
Interconnects: CPU, memory, I/O

Utilization: the average time that the resource was busy servicing work
Saturation: the degree to which the resource has extra work which it can't service, often queued
Examples: CPU run queue, disk I/O queue, network queue, container CPU throttling, CPU I/O wait load, network bandwidth, bursting
Errors: the count of error events
Examples: HTTP 5xx errors, disk I/O errors, network packet loss
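
As a rough illustration of the USE checklist, the sketch below turns one sample of a resource into the three USE values and flags which ones look suspicious. The `ResourceSample` shape, the function names, and the 0.8 utilization limit are all assumptions made for this example; they do not come from the USE method page.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """One observation of a resource over a sampling interval (hypothetical shape)."""
    name: str
    busy_seconds: float      # time the resource spent servicing work
    interval_seconds: float  # length of the sampling interval
    queue_length: int        # work waiting that could not be serviced yet
    error_count: int         # error events seen during the interval


def use_report(sample: ResourceSample) -> dict:
    """Summarize a sample as utilization, saturation, and errors."""
    return {
        "resource": sample.name,
        "utilization": sample.busy_seconds / sample.interval_seconds,
        "saturation": sample.queue_length,
        "errors": sample.error_count,
    }


def use_findings(report: dict, utilization_limit: float = 0.8) -> list:
    """Return the USE dimensions that deserve a closer look.

    The 0.8 limit is an assumed placeholder, not a recommended value.
    """
    findings = []
    if report["utilization"] >= utilization_limit:
        findings.append("utilization")
    if report["saturation"] > 0:
        findings.append("saturation")
    if report["errors"] > 0:
        findings.append("errors")
    return findings


if __name__ == "__main__":
    disk = ResourceSample("disk0", busy_seconds=45.0, interval_seconds=60.0,
                          queue_length=3, error_count=0)
    report = use_report(disk)
    print(report, use_findings(report))
```

In practice the inputs would come from whatever the platform exposes (for example OS counters or a metrics agent); the point here is only that each resource gets the same three questions.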

RED method

https://www.youtube.com/watch?v=TJLpYXbnfQ4

Rate : the number of requests per second
Errors : the number of those requests that are failing
Duration : the amount of time those requests take
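
A minimal sketch of how the three RED values could be derived from a batch of request records collected over a fixed window. The `Request` record and `red_summary` function are hypothetical names for this example, and the mean is used for duration only to keep the code short; real systems usually look at percentiles or histograms.

```python
from dataclasses import dataclass


@dataclass
class Request:
    """A completed request (hypothetical record shape)."""
    duration_seconds: float
    failed: bool


def red_summary(requests: list, window_seconds: float) -> dict:
    """Compute Rate, Errors, and Duration for requests seen in one window."""
    total = len(requests)
    failures = sum(1 for r in requests if r.failed)
    mean_duration = (
        sum(r.duration_seconds for r in requests) / total if total else 0.0
    )
    return {
        "rate_per_second": total / window_seconds,  # Rate
        "errors": failures,                         # Errors
        "mean_duration_seconds": mean_duration,     # Duration
    }


if __name__ == "__main__":
    batch = [
        Request(0.25, False),
        Request(0.5, False),
        Request(0.25, True),
        Request(1.0, False),
    ]
    print(red_summary(batch, window_seconds=2.0))
```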

Google SRE

https://sre.google/sre-book/monitoring-distributed-systems/

The Four Golden Signals

The four golden signals of monitoring are latency, traffic, errors, and saturation. If you can only measure four metrics of your user-facing system, focus on these four.

Latency

The time it takes to service a request. It’s important to distinguish between the latency of successful requests and the latency of failed requests. For example, an HTTP 500 error triggered due to loss of connection to a database or other critical backend might be served very quickly; however, as an HTTP 500 error indicates a failed request, factoring 500s into your overall latency might result in misleading calculations. On the other hand, a slow error is even worse than a fast error! Therefore, it’s important to track error latency, as opposed to just filtering out errors.
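
The distinction above can be sketched by keeping two latency series instead of one. This example assumes requests arrive as `(status_code, latency_seconds)` pairs and treats any status of 500 or above as a failure; both choices are illustrative assumptions, not something the SRE book prescribes.

```python
from statistics import mean


def latency_by_outcome(requests):
    """Split latencies into successful and failed requests and average each.

    `requests` is an iterable of (status_code, latency_seconds) pairs.
    A group with no requests is reported as None.
    """
    requests = list(requests)
    ok = [latency for status, latency in requests if status < 500]
    failed = [latency for status, latency in requests if status >= 500]
    return {
        "success_mean_seconds": mean(ok) if ok else None,
        "failure_mean_seconds": mean(failed) if failed else None,
    }


if __name__ == "__main__":
    observed = [(200, 0.5), (200, 1.5), (500, 0.25), (500, 0.75)]
    # Fast 500s would drag a combined average down and hide slow successes.
    print(latency_by_outcome(observed))
```

Tracking the failure series separately also makes a slow error visible, rather than discarding it along with the fast ones.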

Traffic

A measure of how much demand is being placed on your system, measured in a high-level system-specific metric. For a web service, this measurement is usually HTTP requests per second, perhaps broken out by the nature of the requests (e.g., static versus dynamic content). For an audio streaming system, this measurement might focus on network I/O rate or concurrent sessions. For a key-value storage system, this measurement might be transactions and retrievals per second.

Errors

The rate of requests that fail, either explicitly (e.g., HTTP 500s), implicitly (for example, an HTTP 200 success response, but coupled with the wrong content), or by policy (for example, "If you committed to one-second response times, any request over one second is an error"). Where protocol response codes are insufficient to express all failure conditions, secondary (internal) protocols may be necessary to track partial failure modes. Monitoring these cases can be drastically different: catching HTTP 500s at your load balancer can do a decent job of catching all completely failed requests, while only end-to-end system tests can detect that you’re serving the wrong content.
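
The three kinds of failure described above (explicit, implicit, by policy) can be sketched as a small classifier. The `Response` record, the `classify_error` function, and the idea of comparing against a known expected body are assumptions for illustration; the one-second threshold mirrors the example in the quoted text.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Response:
    """An observed response (hypothetical record shape)."""
    status: int
    body: str
    latency_seconds: float


def classify_error(resp: Response, expected_body: str,
                   latency_slo_seconds: float = 1.0) -> Optional[str]:
    """Return the kind of failure, or None if the request counts as a success."""
    if resp.status >= 500:
        return "explicit"   # the protocol itself reports failure
    if resp.body != expected_body:
        return "implicit"   # success code, but wrong content
    if resp.latency_seconds > latency_slo_seconds:
        return "policy"     # correct, but slower than the committed time
    return None


if __name__ == "__main__":
    print(classify_error(Response(500, "", 0.01), "hello"))
    print(classify_error(Response(200, "goodbye", 0.01), "hello"))
    print(classify_error(Response(200, "hello", 1.5), "hello"))
    print(classify_error(Response(200, "hello", 0.2), "hello"))
```

Only the first check is something a load balancer could do on its own; the content check needs an end-to-end test that knows what the right answer is.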

Saturation

How "full" your service is. A measure of your system fraction, emphasizing the resources that are most constrained (e.g., in a memory-constrained system, show memory; in an I/O-constrained system, show I/O). Note that many systems degrade in performance before they achieve 100% utilization, so having a utilization target is essential.

In complex systems, saturation can be supplemented with higher-level load measurement: can your service properly handle double the traffic, handle only 10% more traffic, or handle even less traffic than it currently receives? For very simple services that have no parameters that alter the complexity of the request (e.g., "Give me a nonce" or "I need a globally unique monotonic integer") that rarely change configuration, a static value from a load test might be adequate. As discussed in the previous paragraph, however, most services need to use indirect signals like CPU utilization or network bandwidth that have a known upper bound. Latency increases are often a leading indicator of saturation. Measuring your 99th percentile response time over some small window (e.g., one minute) can give a very early signal of saturation.

Finally, saturation is also concerned with predictions of impending saturation, such as "It looks like your database will fill its hard drive in 4 hours."
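
Two of the saturation signals mentioned above lend themselves to a short sketch: a 99th percentile over a window of response times, and a prediction of when a disk will fill. The function names are hypothetical, the percentile uses the simple nearest-rank definition, and the fill prediction assumes constant linear growth, which real workloads may not follow.

```python
import math


def percentile(values, pct: int):
    """Nearest-rank percentile of `values` for an integer `pct` in 1..100."""
    if not values:
        raise ValueError("percentile of an empty window is undefined")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def hours_until_full(capacity_bytes: float, used_bytes: float,
                     growth_bytes_per_hour: float):
    """Linear estimate of hours until a volume fills, or None if not growing."""
    if growth_bytes_per_hour <= 0:
        return None
    return (capacity_bytes - used_bytes) / growth_bytes_per_hour


if __name__ == "__main__":
    # Response times (ms) collected over an assumed one-minute window.
    window_ms = list(range(1, 101))
    print("p99:", percentile(window_ms, 99))
    print("hours until full:", hours_until_full(100.0, 60.0, 10.0))
```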

If you measure all four golden signals and page a human when one signal is problematic (or, in the case of saturation, nearly problematic), your service will be at least decently covered by monitoring.
