문제의 시작
분산 추적에서 가장 답답한 순간은 로그가 많지 않은 때가 아니라, 로그는 많은데 하나의 요청으로 이어지지 않을 때다. Async 작업은 thread가 바뀌면서 context가 끊기기 쉽고, 이때 traceId가 사라지면 장애 분석이 급격히 어려워진다. 이 글은 Spring Boot async 흐름에서 traceId를 어떻게 전파했는지 정리한 기록이다.
Spring Boot service에서 서버 로직 오류가 발생했을 때, 관련 로그를 가장 빠르게 찾는 방법 중 하나가 traceId 검색이다. 예를 들어 AWS CloudWatch log group에서 "${traceId}"로 검색할 수 있다.
그래서 traceId가 일관되게 남는 것이 중요하다. single thread 동기 작업에서는 크게 어렵지 않다.
하지만 비동기 작업에서는 실행 흐름이 다른 thread로 넘어갈 수 있다. 이때 원래 thread의 traceId가 worker thread로 전파되지 않으면, worker thread에서 문제가 발생했을 때 같은 요청 흐름을 따라가기 어려워진다.
Spring Boot 2.x에서는 Spring Sleuth가 traceId propagation을 자연스럽게 처리했다. 하지만 Spring Boot 3.x로 넘어오면서 Sleuth가 deprecated 되었고, tracing 경로가 Micrometer 쪽으로 이동했다. Micrometer는 metrics/tracing integration에서 주로 사용되지만, 이전에 기대하던 async traceId propagation이 기본으로 동작하지 않았다. 기존 프로젝트 로직을 3.x로 이관할 때 이런 작은 차이가 계속 이슈가 된다.
구현하면서 확인한 흐름
implementation("io.micrometer", "micrometer-observation")
implementation("io.micrometer", "micrometer-tracing")
implementation("io.micrometer", "micrometer-tracing-bridge-brave")
implementation("io.micrometer", "context-propagation")
tracing과 context propagation 관련 dependency를 추가한다. context propagation 프로젝트는 https://github.com/micrometer-metrics/context-propagation에 공개되어 있다.
@Configuration
class AsynchronousConfiguration: AsyncConfigurer {
@Bean
fun threadPoolTaskExecutor(): ThreadPoolTaskExecutor = ThreadPoolTaskExecutor().apply {
corePoolSize = CORE_POOL_SIZE
maxPoolSize = MAX_POOL_SIZE
queueCapacity = QUEUE_CAPACITY
setThreadNamePrefix("async-")
}
@Bean(ASYNC_TASK_EXECUTOR_BEAN_NAME)
fun asyncTaskExecutor(delegate: ThreadPoolTaskExecutor): DelegatingSecurityContextAsyncTaskExecutor {
return DelegatingSecurityContextAsyncTaskExecutor(delegate)
}
companion object {
const val ASYNC_TASK_EXECUTOR_BEAN_NAME = "asyncTaskExecutor"
const val CORE_POOL_SIZE = 10
const val MAX_POOL_SIZE = 50
const val QUEUE_CAPACITY = 50
}
}
일반적인 asynchronous configuration class를 가정하자. threadPoolTaskExecutor를 bean으로 선언해 core size, max size, queue capacity 같은 thread pool 설정을 잡는다.
이후 비동기 method에서는 이 executor 이름을 참조하면 된다.
비동기 thread 동작을 보기 위해 작은 service method를 만들었다.
하나의 async method가 다른 async method를 호출하도록 만들어, parent/child 형태의 실행 흐름을 관찰할 수 있게 했다.
@Async(AsynchronousConfiguration.ASYNC_TASK_EXECUTOR_BEAN_NAME)
fun startParentTask(traceId: String) {
val parentThreadName = Thread.currentThread().name
log.info("Parent task started with traceId: $traceId on thread: $parentThreadName")
childService.startChildTask(traceId)
Thread.sleep(1000)
log.info("Parent task completed with traceId: $traceId on thread: $parentThreadName")
}
우선 부모 메소드이다. 로그에서 thread tracing 을 위해 현재 thread 이름과 함께 임의로 생성된 traceId 를 인자로 넘겨받아 slf4j logging 을 한다.
Thread 에 sleep 을 걸어 로직이 돌아가는 시간을 임의로 강제해본다.
@Async(AsynchronousConfiguration.ASYNC_TASK_EXECUTOR_BEAN_NAME)
fun startChildTask(traceId: String) {
val childThreadName = Thread.currentThread().name
val traceId = Thread.currentThread().stackTrace
log.info("Child task started with traceId: $traceId on thread: $childThreadName")
Thread.sleep(1000)
log.info("Child task completed with traceId: $traceId on thread: $childThreadName")
}
child method도 같은 형태로 구현한다.
@GetMapping("/test-trace")
fun testTraceId(): String {
val traceId = java.util.UUID.randomUUID().toString()
log.info("Starting asynchronous job with traceId: $traceId")
parentService.startParentTask(traceId)
return "Asynchronous job started with traceId: $traceId"
}
테스트용 controller다. 여러 request를 빠르게 보내며 확인할 수 있도록 entry point에서도 로그를 남긴다. traceId는 여기에서 무작위로 생성한다.
logging:
pattern:
dateformat: "yyyy-MM-dd HH:mm:ss.SSS"
level: "%5p [${spring.application.name:},%X{traceId:-},%X{spanId:-}]"
traceId와 spanId가 모든 로그 라인에 보이도록 log pattern을 설정한다.
이제 propagation을 위해 TaskDecorator bean을 만든다.
@Bean
fun taskDecorator(): TaskDecorator {
return TaskDecorator { runnable: Runnable? ->
ContextSnapshot.captureAll(*arrayOfNulls(0)).wrap(
runnable!!,
)
}
}
https://blog.gangnamunni.com/post/mdc-context-task-decorator/ 좀 예전글이긴 한데 비슷한 경우에서 이슈를 해결한 강남언니 기술블로그 글이다.
https://docs.spring.io/spring-framework/docs/current/javadoc-api/org/springframework/core/task/TaskDecorator.html 공식 문서를 보면 TaskDecorator는 task 실행 전에 wrapper를 적용할 수 있는 hook이다. monitoring, metrics, context propagation 같은 용도로 쓰기 좋다.
decorator를 만들었으니 앞에서 만든 threadPoolTaskExecutor에 연결한다.
@Bean
fun threadPoolTaskExecutor(): ThreadPoolTaskExecutor = ThreadPoolTaskExecutor().apply {
corePoolSize = CORE_POOL_SIZE
maxPoolSize = MAX_POOL_SIZE
queueCapacity = QUEUE_CAPACITY
setThreadNamePrefix("async-")
setTaskDecorator(taskDecorator()) << 이거 추가
}
여러 API request를 보내면서 trace context가 유지되는지 확인한다.
2024-03-09 19:51:03.182 INFO [,65ec3f17e6d38686445bf9e99014b824,445bf9e99014b824] 25486 --- [nio-8080-exec-1] kr.projects.trace.TraceController : Starting asynchronous job with traceId: 3f0f0f1e-c0b3-4cef-99e3-7cbfd809e164
2024-03-09 19:51:03.183 INFO [,65ec3f17e6d38686445bf9e99014b824,445bf9e99014b824] 25486 --- [nio-8080-exec-1] kr.projects.trace.ParentService : Parent task started with traceId: 3f0f0f1e-c0b3-4cef-99e3-7cbfd809e164 on thread: http-nio-8080-exec-1
2024-03-09 19:51:03.183 INFO [,65ec3f17e6d38686445bf9e99014b824,445bf9e99014b824] 25486 --- [nio-8080-exec-1] kr.projects.trace.ChildService : Child task started with traceId: [Ljava.lang.StackTraceElement;@6223d89e on thread: http-nio-8080-exec-1
2024-03-09 19:51:04.189 INFO [,65ec3f17e6d38686445bf9e99014b824,445bf9e99014b824] 25486 --- [nio-8080-exec-1] kr.projects.trace.ChildService : Child task completed with traceId: [Ljava.lang.StackTraceElement;@6223d89e on thread: http-nio-8080-exec-1
2024-03-09 19:51:05.191 INFO [,65ec3f17e6d38686445bf9e99014b824,445bf9e99014b824] 25486 --- [nio-8080-exec-1] kr.projects.trace.ParentService : Parent task completed with traceId: 3f0f0f1e-c0b3-4cef-99e3-7cbfd809e164 on thread: http-nio-8080-exec-1
2024-03-09 19:51:44.421 INFO [,65ec3f40987e7c543582282a039a134d,3582282a039a134d] 25486 --- [nio-8080-exec-2] kr.projects.trace.TraceController : Starting asynchronous job with traceId: dfebee98-f8e6-450e-be36-bd9703a46923
2024-03-09 19:51:44.434 INFO [,65ec3f40987e7c543582282a039a134d,3582282a039a134d] 25486 --- [nio-8080-exec-2] kr.projects.trace.ParentService : Parent task started with traceId: dfebee98-f8e6-450e-be36-bd9703a46923 on thread: http-nio-8080-exec-2
2024-03-09 19:51:44.436 INFO [,65ec3f40987e7c543582282a039a134d,3582282a039a134d] 25486 --- [nio-8080-exec-2] kr.projects.trace.ChildService : Child task started with traceId: [Ljava.lang.StackTraceElement;@7cf9c5a5 on thread: http-nio-8080-exec-2
2024-03-09 19:51:44.620 INFO [,65ec3f405ae8bf117fd3ec3d607efd2d,7fd3ec3d607efd2d] 25486 --- [nio-8080-exec-3] kr.projects.trace.TraceController : Starting asynchronous job with traceId: 3e7b5aaf-0ee9-477b-9db9-0a8d98746945
2024-03-09 19:51:44.637 INFO [,65ec3f405ae8bf117fd3ec3d607efd2d,7fd3ec3d607efd2d] 25486 --- [nio-8080-exec-3] kr.projects.trace.ParentService : Parent task started with traceId: 3e7b5aaf-0ee9-477b-9db9-0a8d98746945 on thread: http-nio-8080-exec-3
2024-03-09 19:51:44.637 INFO [,65ec3f405ae8bf117fd3ec3d607efd2d,7fd3ec3d607efd2d] 25486 --- [nio-8080-exec-3] kr.projects.trace.ChildService : Child task started with traceId: [Ljava.lang.StackTraceElement;@7ebc7ff2 on thread: http-nio-8080-exec-3
2024-03-09 19:51:44.824 INFO [,65ec3f4064b2ba57b991da1e9a58f904,b991da1e9a58f904] 25486 --- [nio-8080-exec-4] kr.projects.trace.TraceController : Starting asynchronous job with traceId: db6dee41-5500-4f49-a902-d4ed52dfdf1a
2024-03-09 19:51:44.825 INFO [,65ec3f4064b2ba57b991da1e9a58f904,b991da1e9a58f904] 25486 --- [nio-8080-exec-4] kr.projects.trace.ParentService : Parent task started with traceId: db6dee41-5500-4f49-a902-d4ed52dfdf1a on thread: http-nio-8080-exec-4
2024-03-09 19:51:44.825 INFO [,65ec3f4064b2ba57b991da1e9a58f904,b991da1e9a58f904] 25486 --- [nio-8080-exec-4] kr.projects.trace.ChildService : Child task started with traceId: [Ljava.lang.StackTraceElement;@2c146429 on thread: http-nio-8080-exec-4
2024-03-09 19:51:45.023 INFO [,65ec3f4105c42d6e1380816172851f52,1380816172851f52] 25486 --- [nio-8080-exec-5] kr.projects.trace.TraceController : Starting asynchronous job with traceId: 08d2f0c0-b588-4a84-a65c-8d6c1a06993f
2024-03-09 19:51:45.024 INFO [,65ec3f4105c42d6e1380816172851f52,1380816172851f52] 25486 --- [nio-8080-exec-5] kr.projects.trace.ParentService : Parent task started with traceId: 08d2f0c0-b588-4a84-a65c-8d6c1a06993f on thread: http-nio-8080-exec-5
2024-03-09 19:51:45.024 INFO [,65ec3f4105c42d6e1380816172851f52,1380816172851f52] 25486 --- [nio-8080-exec-5] kr.projects.trace.ChildService : Child task started with traceId: [Ljava.lang.StackTraceElement;@27562cf1 on thread: http-nio-8080-exec-5
2024-03-09 19:51:45.220 INFO [,65ec3f41b88d12392f0aed8c54e60a0a,2f0aed8c54e60a0a] 25486 --- [nio-8080-exec-6] kr.projects.trace.TraceController : Starting asynchronous job with traceId: 26adfe5e-2730-4a31-97eb-49c0eacb9a28
2024-03-09 19:51:45.220 INFO [,65ec3f41b88d12392f0aed8c54e60a0a,2f0aed8c54e60a0a] 25486 --- [nio-8080-exec-6] kr.projects.trace.ParentService : Parent task started with traceId: 26adfe5e-2730-4a31-97eb-49c0eacb9a28 on thread: http-nio-8080-exec-6
2024-03-09 19:51:45.220 INFO [,65ec3f41b88d12392f0aed8c54e60a0a,2f0aed8c54e60a0a] 25486 --- [nio-8080-exec-6] kr.projects.trace.ChildService : Child task started with traceId: [Ljava.lang.StackTraceElement;@6b7c68eb on thread: http-nio-8080-exec-6
2024-03-09 19:51:45.438 INFO [,65ec3f40987e7c543582282a039a134d,3582282a039a134d] 25486 --- [nio-8080-exec-2] kr.projects.trace.ChildService : Child task completed with traceId: [Ljava.lang.StackTraceElement;@7cf9c5a5 on thread: http-nio-8080-exec-2
2024-03-09 19:51:45.638 INFO [,65ec3f405ae8bf117fd3ec3d607efd2d,7fd3ec3d607efd2d] 25486 --- [nio-8080-exec-3] kr.projects.trace.ChildService : Child task completed with traceId: [Ljava.lang.StackTraceElement;@7ebc7ff2 on thread: http-nio-8080-exec-3
2024-03-09 19:51:45.829 INFO [,65ec3f4064b2ba57b991da1e9a58f904,b991da1e9a58f904] 25486 --- [nio-8080-exec-4] kr.projects.trace.ChildService : Child task completed with traceId: [Ljava.lang.StackTraceElement;@2c146429 on thread: http-nio-8080-exec-4
2024-03-09 19:51:46.026 INFO [,65ec3f4105c42d6e1380816172851f52,1380816172851f52] 25486 --- [nio-8080-exec-5] kr.projects.trace.ChildService : Child task completed with traceId: [Ljava.lang.StackTraceElement;@27562cf1 on thread: http-nio-8080-exec-5
2024-03-09 19:51:46.224 INFO [,65ec3f41b88d12392f0aed8c54e60a0a,2f0aed8c54e60a0a] 25486 --- [nio-8080-exec-6] kr.projects.trace.ChildService : Child task completed with traceId: [Ljava.lang.StackTraceElement;@6b7c68eb on thread: http-nio-8080-exec-6
2024-03-09 19:51:46.442 INFO [,65ec3f40987e7c543582282a039a134d,3582282a039a134d] 25486 --- [nio-8080-exec-2] kr.projects.trace.ParentService : Parent task completed with traceId: dfebee98-f8e6-450e-be36-bd9703a46923 on thread: http-nio-8080-exec-2
2024-03-09 19:51:46.639 INFO [,65ec3f405ae8bf117fd3ec3d607efd2d,7fd3ec3d607efd2d] 25486 --- [nio-8080-exec-3] kr.projects.trace.ParentService : Parent task completed with traceId: 3e7b5aaf-0ee9-477b-9db9-0a8d98746945 on thread: http-nio-8080-exec-3
2024-03-09 19:51:46.830 INFO [,65ec3f4064b2ba57b991da1e9a58f904,b991da1e9a58f904] 25486 --- [nio-8080-exec-4] kr.projects.trace.ParentService : Parent task completed with traceId: db6dee41-5500-4f49-a902-d4ed52dfdf1a on thread: http-nio-8080-exec-4
2024-03-09 19:51:47.032 INFO [,65ec3f4105c42d6e1380816172851f52,1380816172851f52] 25486 --- [nio-8080-exec-5] kr.projects.trace.ParentService : Parent task completed with traceId: 08d2f0c0-b588-4a84-a65c-8d6c1a06993f on thread: http-nio-8080-exec-5
2024-03-09 19:51:47.226 INFO [,65ec3f41b88d12392f0aed8c54e60a0a,2f0aed8c54e60a0a] 25486 --- [nio-8080-exec-6] kr.projects.trace.ParentService : Parent task completed with traceId: 26adfe5e-2730-4a31-97eb-49c0eacb9a28 on thread: http-nio-8080-exec-6
운영 추적 기준
TraceId 전파는 logging을 예쁘게 만드는 문제가 아니라 운영에서 요청을 복원하는 능력이다. Thread boundary가 생기는 순간 context propagation을 명시적으로 다뤄야 하고, migration 과정에서는 기존 tracing library가 제공하던 자동 동작이 사라졌는지 반드시 확인해야 한다.
매 줄 INFO 뒤 대괄호 속 두번째, 세번째 값이 각각 traceId 와 spanId 인데, 로직이 돌아갈 때 thread 가 변경되었음에도 기존의 traceId 가 잘 전파됨을 확인할 수 있다.