
How Microsoft Vaporized a Trillion Dollars

A Complete Analysis of an Azure Insider's Six-Part Exposé

Original source: Substack, "isolveproblems" (six installments, serialized in 2025)
Compiled: April 2026 (reflecting the latest search results)
What it is: a first-hand insider account by a former senior engineer on Microsoft's Azure Core team


Introduction: Why This Document Matters

This is not just another corporate scandal story. It is a meticulous insider report tracing how the cloud infrastructure used by hundreds of millions of people worldwide, and relied on by strategic customers such as OpenAI and the US Department of Defense, became structurally fragile. The author, a senior engineer who came out of Microsoft's kernel team and Azure Core, recounts the events he experienced firsthand, in chronological order.

The report's conclusion is simple: Azure failed to build what it promised. As a result, OpenAI left, the US Department of Defense declared a "breach of trust," and Microsoft's market capitalization shed more than a trillion dollars from its 2025 peak.


Part 1: First-Day Shock, an Organization That Plans the Impossible

The author and his return

On May 1, 2023, the author rejoined Azure Core as a senior engineer on the Overlake R&D team, the group that builds the Azure Boost offload card and network accelerator. The job was not his first contact with the platform. A user of the service since Windows Azure first launched in 2010, he had served on Microsoft's Windows team as a kernel engineer, where he helped design, and patent, the technologies underpinning the Windows Container platform: the Server and Application Silos (code-named Helium and Argon) and the container plumbing beneath Docker, Azure Kubernetes, Azure Container Instances, and Windows Sandbox. In 2020-2021 he took part in the early brainstorming for the Overlake card, drafting the original Host OS ↔ accelerator card communication protocol back when all that existed was a debugger's serial connection.

Day one, first shock

He skipped new-employee orientation and joined the team's monthly planning meeting at 10 a.m. The room was packed with the dev manager, lead engineers, architects, and principal and senior engineers, and the headcount grew further with those connected by video call. The screen showed a diagram crammed with the acronyms of Windows internals: COM, WMI, perf counters, VHDX, NTFS, ETW, and more.

The presenter was a Principal Group Engineering Manager. The gist of his plan: port the entire Azure node management stack, today a pile of Windows user-mode and kernel components, to the Overlake accelerator card.

The author raised his hand at once. "Are you planning to port those Windows features to Overlake?" The answer was yes; at minimum, "a couple of junior devs" could be assigned to look into it.

The author already knew the Overlake card's hardware specs well. This fanless, fingernail-sized, Linux-running ARM SoC had a power budget that was a tiny fraction of a regular server CPU's TDP, and the FPGA had only 4 KB of dual-ported memory to spare for his doorbell shared-memory communication protocol. Porting half of Windows onto that tiny chip was, in the author's phrase, no different from Elon Musk proposing to terraform Mars by nuking the poles to thaw an atmosphere into existence.
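The scale of that constraint is easier to feel in code. Below is a hypothetical C++ sketch of a doorbell-style shared-memory channel squeezed into a 4 KB dual-ported window. Every name and field here is an assumption for illustration, since the actual Overlake protocol is not public; the point is how little fits through the keyhole.

    #include <atomic>
    #include <cstdint>
    #include <cstring>

    // Hypothetical layout of the 4 KB dual-ported window shared by the
    // host OS and the accelerator card. All names are illustrative; the
    // real Overlake protocol is not public.
    struct alignas(4096) DoorbellWindow {
        std::atomic<uint32_t> head;      // advanced by the producer (host)
        std::atomic<uint32_t> tail;      // advanced by the consumer (card)
        std::atomic<uint32_t> doorbell;  // incremented to wake the peer
        uint32_t reserved;
        uint8_t ring[4096 - 4 * sizeof(uint32_t)];  // everything else
    };
    static_assert(sizeof(DoorbellWindow) == 4096, "must fit the FPGA window");

    // Posts one length-prefixed message and rings the doorbell. With under
    // 4 KB of payload space, backpressure is the normal case, which is the
    // point: nothing Windows-sized fits through a channel like this.
    bool post_message(DoorbellWindow& w, const void* msg, uint32_t len) {
        const uint32_t cap  = sizeof(w.ring);
        const uint32_t head = w.head.load(std::memory_order_relaxed);
        const uint32_t tail = w.tail.load(std::memory_order_acquire);
        const uint32_t need = len + sizeof(uint32_t);
        const uint32_t off  = head % cap;
        // Sketch simplification: only accept records that fit contiguously
        // (a real ring would pad to the end of the buffer and wrap).
        if (head - tail + need > cap || off + need > cap) return false;
        std::memcpy(&w.ring[off], &len, sizeof(len));        // length prefix
        std::memcpy(&w.ring[off + sizeof(len)], msg, len);   // payload
        w.head.store(head + need, std::memory_order_release);
        w.doorbell.fetch_add(1, std::memory_order_release);  // notify card
        return true;
    }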

The mystery of the 173 agents

A few days later the author spent more than 90 minutes in person with the head of the Linux System Group, the organization that delivers Mariner Linux (now Azure Linux) and the trimmed-down distro on the Overlake card. The number he confirmed there was 173: no fewer than 173 agents were running to manage an Azure node.

More shocking still, no one inside Microsoft could clearly explain what each of those 173 agents did, how they interacted with one another, or why they existed at all. Azure's core product is, in the end, VMs, networking, and storage; add observability and servicing and that should suffice. SQL, Kubernetes, and AI workloads all build on top of that. Yet nobody could explain why managing a single node took 173 agents.

The author offers this as a symbol of how far the organization had drifted from reality. He also leaves readers with a heavy question: this sprawling, uncontrolled pile of software is what actually orchestrates Anthropic's Claude, OpenAI's APIs, SharePoint Online, and the government clouds. One grain of sand in that fragile pileup could set off a total collapse, with consequences reaching national security and the very survival of Microsoft's business.


Part 2: Anatomy of the Organization, How It Got This Way

Azure's origins and Dave Cutler's legacy

The Azure story goes back to 2006. When Amazon launched S3 and EC2 and seized the cloud market, Microsoft felt the threat. The development project, code-named "Red Dog," began with just five or six elite engineers, led by Dave Cutler, the legendary engineer behind VMS and Windows NT.

Cutler's concept for the Azure Fabric Controller was a system that handled node placement, provisioning, updating, patching, capacity, load balancing, and scale-out entirely on its own, "without any operational intervention." In a 2009 ZDNet interview, Cutler said the team was very conservative and would not talk about features until they were 100% operational and solidly debugged. A mere 48 weeks after that interview, in February 2010, Azure shipped to the general public, rushed out under fierce competitive pressure.

The exodus of the core talent, and what followed

After the project shipped, the key developers left one by one. Microsoft's chief preoccupations at the time were still PCs, tablets, and smartphones. While Windows 8, Windows Phone, and the Xbox One soaked up the company's energy, the Azure team had to fill the vacuum by other means.

The decisive turn came shortly after Satya Nadella became CEO in 2014, when he abolished the dedicated SDET (Software Development Engineer in Test) role. Large layoffs followed, and the hundreds of testers who remained were retrained. Some became data engineers on Windows 10 telemetry, some moved into software engineering roles (often accepting a down-level), and others were absorbed into Azure OPEX, the operations teams. OPEX's mission, put bluntly, is firefighting: 24/7 on-call rotations, incident response, post-mortems, and quick mitigation scripts were their daily life.

Then, in 2018, Nadella reoriented the company around Cloud + AI and put Scott Guthrie in charge. Overnight, Azure became Microsoft's most important business. The problem was that the people had not changed. Former testers from the OPEX teams, staff with limited system-design experience, were now running core operations for one of the world's largest clouds.

The organization in 2023

When the author returned in 2023, roughly half of the organization responsible for Compute Node Services consisted of junior engineers with one or two years of experience. The Group Engineering Manager's background was in web performance (optimizing CSS for page-load times), and the dev manager had limited Windows experience.

The state of the software was bleaker still.

First, a flood of crashes. Millions of crashes occurred every month, most of them unattributed, because teams had never even registered ownership of their modules in the Azure Watson crash reporting system. Automated triage therefore caught almost nothing, and the monthly newsletters touted rosy quality metrics with no relation to the actual data.

Second, test coverage sat below 40%. Most engineers could not reliably build the software locally, and many did not know how to use a debugger; in 2024 the author had to write the team's first debugger how-to guide himself.

Third, every release introduced more new bugs than it fixed. Most rollouts ended in panicked rollbacks.

Fourth, resource leaks were rampant. Crashes leaked files, disks, even entire VMs. Weak error handling produced malformed artifacts such as VMs with missing disks. When a customer tried to decommission such a VM, the node software attempted to detach the non-existent disk, triggering hypervisor errors; the Azure team blamed Hyper-V, and pointless escalations repeatedly climbed to VP level.


Part 3: The Gap Between Cutler's Dream and Reality, the Truth About the "Digital Escorts"

The ideal of a no-touch cloud

One of Cutler's founding principles for Azure was that operators should never need to touch a physical node by hand; the automated Fabric Controller was to handle everything. As the 2009 interview shows, this philosophy was the bedrock of Azure's design.

Reality was the exact opposite.

The "digital escort" program exposed by ProPublica

In July 2025, the nonprofit investigative outlet ProPublica exposed a startling internal practice at Microsoft. In essence: to maintain government cloud systems, including those of the US Department of Defense, Microsoft had engineers located abroad, including in China, do the substantive work, while US-based staff holding security clearances served as "digital escorts" who copied, pasted, and executed the commands on their behalf.

The problem was that these escorts were often far less skilled technically than the foreign engineers they were supposed to supervise, making meaningful oversight impossible. As one escort told ProPublica: "We're trusting that what they're doing isn't malicious, but we really can't tell."

Microsoft had run the program for roughly a decade, dating back to the Obama administration, and used it to win federal cloud computing contracts worth billions of dollars. CVP-level executives approved the strategy, and one acknowledged that the digital escort strategy allowed the company to "go to market faster."

The Secretary of Defense declares a "breach of trust"

In August 2025, Secretary of Defense Pete Hegseth announced forceful measures in a video statement: Chinese nationals would no longer maintain Department of Defense cloud environments, and the Department had sent Microsoft a formal letter of concern documenting a "breach of trust." At the same time, the Department demanded a third-party audit to determine whether Chinese engineers had planted anything in the code.

From the author's vantage point, none of this was an accident. The chronic software instability he witnessed firsthand on the Overlake team and in Compute Node Services in 2023-2024 (millions of crashes, resource leaks, malformed VMs) is precisely what made manual intervention at this scale unavoidable. The software did not work well enough, so humans had to step in and fix it by hand.

The author publishes figures he tallied himself. In just over two months, from August 14 to October 26, 2024, the Outlook folder he created for JIT (Just-In-Time) access request messages collected 14,209 of them, roughly 200 per day. Once approved, a request grants the engineer 8 hours of direct access to physical nodes and the Fabric Controller; at the highest level, "RdmSecretsAdministrator," it includes the ability to manage secret keys.


Part 4: The Density Experiment and the WireServer Security Hole

The grim results of the VM density push

In the spring and summer of 2024, Azure launched a major effort to raise the number of VMs per node. The business logic was simple: squeezing more out of existing servers is far cheaper than building new data centers. On-premises Azure deployments had always been capped at 16 VMs per node, and the commercial cloud had been raised to 32. The target was 48 VMs (a 50% increase), with 64 as the longer-term goal.

The hypervisor itself can theoretically support 1,024 VMs. That production was stuck at 32 was itself evidence that the software stack could not handle more.

The outcome was predictable. As VM counts went up, crashes and incidents went up in exactly the same proportion: a 50% increase in density brought a 50% increase in crashes.

A study the author ran with the Core OS team showed the node agents hammering the hypervisor through its WMI user-mode interface at up to 10,000 calls per second during peak bursts. The Hyper-V team could not even tell which agents were responsible or why so many calls were being made, and the Azure team had no clear answer either. At this point the author concluded that the Overlake offload port could never succeed.

WireServer and IMDS: a "walking security liability"

The next thing the author dug into was the component called the Instance Metadata Service (IMDS). On the surface it looks like a service that, modeled on Amazon's EC2 equivalent, provides information to customer guest VMs; inside, it had serious structural problems.

The heart of the service, the "WireServer" web server, runs on the host OS, on the inside of the security boundary. VMs provide strong isolation between guest and host. But the host OS maps each VM's memory pages directly (on Windows, through the vmmem.exe processes), because operational tasks such as saving VM state require it. The inescapable corollary: if the host is compromised, an attacker can reach the full memory of every VM running on that node.

Yet a web server directly reachable from any guest VM was running on exactly that sensitive host OS. What the author found next was worse.

WireServer kept the data of multiple tenants mixed together, unencrypted, in its in-memory caches, a clear violation of hostile multi-tenancy security guidelines. In theory, an attacker could steal secrets, such as certificates, belonging to other tenants on the same node.

On top of that, misunderstood memory ownership rules caused cache entries to leak and, at times, entire caches to be lost. The WireServer web server alone was crashing 300,000 to 500,000 times per month across the fleet.

The team had introduced C++ exceptions into a codebase originally designed to be exception-free. Its coding guidelines flatly contradicted those of the larger organization, and the absence of long-running tests meant memory leaks went undetected. Technical debt had piled so deep that the team refused any code improvement at all: when the author submitted bug fixes and a refactoring based on smart pointers, they were rejected for fear of "breaking something."
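The rejected refactoring is standard C++ hygiene, so a small illustration helps show what was being declined. This is a hypothetical sketch, not WireServer code: the names are invented, and the point is only the ownership pattern that breaks when exceptions arrive in raw-pointer code.

    #include <memory>
    #include <stdexcept>
    #include <string>

    // Hypothetical stand-ins for cache plumbing; illustrative only.
    struct CacheEntry { std::string tenantId, payload; };

    CacheEntry* parse_entry(const std::string& raw) {
        return new CacheEntry{raw.substr(0, 8), raw};
    }
    void validate(const std::string& raw) {
        if (raw.empty()) throw std::invalid_argument("empty record");
    }
    void insert_entry(CacheEntry* e) { delete e; /* stand-in for the cache */ }

    // Before: raw owning pointers in a codebase that was exception-free by
    // design. Once new code starts throwing, anything between the `new`
    // and the hand-off leaks -- the "misunderstood ownership" failure mode.
    void add_entry_leaky(const std::string& raw) {
        CacheEntry* e = parse_entry(raw);  // this function owns e from here
        validate(raw);                     // if this throws, e leaks
        insert_entry(e);
    }

    // After: ownership is explicit, so a throw releases the entry
    // automatically. This is the shape of the refactoring described above.
    void add_entry_safe(const std::string& raw) {
        std::unique_ptr<CacheEntry> e{parse_entry(raw)};
        validate(raw);                     // throwing no longer leaks
        insert_entry(e.release());         // transfer ownership to the cache
    }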

The author labeled WireServer/IMDS a "walking security liability" and recommended removing the service from the nodes entirely, to be operated instead as a first-party cloud service. A prominent security architect (a VP-level figure and author of a popular book on threat modeling) agreed with the recommendation. Team leadership pushed back hard, and not long afterward the author was fired.


Part 5: Root Causes and a Prescription, the Structural Fix That Was Shunned

The butterfly effect: internal flaws become external crises

In Part 5 the author retraces the whole chain of cause and effect. Azure was rushed out from the start under fierce competitive pressure. Fundamental principles were quietly abandoned. "Defects get fixed by human hands" was elevated from a quiet accommodation into official strategy, and that strategy was leveraged to win major federal cloud contracts.

The consequences:

  • Manual intervention (JIT requests) became routine.
  • OPEX teams and support engineers accessed privileged parts of the system with ever-increasing frequency.
  • The government clouds alone saw hundreds of manual interventions per month; scaled across the far larger commercial fleet, the number is incomparably higher.
  • Nobody knows how the system would behave near capacity. If a major crisis drove many customers to demand sharply increased capacity at once, the result would likely border on disaster.

The author's prescription: componentization and gradual modernization

Until he was fired, the author proposed a series of structural remedies.

The central idea is gradual modernization through componentization. The principle is simple: pick a section of the existing code, build a small replacement component that is thoroughly tested and independently verified, delete the old section, and drop in a call to the new component. Repeat, and eventually the legacy components are reduced to skeletons calling into the new ones. The approach modernizes a running system with minimal disruption; the sketch below illustrates the pattern.
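A minimal C++ sketch of that loop, under assumed names (the report names no specific functions or files): a block of ad-hoc legacy logic shrinks to one call into a small component built and tested in isolation. The bulk-deletion theme echoes the "million files deletion problem" the author mentions below.

    #include <cstddef>
    #include <filesystem>
    #include <string>
    #include <system_error>
    #include <vector>

    // New, independently tested component (built and verified on its own).
    namespace components {
    class BulkFileDeleter {
    public:
        // Deletes the given paths, tolerating files that vanish midway.
        // Returns how many files were actually removed.
        size_t delete_all(const std::vector<std::string>& paths) {
            size_t removed = 0;
            for (const auto& p : paths) {
                std::error_code ec;                 // no exceptions on error
                if (std::filesystem::remove(p, ec)) ++removed;
            }
            return removed;
        }
    };
    }  // namespace components

    // Legacy agent code after the swap. The ad-hoc directory walking and
    // error handling that used to live here is deleted; what remains is a
    // skeleton calling the new component.
    size_t cleanup_leaked_files(const std::vector<std::string>& leaked) {
        return components::BulkFileDeleter{}.delete_all(leaked);
    }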

There were concrete pieces he had already begun building: a file bulk-deletion solution that runs reliably at cloud scale, an encrypting LRU cache that keeps tenant data separated, a cross-platform component model working on both Windows and Linux, and a new message bus system letting agents communicate freely across guest, host, and SoC boundaries. A sketch of the tenant-separating cache idea follows.
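Here is a minimal sketch of that cache, written to illustrate the report's description rather than reproduce the author's code. Every name is assumed, and the XOR "cipher" is a loudly labeled stand-in marking where a real per-tenant AEAD cipher (for example AES-GCM) would go.

    #include <cstddef>
    #include <cstdint>
    #include <list>
    #include <optional>
    #include <string>
    #include <unordered_map>

    // Sketch of an LRU cache that never mixes tenants and never stores
    // plaintext values. Illustrative only; seal()/open() are STAND-INS for
    // real authenticated encryption under a per-tenant key.
    class TenantLruCache {
    public:
        explicit TenantLruCache(size_t capacity) : cap_(capacity) {}

        void put(const std::string& tenant, const std::string& key,
                 const std::string& value) {
            const std::string full = tenant + '\0' + key;  // tenant-scoped key
            if (auto it = map_.find(full); it != map_.end()) {
                order_.erase(it->second.pos);              // replace in place
                map_.erase(it);
            }
            while (map_.size() >= cap_ && !order_.empty()) {
                map_.erase(order_.back());                 // evict LRU entry
                order_.pop_back();
            }
            order_.push_front(full);
            map_[full] = {order_.begin(), seal(tenant, value)};
        }

        std::optional<std::string> get(const std::string& tenant,
                                       const std::string& key) {
            auto it = map_.find(tenant + '\0' + key);
            if (it == map_.end()) return std::nullopt;     // no cross-tenant hits
            order_.splice(order_.begin(), order_, it->second.pos);  // mark MRU
            return open(tenant, it->second.ciphertext);
        }

    private:
        struct Entry {
            std::list<std::string>::iterator pos;
            std::string ciphertext;
        };

        // STAND-IN for per-tenant authenticated encryption. XOR is NOT
        // cryptography; it only marks where the real cipher would sit.
        std::string seal(const std::string& tenant, std::string v) const {
            const char k = static_cast<char>(tenant.size() | 1);
            for (auto& c : v) c = static_cast<char>(c ^ k);
            return v;
        }
        std::string open(const std::string& tenant, std::string v) const {
            return seal(tenant, std::move(v));             // XOR is symmetric
        }

        size_t cap_;
        std::list<std::string> order_;                     // MRU at front
        std::unordered_map<std::string, Entry> map_;
    };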

He also read the PM specs for OpenAI's bare-metal nodes and drew up new plans to meet their requirements, and he drafted the hardware extensions the Overlake card itself would need, sharing the drawings with a Technical Fellow.

But every one of these proposals ran into cynicism and pushback from Azure Core's middle management. To them, the componentization strategy must have looked like an unwelcome challenge to the familiar firefighting way of working.


Part 6: Letters to the CEO and the Board, Then Silence

A series of warnings to leadership

When internal persuasion hit its limits, the author began climbing the management ladder.

On November 19, 2024, he sent a detailed letter to the Executive Vice President of Cloud + AI. It laid out the technical findings in full, pointed to the leadership vacuum, and included concrete proposals for fixing the root causes.

On January 7, 2025, he sent a more concise summary to CEO Satya Nadella. It opened with the potential risks to national security and to the core business, then compressed the key problems in the Azure node stack and the organizational challenges. He added that he stood ready to lead a first-principles redesign of the Azure node management layer.

When no response came, he wrote to Microsoft's Board of Directors through the corporate secretary, noting the silence that met his earlier letters, attaching what he had sent the CEO, and observing that the early signs of OpenAI's departure could be read as the consequence of ignored advance warnings.

The result was total silence. From the EVP, from the CEO, from the Board: not a single line in reply, not a single phone call, not even an acknowledgment of receipt. That silence told the author another large story: the organization, in effect, has no channel for processing structural warnings raised from inside.

Ending: the prophecy fulfilled

What happened after the author's warnings were ignored bore out his analysis.

On March 10, 2025, OpenAI signed an $11.9 billion compute contract with CoreWeave. Sam Altman said that "advanced AI systems require reliable compute," words that read as a direct hit on Azure's reliability. Microsoft, at the time, was projecting confidence behind its right of first refusal (ROFR) with OpenAI; at Davos only weeks earlier, Nadella had said OpenAI could look elsewhere only if Microsoft could not deliver what it needed.

The ROFR became effectively meaningless. OpenAI added a $4 billion CoreWeave contract in May 2025 and stacked on another $6.5 billion in September, bringing the CoreWeave total to $22.4 billion. Over the same period it signed a long-term compute deal with Oracle valued at $300 billion and a $38 billion contract with AWS, completing a de facto shift to multicloud.

Microsoft's stock peaked in July 2025 at $555 (a market capitalization above $4 trillion), then began to slide, losing more than 35% over the following months. That amounts to the evaporation of over a trillion dollars in market value.

Microsoft carried out large rounds of layoffs in May and July 2025, eliminating roughly 15,000 jobs for the year. The author reads these as moves to offset, in the earnings reports, the immediate losses from the contracts defecting to CoreWeave.


Overall Analysis: The Structural Implications of This Story

1. Technical debt is invisible but deadly

The problems in the Azure node management stack were recognized internally for years, while the outside world heard only that everything was proceeding smoothly. From 2023 through 2025, Microsoft officially stated at major conferences that the VM management components had been ported to the Overlake/Azure Boost offload card and rewritten in Rust. By the author's account, however, as late as the end of 2024 none of the 64 key work items identified a year earlier had been completed, and roughly 60 had not even been started.

2. Workforce structure determines system quality

Abolishing the SDET role and the reshuffling that followed looked like cost savings in the short term, but over the long term it hollowed out the company's system-design capability. A test engineer is not simply someone who catches bugs; the role is to understand how the system behaves, find the boundary conditions, and independently verify engineering quality. When that role vanished, the quality assurance function itself collapsed.

3. The hidden cost of a go-to-market-faster strategy

The "digital escort" program was a creative workaround that let Microsoft enter the market quickly even where it could not readily meet standard security requirements. In the short term it worked: federal cloud contracts worth billions of dollars followed. In the long term it ended in the Department of Defense's formal "breach of trust" letter and a mandated third-party audit.

4. The absence of a warning system

The bitterest passage is the ending of Part 6. A technically accurate analysis, delivered with evidence, politely, and with concrete remedies, reached the CEO and the Board, and drew no reaction at all. This suggests that Microsoft lacks a structural mechanism for processing internal warnings. Even when someone spots a problem, if there is no path by which it reaches decision-makers and turns into action, the organization invites its own crisis.

5. The national security stakes of cloud reliability

A software bug on a single server is an IT problem. A software bug in the cloud that a nation's military systems, intelligence agencies, and critical infrastructure depend on is a national security problem. When Secretary Hegseth said this should never have been allowed to happen, it was not a mere rebuke over a contract breach; it was a fundamental question about protecting America's core digital infrastructure.


Current Status (2025-2026 Updates)

OpenAI completes its multicloud transition

As the search results show, over the course of 2025 OpenAI completed what amounts to a full diversification of its compute: $22.4 billion committed to CoreWeave, $300 billion to Oracle, and $38 billion to AWS, severing its sole dependence on Azure. Microsoft still holds an OpenAI contract of roughly $25 billion, but the once-exclusive status of the partnership is gone.

Microsoft's stock and market capitalization

Microsoft's stock fell roughly 35% from its July 2025 all-time high of $555 into early 2026, equivalent to the loss of more than a trillion dollars in market capitalization. This is, of course, the product of compounding factors, including fears of overheated AI investment, a broad correction in tech stocks, and market anxiety over the Iran war, but market watchers note that Microsoft has been conspicuously weaker than the other Mag7 names.

The limits of the Rust panacea

As the author recounts, Azure issued a blanket mandate that all new software across the organization be written in Rust. Early prototypes consequently pulled in close to a thousand third-party Rust crates, many of them unvetted transitive dependencies, a significant amplifier of software supply-chain risk.

After the ProPublica report

Following the July 2025 ProPublica report, Microsoft immediately cut off China-based engineers' access to Department of Defense cloud systems. Senator Tom Cotton, chairman of the Senate Intelligence Committee, demanded that the Department tighten its oversight of contractors. An independent audit is underway, and whether Chinese engineers actually planted anything in the code has not yet been determined.


Conclusion: The Hoofbeats Had Already Been Heard

One metaphor runs through the entire series: "If you hear hoofbeats, think horses, not zebras." Its meaning is simple: the simplest, most obvious explanation is usually the right one.

There is nothing mysterious about Azure's problems. A launch rushed under competitive pressure, the exodus of key talent, the collapse of the testing culture, a vacuum of technical leadership, stopgaps elevated into standard procedure, and an executive suite that answered the internal warnings to fix all of it with silence. That is the entire list of causes.

As the author writes in his closing paragraphs, the hoofbeats "had become audible far beyond the Azure Core buildings on the West Campus" in Redmond. Whether they were heard at the top remains unknown.

One thing is certain: those who heard them first, OpenAI, the US Department of Defense, and Wall Street's investors, have each already responded in their own way.


This document is based on the six-part series "How Microsoft Vaporized a Trillion Dollars," published on the Substack account "isolveproblems," supplemented with recent news search results from ProPublica, TechCrunch, DefenseScoop, Yahoo Finance, MarketWise, and other outlets. As of April 2026.


https://open.substack.com/pub/isolveproblems/p/how-microsoft-vaporized-a-trillion

How Microsoft Vaporized a Trillion Dollars

This is the first of a series of articles in which you will learn about what may be one of the silliest, most preventable, and most costly mishaps of the 21st century, where Microsoft all but lost OpenAI, its largest customer, and the trust of the US government.
I joined Azure Core on the dull Monday morning of May 1st, 2023, as a senior member of the Overlake R&D team, the folks behind the Azure Boost offload card and network accelerator.
I wasn’t new to Azure, having run what is likely the longest-running production subscription of this cloud service, which launched in February 2010 as Windows Azure.
I wasn’t new to Microsoft either, having been part of the Windows team since 1/1/2013 and later helped migrate SharePoint Online to Azure, before joining the Core OS team as a kernel engineer, where I notably helped improve the kernel and helped invent and deliver the Container platform that supports Docker, Azure Kubernetes, Azure Container Instances, Azure App Services, and Windows Sandbox, all shipping technologies that resulted in multiple granted patents.
Furthermore, I contributed to brainstorming the early Overlake cards in 2020-2021, drafting a proposal for a Host OS <-> Accelerator Card communication protocol and network stack, when all we had was a debugger’s serial connection. I also served as a Core OS specialist, helping Azure Core engineers diagnose deep OS issues.
I rejoined in 2023 as an Azure expert on day one, having contributed to the development of some of the technologies on which Azure relies and having used the platform for more than a decade, both outside and inside Microsoft at a global scale.
As a returning employee, I skipped the New Employee Orientation and had my Global Security invite for 12 noon to pick up my badge, but my future manager asked if I could come in earlier, as the team had their monthly planning meeting that morning.
I, of course, agreed and arrived a few minutes before 10 am at the entrance of the Studio X building, not far from The Commons on the West Campus in Redmond. A man showed up in the lobby and opened the door for me. I followed him to a meeting room through a labyrinth of corridors.
The room was chock-full, with more people on a live conference call. The dev manager, the leads, the architects, the principal and senior engineers shared the space with what appeared to be new hires and junior personnel.
The screen projected a slide where I recognized a number of familiar acronyms, like COM, WMI, perf counters, VHDX, NTFS, ETW, and a dozen others, mixed with new Azure-related ones, in an imbroglio of boxes linked by arrows.
I sat quietly at the back while a man was walking the room through a big porting plan of their current stack to the Overlake accelerator. As I listened, it was not immediately clear what that series of boxes with Windows user-mode and kernel components had to do with that plan.
After a few minutes, I risked a question: Are you planning to port those Windows features to Overlake? The answer was yes, or at least they were looking into it. The dev manager showed some doubt, and the man replied that they could at least “ask a couple of junior devs to look into it.”
The room remained silent for an instant. I had seen the hardware specs for the SoC on the Overlake card in my previous tenure: the RAM capacity and the power budget, which was just a tiny fraction of the TDP you can expect from a regular server CPU.
The hardware folks I had spoken with told me they could only spare 4KB of dual-ported memory on the FPGA for my doorbell shared-memory communication protocol.
Everything was nimble, efficient, and power-savvy, and the team I had joined 10 minutes earlier was seriously considering porting half of Windows to that tiny, fanless, Linux-running chip the size of a fingernail.
That felt like Elon talking about colonizing Mars: just nuke the poles then grow an atmosphere! Easier said than done, uh?
That entire 122-strong org was knee-deep in impossible ruminations involving porting Windows to Linux to support their existing VM management agents.
The man was a Principal Group Engineering Manager overseeing a chunk of the software running on each Azure node; his boss, a Partner Engineering Manager, was in the room with us, and they really contemplated porting Windows to Linux to support their current software.
At first, I questioned my understanding. Was that serious? The rest of the talk left no doubt: the plan was outlined, and the dev leads were tasked with contributing people to the effort. It was immediately clear to me that this plan would never succeed and that the org needed a lot of help.
That first hour in the new role left me with a mix of strange feelings, stupefaction, and incredulity.
The stack was hitting its scaling limits on a 400 Watt Xeon at just a few dozen VMs per node, I later learned, a far cry from the 1,024 VMs limit I knew the hypervisor was capable of, and was a noisy neighbor consuming so many resources that it was causing jitter observable from the customer VMs.
There is no dimension in the universe where this stack would fit on a tiny ARM SoC and scale up by many factors. It was not going to happen.
I have seen a lot in my decades of industry (and Microsoft) experience, but I had never seen an organization so far from reality. My day-one problem was therefore not to ramp up on new technology, but rather to convince an entire org, up to my skip-skip-level, that they were on a death march.
Somewhere, I knew it was going to be a fierce uphill battle. As you can imagine, it didn’t go well, as you will later learn.
I spent the next few days reading more about the plans, studying the current systems, and visiting old friends in Core OS, my alma mater. I was lost away from home in a bizarre territory where people made plans that didn’t make sense with the aplomb of a drunk LLM.
I notably spent more than 90 minutes chatting in person with the head of the Linux System Group, a solid scholar with a PhD from INRIA, who was among the folks who hired me on the kernel team years earlier.
His org is responsible for delivering Mariner Linux (now Azure Linux) and the trimmed-down distro running on the Overlake / Azure Boost card. He kindly answered all my questions, and I learned that they had identified 173 agents (one hundred seventy-three) as candidates for porting to Overlake.
I later researched this further and found that no one at Microsoft, not a single soul, could articulate why up to 173 agents were needed to manage an Azure node, what they all did, how they interacted with one another, what their feature set was, or even why they existed in the first place.
Azure sells VMs, networking, and storage at the core. Add observability and servicing, and you should be good. Everything else, SQL, K8s, AI workloads, and whatnot all build on VMs with xPU, networking, and storage, and the heavy lifting to make the magic happen is done by the good Core OS folks and the hypervisor.
How the Azure folks came up with 173 agents will probably remain a mystery, but it takes a serious amount of misunderstanding to get there, and this is also how disasters are built.
Now, fathom for a second that this pile of uncontrolled “stuff” is orchestrating the VMs running Anthropic’s Claude, what’s left of OpenAI’s APIs on Azure, SharePoint Online, the government clouds and other mission-critical infrastructure, and you’ll be close to understanding how a grain of sand in that fragile pileup can cause a global collapse, with serious National Security implications as well as potential business-ending consequences for Microsoft.
We are still far from the vaporized trillion in market cap, my letters to the CEO, to the Microsoft Board of Directors, and to the Cloud + AI EVP and their total silence, the quasi-loss of OpenAI, the breach of trust with the US government as publicly stated by the Secretary of Defense, the wasted engineering efforts, the Rust mandate, my stint on the OpenAI bare-metal team in Azure Core, the escort sessions from China and elsewhere, and the delayed features publicly implied as shipping since 2023, before the work even began.
If you’re running production workloads on Azure or relying on it for mission-critical systems, this story matters more than you think.

https://isolveproblems.substack.com/p/how-microsoft-vaporized-a-trillion-8f4

How Microsoft Vaporized a Trillion Dollars, Pt. 2

(Continued from Part 1)
What I discovered in the following weeks and months was a strained organization, exhausted by constant incidents, millions of unattended crashes in the Azure node management stack, conflicting coding standards, limited security awareness, weak testing practices, code freezes born of fear, unrealistic timelines, blame-shifting, and a noticeable gap in senior technical leadership.
Before diving deeper into each issue, it helps to understand how the team reached this point.
During my earlier tenure as a kernel engineer on the Windows Core OS team, I reported to one of the most talented operating system engineers I encountered at Microsoft.
He had decades of experience, stretching back to working with Dave Cutler on the Windows systems we know today. Among his contributions were the Server and Application Silos (code-named Helium and Argon), which form the foundation of the Windows Container platform.
He also worked on research operating systems such as Midori and Singularity, and was one of the original contributors to the Azure Fabric, the meta operating system that orchestrates Microsoft’s cloud infrastructure.
One day early in my time under him, he brought in some old team swag: a sweater emblazoned with 0xF0FFFF, the hex color of Azure.
From him and over the years, I learned not only about kernel design but also about Azure’s origins and the intense competitive pressure that shaped it.
Amazon had launched S3 and EC2 in 2006; Microsoft was late to the public cloud race and needed to move fast. The project, code-named Red Dog, began with a small team of just five or six elite engineers, led by Cutler.
The heavy lifting on the nodes falls to the hypervisor and modified host and guest OSes optimized for virtualized environments.
Nodes belong to clusters managed by the Fabric Controller, which handles resource inventory, VM placement, provisioning, servicing, load balancing, and scaling.
A set of agents, including the central RdAgent, reports back to the controller and orchestrates local resources on each node, as well as the creation of virtual machines.
Creating a VM is still fundamentally like ordering pizza (skipping some details): you choose from a menu of sizes and ingredients: 16 cores? Sure. 128 GB of memory? Done. 32 disks? No problem. Four NICs? GPU? You got it!
From there, the node software pilots the hypervisor to create the partition, attach the required devices, including a disk containing the boot image, and start the VM.
The project succeeded and shipped in February 2010 as Windows Azure. But as often happens with rushed, high-pressure efforts, many of the original core contributors eventually moved on.
At the time, Microsoft was still heavily focused on PCs, tablets, and phones. Teams were porting Windows to ARM, shipping Windows 8 and 8.1, and acquiring Nokia while reimagining Xbox One around Hyper-V under Cutler’s leadership.
Cloud was important but not yet central. OneDrive and SharePoint ran on separate infrastructure, and Azure remained a distant second to AWS.
Just months after Satya Nadella became CEO in February 2014, he canceled the dedicated SDET (Software Development Engineer in Test) role, triggering significant layoffs.
Due to Washington state WARN rules, Microsoft could not eliminate every tester position; hundreds remained.
Many of these testers, strong at execution but with limited experience in system design or deep software engineering, were retrained.
Some became data engineers focused on Windows 10 telemetry; others moved into software engineering roles (often down-leveled); and still others landed in lower-impact areas, including Azure OPEX, where they helped keep the lights on through on-call rotations and incident mitigation.
Fast forward, and large parts of Azure operations were being run by these former testers. Many were dedicated colleagues, but the shift left gaps in architectural depth for mission-critical systems.
OPEX teams exist to maintain production stability. Their work is grueling, with 24/7 on-call rotations, rapid mitigations, post-mortem analysis, and scripting fixes, leading to high attrition.
They typically do not design new software or own long-term bug fixes; instead, they file repair items for product teams and maintain a living knowledge base of incidents.
In 2018, Nadella repositioned the company around Cloud + AI and placed Scott Guthrie in charge.
Windows was reorganized under Azure, and overnight the existing Azure teams became central to Microsoft’s most strategic bet.
Most of the people stayed the same, save for a few high-profile transfers.
By the time I rejoined in 2023, roughly half the organization responsible for Compute Node Services consisted of junior engineers with only one or two years of experience.
The Group Engineering Manager’s background was in web performance (optimizing CSS for page load times), and the dev manager had limited Windows experience.
This group was now tasked with moving their inherited stack to the new Azure Boost accelerator environment, an effort Microsoft had publicly implied was well underway at Ignite conferences since 2023.
In reality, as the person responsible for the hypervisor-layer porting and reengineering, I knew the substantive work had barely begun.
The team had no clear starting point. The existing stack suffered from chronic crash-causing defects and memory leaks, leaving everyone firefighting.
Few engineers could reliably build the software locally; debugger usage was rare (I ended up writing the team's first how-to guide in 2024); and automated test coverage sat below 40%.
Every monthly release introduced more new defects than it fixed. Most rollouts were panicked rollbacks. Millions of crashes occurred each month, the majority unattributed because teams had never claimed ownership of their modules in the Azure Watson crash reporting system.
As a result, automated triage created few formal incidents, allowing monthly newsletters to tout glowing quality metrics unsupported by actual data.
The Core OS team often absorbed blame for issues originating in the Azure node software. Crashes frequently leaked resources: files, disks, even entire VMs.
Weak error handling led to malformed VMs (e.g., missing disks). When customers decommissioned them, the node software attempted to detach non-existent disks, triggering hypervisor errors.
The Azure team pointed fingers at Hyper-V, sparking escalations that reached VP level.
I once convened a high-stakes meeting with stakeholders from both sides; the Hyper-V leads were visibly frustrated by the repeated, misplaced blame.
Layered on this chaos was an Azure-wide mandate: all new software must be written in Rust. Some porting plans were abandoned, and many junior engineers grew excited by the new language.
Critical modules at the heart of Azure's node management, a critical part of the company's flagship Cloud + AI initiative, were sometimes designed by engineers with less than a year of tenure, under leads who lacked visibility into the details.
None of it shipped.
The VM management software continued to run and crash on Windows, despite repeated public statements from 2023 through 2025 claiming that key components had been offloaded to the Azure Boost accelerator and rewritten in Rust.
From my direct involvement, I know those claims did not reflect reality as late as the end of 2024. Of the 64 key work items identified a year earlier to reengineer the VM management stack for offload, none had been completed, and work had not even started on approximately 60 of them.
The list included foundational pieces such as a key-value store, tracing, logging, and observability infrastructure.
Worse, early prototypes already pulled in nearly a thousand third-party Rust crates, many of which were transitive dependencies and largely unvetted, posing potential supply-chain risks.
On top of all that, the org had a hard commitment to deliver the already long-delayed OpenAI bare-metal SKUs that had been promised for years. This work started around May 2024 with a target of Spring 2025 and was led by a Principal engineer who had evidently never tackled a task of that scale.
Fast-forward to March 10, 2025: OpenAI signed an $11.9 billion compute deal with CoreWeave for model training and services.
Sam Altman, OpenAI’s CEO, declared that “Advanced AI systems require reliable compute, and we’re excited to continue scaling with CoreWeave so we can train even more powerful models and offer great services to even more users” — words that landed as a pointed comment on Azure’s reliability and scalability.
This was significant because just weeks earlier at the World Economic Forum in Davos, Satya Nadella had highlighted Microsoft’s “ROFR” (right of first refusal) with OpenAI, stating that OpenAI would need to come to Microsoft first and could only look elsewhere if Microsoft could not deliver.
In September 2025, OpenAI—still technically under Microsoft’s ROFR—expanded its CoreWeave agreement by another $6.5 billion. Around the same period, OpenAI also committed to a massive, multi-year computing power deal with Oracle valued at $300 billion.
Microsoft, meanwhile, conducted major layoffs—approximately 15,000 roles across waves in May and July 2025—most likely to compensate for the immediate losses to CoreWeave ahead of the next earnings calls.
One can reasonably infer that Microsoft struggled to meet OpenAI’s demanding requirements on time and at scale. That outcome should come as no surprise after reading this series.

https://isolveproblems.substack.com/p/how-microsoft-vaporized-a-trillion-f67

How Microsoft Vaporized a Trillion Dollars, Pt. 3

(Continued from Part 2)
Circling back to the origins of Azure, Cutler’s intent was to produce a system with the same level of quality, unshakable reliability, and attention to detail he was famous for in his work on VMS and NT.
In a 2009 interview with ZDNET, he declared that the intent [for the Azure Fabric Controller] was that “it manages the placement, provisioning, updating, patching, capacity, load balancing, and scale out of nodes in the cloud all without any operational intervention.” (emphasis added)
From my years with one of the original contributors to the Fabric, I learned that touching the nodes by hand was also strictly off-limits: the original design intent was that Azure would operate without human intervention.
When discussing the discretion around Azure promises at the time, Cutler said, “The answer to this is simply that the RD group is very conservative and we are not anywhere close to being done.”
He further added that “[they] are taking each step slowly and attempting to have features 100% operational and solidly debugged before talking about them.”
That was on February 24, 2009. A mere 48 weeks later, Azure shipped for general consumption.
Fast forward to Summer 2025, and the Secretary of Defense, Pete Hegseth, publicly mentioned “a breach of trust” with Microsoft, following an article from ProPublica describing “digital escort sessions” conducted on Azure computers.
The article details how escort sessions involve specialized $18/hour employees who copy/paste and execute commands on government cloud nodes under direction from Microsoft support personnel, often based in foreign countries, including China.
However, direct node access and manual interventions are common daily practices that extend well beyond government clouds.
Cutler’s vision of a “no human touch” cloud service unfortunately never materialized, as the article mentions “hundreds of interactions” each month for the government clouds alone.
The article reveals that the program was devised at the highest levels of the company, with support from CVP-level contributors who declared that “the digital escort strategy allowed the company to ‘go to market faster,’ positioning it to win major federal cloud contracts.”
Azure shipped as an unfinished product under intense market pressure, and major corners were cut. Notably, routine manual intervention on the nodes was part of the strategy.
Marketing and competitive pressure often work in mysterious ways; however, the article does not explain why manual repairs were needed on the nodes.
The answer is now simple: the software didn't work as well as hoped, in large part because the system was rushed under intense pressure.
Cue the post-launch talent exodus, its replacement by people of very different experience levels, and you end up with a system that over-promises and under-delivers, drowning in unsolvable problems.
This gap between Cutler’s “no human touch” ideal and the reality of hundreds of monthly manual interventions wasn’t abstract for me.
In the Overlake team and Compute Node Services, the same underlying fragility I observed since day one, namely chronic crashes, resource leaks, malformed VMs, and a bloated agent ecosystem that no one could fully explain, created exactly the kind of instability that demanded constant human firefighting, including on sensitive government clouds.
What I encountered in 2023–2024 was not occasional edge cases, but a steady stream of symptoms from a system that had never been allowed to stabilize, despite the foundations, namely the hypervisor and Windows OS, being robust.
The manual escort sessions were, in many ways, the visible symptom of deeper architectural and process debt.
I began raising these issues internally, including through formal warnings that eventually reached the highest levels of the company.
On one particular occasion, a feature that had been baking for eleven months, intended to exchange secret encryption keys between some actor in the guest VMs and the host OS, generated two Sev-2 incidents within hours of being rolled out to general production.
It turned out that one of the agents was calling into another through an unknown endpoint, generating errors that were logged on both sides.
An infinite retry loop caused both agents to be busy logging errors, saturating the circular logs and reducing their horizon from the usual 2-3 days to about two hours.
This incident illustrates the lack of deep code ownership, overly complex inter-agent interactions, technical leadership gaps, and testing practices that allow major defects to reach production.
I distinctly remember asking the dev manager for permission to halt the worldwide rollout, and it took the teams the entire weekend and half of the following week to roll back the system to the previous version.
In another instance, it took three months, from January to March 2024, to run a file-deletion script across the fleet to clean up leaked files that had triggered a 100GB temporary files threshold on some nodes.
Systemic failures and limitations of the automated systems, internally known as “OaaS” and “Geneva Actions,” made a simple task daunting.
These incidents were emblematic of the daily reality for Azure OPEX teams: a constant flood of issues stemming directly from instabilities in the node software and in the surrounding support systems.
These were not isolated failures but part of a persistent pattern. The same poorly understood, interdependent agent ecosystem creates fragile chains that turn minor changes into production crises.
For Azure customers, those failures manifest mostly during commissioning or decommissioning large numbers of resources, or other operations involving the node management stack.
Nodes experiencing failures are placed in an “unhealthy” state, and user workloads are migrated to other physical machines so the faulty node can be repaired. This causes service interruptions, as VMs must be suspended and the gigabytes of memory they consume copied to another machine, where the VMs are “rehydrated.” These recovery operations themselves are not immune to errors.
Resource leaks, crashes, “rogue” and “zombie” VMs, and node health issues are generally accommodated during normal times, as Azure has some room to spare and personnel to help with recovery around the clock.
However, how the system would cope near capacity, for example, in case of crisis, is anyone’s guess. A “run on the bank” where a large number of customers suddenly require increased capacity is likely to end in a disaster.
As these issues accumulated, I began raising them more formally through my management chain, including through structured warnings that ultimately reached senior leadership and beyond.
I also mentioned potential security issues that I had discovered along the way.
The responses varied from acknowledgment to defensiveness, revealing how deeply the culture had adapted to operating in a state of perpetual firefighting rather than addressing root causes.
This tension came to a head with the Azure-wide Rust mandate, conflicting porting plans, and the parallel demands of high-visibility projects such as the long-delayed OpenAI bare-metal SKUs.
What started as technical disagreements quickly exposed larger strategic and cultural fractures within the organization.

https://isolveproblems.substack.com/p/how-microsoft-vaporized-a-trillion-2f5

How Microsoft Vaporized a Trillion Dollars, Pt. 4

(Continued from Part 3)
Azure has operated under constant strain for as long as I can remember.
Even during the periodic “quality pushes,” the backlog of issues never shrank; it only grew.
In the spring and summer of 2024, a major push began to raise the number of VMs each node could host.
The business case was straightforward: scaling up density on existing servers is far cheaper than building new data centers.
On-premise Azure deployments had always been capped at 16 VMs per node.
Microsoft’s own commercial clouds had run at 32 until that year, still a tiny fraction of the 1,024 the hypervisor itself could theoretically support.
The goal was a 50% increase to 48 VMs per node, with 64 as the longer-term target.
What should have been a matter of removing a few arbitrary software limits turned into a 50% increase in crashes and incidents. The problems scaled in exact proportion to the density.
Earlier, while I was still working on the hypervisor interface re-engineering plan for the bottom of the Azure node stack, I had run a study with the Core OS team that owned the other side of the Hypervisor API.
Call-trace data showed the node agents collectively hammering the hypervisor through its WMI user-mode interface at up to 10,000 calls per second during peak bursts.
The Hyper-V team had no visibility into which agents were responsible or why so many calls were necessary. On our side, no one could give a definitive answer either.
At that point, it became clear that the Overlake offload port would never happen.
Not only because of the dependencies I described earlier, but because of the sheer dynamic behavior of the stack.
The Hyper-V team had planned a cleaner, HCS-style interface with a gRPC frontend, but the Azure team, under tight timelines, decided to press ahead with the existing VM abstraction layer (VMAL) and keep calling through WMI on the host as a stopgap.
Even setting aside the Linux-port issues, the call volume made the plan impossible, even without factoring in the 50% and later 100% density increase expected to be layered on top.
These elements combined into what I came to see as an unsustainable stretch of work, a plan that lacked the necessary depth and visibility to succeed.
I stepped away from that part of the organization. The principal engineer who inherited the effort, a highly respected Windows veteran who had led the ARM32 port back in the Windows 8 era, lasted ten months before he, too, left the team.
The VM management stack never ran offloaded from the Overlake/Azure Boost SoC.
After stepping away from the VM density and offload work, I turned my attention to another foundational piece of the Azure node stack: the set of components the team called the “instance metadata services.”
The name was borrowed from Amazon’s EC2.
On Azure, it consists of a customer-facing web server (“WireServer”) running on each node’s host OS, together with supporting service components.
One of its endpoints is publicly documented and intended to provide information to guest VMs.
What stood out was that this web service runs on the host OS, the secure side of the machine.
Virtual machines are designed to provide strong isolation. A guest VM is a containment boundary: escaping it is difficult, and other VMs on the same node, as well as the host, share almost nothing with it. The VMs themselves act as security boundaries.
A less obvious fact is that the host OS is not isolated from the VMs in the same way.
The memory pages belonging to each VM partition are mapped into processes on the host. On Windows, these are the vmmem.exe processes.
This mapping is necessary for practical operations such as saving a VM’s state to disk, including its full memory contents.
The direct corollary is that any successful compromise of the host can give an attacker access to the complete memory of every VM running on that node. Keeping the host secure is therefore critical.
In that context, hosting a web service that is directly reachable from any guest VM and running it on the secure host side created a significantly larger attack surface than I expected.
In that same period, another team introduced the Metadata Security Protocol, which aims to enhance the security of Azure metadata services by adding HTTP headers that contain a hash-based message authentication code.
While this new protocol is a welcome addition to mitigate illegitimate requests, it does not address the core concern I had about an attack directed at the web server itself.
Many VM escape attacks exploit vulnerabilities in the virtual device drivers that sit halfway between the host and the VMs.
Running a web server on the host OS with unsecured endpoints exposed to guest VMs, whether signed or not, poses a greater security risk.
My recommendation was to remove WireServer and IMDS from the nodes entirely, a view shared without reservation by a VP security architect, author of a popular book about threat modeling, with whom I shared my concerns.
Upon further digging, I discovered that WireServer was maintaining in-memory caches containing unencrypted tenant data, all mixed in the same memory areas, in violation of all hostile multi-tenancy security guidelines.
It is conceivable that, with a little poking, an attacker could obtain data, including secrets such as certificates, belonging to other tenants on the node.
Moreover, the code was leaking cached entries and even entire caches due to misunderstood memory ownership rules, and suffered from a large number of crashes, in the order of 300,000 to 500,000 crashes per month for the WireServer web server alone across the fleet.
New code was throwing C++ exceptions in a codebase that was originally exception-free. The team had coding guidelines in direct contradiction of those of the larger organization, and their testing practices didn’t include long-running tests, so they missed memory leaks and other defects.
The team had reached a point where it was too risky to make any code refactoring or engineering improvements. I submitted several bug fixes and refactorings, notably using smart pointers, but they were rejected for fear of breaking something.
This further illustrates the pervasive gap in technical leadership throughout the organization.
I described the WireServer/IMDS subsystem running on each Azure node as a “walking security liability,” which should be moved out of the nodes, a view shared by many stakeholders outside the organization. The team’s plan for Overlake was to repeat the same thing under a different name, thereby exposing the Azure Boost SoC to any guest VM through a direct network connection.
These services should be hosted as first-party cloud services, with a credential/secrets cache inside each VM that needs it, containing only that VM’s secrets, encrypted with the help of a vTPM where applicable.
This arrangement would also have worked well in bare-metal scenarios as an opt-in package leveraging the physical TPM.
The org’s leadership responded with strong defensiveness and denial. Not long afterward, the organization terminated my employment.

https://isolveproblems.substack.com/p/how-microsoft-vaporized-a-trillion-841

How Microsoft Vaporized a Trillion Dollars, Pt. 5

(Continued from Part 4)
If you hear hoofbeats, think horses, not zebras.
Microsoft rushed Azure out of the gates under intense competitive pressure. Corners were cut. Fundamental principles of reliability and operational simplicity were quietly abandoned.
The company formalized the idea that defects could be fixed through human intervention on live production systems, all to accelerate time-to-market and secure major federal cloud contracts. As VP-level executives later admitted, the “digital escort strategy” helped the company “go to market faster.”
Instead of going back to the drawing board to tackle the growing technical debt, Microsoft relied on quick fixes: layers of automation running mitigation scripts, a growing team of on-call staff, and, when automation was not enough, manual repairs.
Public reports revealed hundreds of these interventions monthly on sensitive government clouds alone. In reality, across the much larger commercial fleet, the total number of interventions was significantly higher.
OPEX and support engineers accessing privileged parts of the system submit a Just-In-Time (JIT) request for approval, which is broadcast on a dedicated mailing list. Any full-time member of the organization can approve these requests. Once approved, the requester receives 8 hours of system access, during which they can interact with physical nodes and fabric controllers, and manage secrets when the requested access level is set to RdmSecretsAdministrator.
In just over two months, from August 14, 2024, to October 26, 2024, the Outlook folder I created to separate JIT requests from other messages collected 14,209 requests — nearly 200 per day.
What may have started as temporary workarounds became standard procedures, just part of doing business. Azure never operated as smoothly or independently as promised. What Microsoft presented to the world, and to its most demanding customers, was a sophisticated system perpetually on life support.
This foundational fragility, rooted in rushed decisions and wishful thinking about how fast the platform could grow and stabilize, led to small but ongoing disruptions. Over time, those disruptions built up.
The result was a classic butterfly effect: internal flaws in Azure node software quality, testing discipline, and architectural clarity spread outward, undermining the execution of high-visibility commitments.
By early 2025, OpenAI — still nominally under Microsoft’s right of first refusal — began aggressively diversifying its compute footprint.
The visible consequences quickly became evident: Wall Street grew doubtful despite record profits, and investor confidence sharply declined. From its peak in late October 2025, Microsoft’s stock dropped over 30% in the following months, wiping out more than a trillion dollars in market value.
The hoofbeats had been present all along.
Hindsight makes the better path clear: pause aggressive feature velocity, invest heavily in stabilizing the core node stack, simplify the agent ecosystem, and rebuild testing and ownership discipline before layering on ambitious offload projects or promising bare-metal capabilities to flagship customers.
But that path was never pursued. The organization had already adapted to constant firefighting. More importantly, Microsoft no longer had the deep senior systems talent — the experienced kernel, virtualization, and distributed-systems engineers who built the original Fabric — needed for such a fundamental overhaul.
Replacing or re-architecting a system of Azure’s scale and complexity is like swapping an airplane’s engines mid-flight. Not impossible in theory, but extremely risky in practice, especially when the crew has changed and the original expertise has mostly left.
The reality is clear: there is no quick fix. Azure is in a deep structural hole, and the company must now operate with the platform it has while stabilizing it under full load.
The situation was salvageable, though. In 2024, I read the OpenAI PM specs, which detail the demands and promises Azure made to meet their needs.
The current plans are likely to fail — history has proven that hunch correct — so I began creating new ones to rebuild the Azure node stack from first principles.
A simple cross-platform component model to create portable modules that could be built for both Windows and Linux, and a new message bus communication system spanning the entire node, where agents could freely communicate across guest, host, and SoC boundaries, were the foundational elements of a new node platform. Those ideas were widely shared through written documents, with some presented at a high-profile cross-organization technical meeting.
Some of OpenAI’s requests for their future bare-metal nodes, which would have allowed them to extract the last few percent from the hardware, required extensions to the Overlake card itself. I drafted these extensions and shared them with a division’s Technical Fellow, a renowned kernel architect who had recently shifted to Azure and whom I knew from my previous tenure in the kernel team.
The improvements might have been part of Overlake 4, the next major version of the Azure Boost offloading platform, and a software-only implementation could have been deployed in the meantime to enable true read-only remote system images and fast system resets, a useful feature that allows for quick experimentation and rollbacks common in research domains.
I created a new code repository that adheres to the latest Azure governance standards and began developing actual components, aiming to set an example and build momentum.
I solved the “million files deletion problem,” which seems simple but still needs careful handling to run reliably at cloud scale. Next, I built an encrypting LRU cache to separate tenants’ data and follow basic security principles in hostile multi-tenancy environments. Still fairly simple, but that’s the goal of componentization.
These components could be called directly from existing code, significantly enhancing resilience and security with minimal changes beyond deletions.
The practical strategy I suggested was incremental improvement, where code sections are isolated and replaced with a simple call to a new component: choose an area, develop and thoroughly test a reliable, reusable replacement, then remove the old code and replace it with a call to the new component.
This strategy goes a long way toward modernizing a running system with minimal disruption and offers gradual, consistent improvements. It uses small, reliable components that can be easily tested separately and solidified before integration into the main platform at scale.
Eventually, there is nothing left to carve out, and the original components are just skeletons calling into new ones. Componentization also enables moving elements around; for example, a secure cache could be used on the offload accelerator, on the host, inside a guest VM, a guest L1/L2 container, or on a bare-metal node, with a uniform message bus connecting all parts.
This vision was met with disdain among lower-level management in Azure Core, who may not have understood the urgency — or the scale — of the changes needed to make the platform truly scalable while lowering long-term OPEX costs.
Gradual enhancement through componentization challenged the status quo of constant firefighting and the comfort of familiar, yet fragile, code paths.
In the end, the organization chose the easiest route at the moment: keep adding complexity on a fragile foundation instead of investing in a careful, step-by-step modernization that could have restored autonomy and reliability.
The outcome was expected. High-profile commitments fell through, customer trust continued to decline, and the internal weaknesses I had pointed out from the start kept showing in more visible ways outside the Redmond campus.
What started as engineering disagreements turned into something bigger: a test of whether Microsoft could still perform at the level its most strategic customers and partners expected.
The hoofbeats grew louder. Over the following months, I extended my concerns beyond my direct managers.

https://isolveproblems.substack.com/p/how-microsoft-vaporized-a-trillion-b73

How Microsoft Vaporized a Trillion Dollars, Pt. 6

(Continued from Part 5)
Over the following months, with the patterns I had documented — agent sprawl and testing gaps, the continuous influx of crashes, the security surface in foundational services, and the repeated preference for short-term mitigations over structural fixes — all becoming increasingly difficult to contain at the working level, I extended my concerns upward.
On November 19, 2024, I sent a detailed letter to the Executive Vice President of Cloud + AI.
It laid out the technical findings in full, referenced the leadership gaps I had observed, and included concrete proposals for addressing the root causes.
On January 7, 2025 — still months before any public indication of strain in the OpenAI relationship — I sent a more concise executive summary to the CEO.
The letter opened with the potential risks to national security and to Microsoft’s core business, then followed with a compact set of bullets summarizing the key issues in the Azure node stack and the organizational challenges I had seen.
It also noted that I stood ready to help lead a first-principles reconstruction of the Azure node management layer, if given the opportunity in the right capacity.
When those communications produced no acknowledgment, I took the customary step of writing to the Board through the corporate secretary.
That letter referenced the lack of response to the earlier messages, attached the communication sent to the CEO, and observed that the quasi-loss of OpenAI and the related issues appeared preventable given the advance warnings.
In the months that followed, I received no reply — not a single acknowledgment, question, request for clarification, or confirmation of receipt — from the EVP, the CEO, or the Board.
This complete absence of any feedback added its own dimension.
The issues had been surfaced in calibrated, good-faith communications well in advance of visible customer shifts.
Public optimism around Azure capabilities and strategic commitments continued at full pace. Yet the ground-level signals simply produced silence.
The series began with a single engineer’s shock on his first day back in the organization.
It ends with the same observation, now seen at every layer: the foundational problems in the node stack were visible, the operational and security consequences were measurable, and the proposed paths forward were concrete.
At no level did those signals generate a response.
The hoofbeats I mentioned in the previous installment had become audible far beyond the Azure Core buildings on the West Campus.
Whether they were heard at the top remains unknown.

This article is licensed under the copyright holder's CC BY 4.0 license.