Senior Site Reliability Engineer / Infrastructure Architect (Principal DevOps Engineer)

97EX

$2.6-3K [Monthly]

Remote | 3-5 years of experience | Bachelor's degree | Full-time

Remote Details

Open countries: Worldwide

Language requirement: Chinese

Job Description

Benefits

Employee Recognition
Distributed team, no employee monitoring, no workplace politics
Time Off & Leave
Paid time off, unlimited or flexible PTO, government-mandated leave

Job Responsibilities

1. Cloud Native Architecture Design and Governance
- Design highly available architectures on AWS and Cloudflare, extending beyond CDN configuration to implement edge logic with Cloudflare Workers and secure access layers using Argo Tunnel/Zero Trust.
- Manage AWS multi-account structures via Organizations, and architect cross-Region networking (Transit Gateway, VPC Peering, VPN) to resolve complex connectivity and latency challenges.
- Enforce Infrastructure as Code (Terraform/Pulumi) across edge rules and underlying resources to minimize manual console operations.

2. Deep Kubernetes Engineering
- Maintain large-scale EKS or self-managed clusters, including performance tuning and troubleshooting of core components such as etcd, CNI plugins (Cilium/Calico), and CoreDNS.
- Develop Kubernetes Operators/Controllers or kubectl plugins to enhance platform automation based on business requirements.
- Bridge local development and production environments (Docker Compose to Helm/Kustomize) to ensure consistency.

3. Engineering Productivity and Observability
- Design and maintain complex CI/CD pipelines that integrate code quality analysis (SonarQube), container image security scanning, and automated testing.
- Implement GitOps workflows using ArgoCD or Flux.
- Build a Prometheus-based monitoring system with in-depth runtime (Go/Java) and system-level (eBPF) performance analysis.

4. System-Level Support and Reliability
- Maintain middleware such as Nginx, Redis, and Kafka, with the ability to debug at the source level and tune parameters.
- Address system bottlenecks under high concurrency (TCP queues, file handles, memory management).

Job Requirements

- Linux Systems Expert: Deep understanding of Linux kernel internals and proficient use of perf, strace, tcpdump, eBPF, and other tools to diagnose CPU, I/O, and network issues in production.
- Cloud and Networking Proficiency: Familiarity with AWS infrastructure limits (API rate limits, EBS IOPS) and Cloudflare fundamentals (Anycast, SSL handshake), with a deep understanding of the TCP/IP stack and HTTP/2/3 protocols.
- Kubernetes Hands-On Experience: In-depth knowledge of cgroups and namespaces, service meshes (Istio/Linkerd), and rapid diagnosis of pod scheduling failures or crashes.
- Development Skills: Proficient in Go or Python, capable of reading open-source code, fixing bugs, and developing backend tools.

Preferred Qualifications

- Contributor to CNCF open source projects.
- Experience maintaining systems handling hundreds of millions of daily requests.
- Hands-on experience implementing chaos engineering in production environments.