K8s is awesome, but at the same time, it’s complicated!

K8s is Awesome!

Kubernetes (K8s) is an open-source system for automating deployment, scaling, and management of containerized applications. It is a container orchestration tool, but it is not limited to that: it also provides storage orchestration, service discovery, load balancing, automated rollouts, self-healing, secret and configuration management, and horizontal scaling, and it offers a declarative way to define the cluster state.

Source: https://kubernetes.io/docs/concepts/overview/components/

The biggest use case for K8s is that it encourages microservice-based architecture, by allowing microservices to be scaled independently.

But at the same time, it’s complicated!

As you scale up K8s and bring in more services, the complexity grows dramatically.

Few prominent issues that I…

Writing and debugging Jsonnet code is not straightforward, but let’s see what we have at our disposal.

As a data scientist and software engineer, I have been using YAML files to define jobs for training machine learning models. Recently, I switched to Jsonnet, a lazy data-templating language from Google, to DRY up (Don’t Repeat Yourself) the configuration code. The most important benefit was being able to reuse templates, and hence less maintenance.
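Jsonnet itself is covered in the article, but the DRY idea it enables can be sketched in plain Python as an analogy. All names here are illustrative, not taken from my actual configs: a base job template is defined once, and each job spells out only its differences.

```python
# Illustrative sketch of the DRY idea (Python, as an analogy to Jsonnet
# templating): define a base job template once, then derive variants from it.
def training_job(name, gpus=1, epochs=10):
    """Build a job config dict from a shared base template."""
    return {
        "kind": "TrainingJob",          # hypothetical resource kind
        "metadata": {"name": name},
        "spec": {"gpus": gpus, "epochs": epochs},
    }

# Two jobs reuse the same template; only their differences are spelled out.
small = training_job("train-small")
large = training_job("train-large", gpus=8, epochs=50)
```

With templates like this, a fix to the base propagates to every job, which is exactly the maintenance win described above.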

Here is how anyone with knowledge of a programming language (such as Python) can get up to speed with Jsonnet.

· Getting started with…

Instead of TDD (Test-Driven Development), which forces you to think about tests first, one can practice TPD (Test-Paralleled Development). In TPD, you start with code development, but as you develop, you check the correctness of the code by writing and executing tests (instead of running the code directly or using the console).
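A minimal sketch of TPD, assuming a toy function under development: the test is written alongside the code and run programmatically, replacing ad hoc console checks.

```python
import unittest

# Code under development: written first, not test-first.
def normalize(values):
    """Scale a list of numbers so they sum to 1.0."""
    total = sum(values)
    return [v / total for v in values]

# The test is written in parallel with the code (TPD), not before it (TDD),
# and is executed instead of poking at the function in a console.
class TestNormalize(unittest.TestCase):
    def test_sums_to_one(self):
        result = normalize([1, 1, 2])
        self.assertEqual(result, [0.25, 0.25, 0.5])
        self.assertAlmostEqual(sum(result), 1.0)

# Run just this test case programmatically while developing.
suite = unittest.defaultTestLoader.loadTestsFromTestCase(TestNormalize)
result = unittest.TextTestRunner(verbosity=0).run(suite)
```

Once development settles, the same test case can be picked up unchanged by `python -m unittest`.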

Photo by David Clode on Unsplash

Table of Contents

· The unittest module: Unit Testing Framework
∘ The unittest Basics
∘ Running Tests using unittest Module
· The unittest.mock module: Mock Object Library
∘ Mock Class: Mocking objects and/or attributes
∘ The MagicMock Class
∘ Patching imports with patch
∘ Mock Helpers

The unittest module: Unit Testing Framework


  • The…
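As a taste of the mocking topics listed above, here is a small sketch using the standard library’s `unittest.mock`. The function and its names are hypothetical, not from the article:

```python
from unittest import mock
import os

# Hypothetical code under test: it fetches a price through some HTTP
# client that we don't want to hit from a unit test.
def price_in_cents(client):
    data = client.get("/price")     # real network call in production
    return round(data["usd"] * 100)

# Mock stands in for the real client; return_value fakes the response.
fake_client = mock.Mock()
fake_client.get.return_value = {"usd": 19.99}

assert price_in_cents(fake_client) == 1999
fake_client.get.assert_called_once_with("/price")

# patch temporarily replaces an attribute for the duration of a with-block.
with mock.patch("os.getcwd", return_value="/tmp/fake"):
    assert os.getcwd() == "/tmp/fake"
```

The `Mock` object also records how it was called, which is what `assert_called_once_with` verifies.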

Orchestrate parallel jobs on K8s with the container-native workflow engine.

Photo by frank mckenna on Unsplash

Table of Contents

Argo CLI
Deploying Applications
Argo Workflow Specs

Argo Workflows is an open-source, container-native workflow engine for orchestrating parallel jobs on K8s. It is implemented as a K8s CRD (Custom Resource Definition). As a result, Argo workflows can be managed using kubectl and natively integrate with other K8s services such as volumes, secrets, and RBAC. Each step in an Argo workflow is defined as a container.

  • Define workflows where each step in the workflow is a container.
  • Model multi-step workflows as a sequence of tasks or capture…

JSON Schema and OpenAPI can seem similar but have different use-cases.

To begin, how do JSON Schema and OpenAPI differ? In contrast to JSON Schema, an OpenAPI document is a definition for an entire API, not just data models. One might instead compare JSON Schema with the OpenAPI data model.

Why the need to validate JSON?

There are a plethora of use-cases, but let me explain why I use it:

Enter the world of Kubernetes and you’ll find yourself surrounded by object manifests which are either defined as YAML or JSON. But having to maintain thousands of such manifests can be a nightmare if your code is…
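As a minimal sketch of what such validation buys you, here is the idea in plain Python. This hand-rolled checker is illustrative only and covers nothing like the full JSON Schema spec (real projects would use a proper validator library); `SCHEMA` is a hypothetical example mapping required keys to expected types.

```python
import json

# Illustrative, hand-rolled checker (NOT real JSON Schema): required keys
# mapped to the Python types their values must have.
SCHEMA = {"name": str, "replicas": int}

def validate(document, schema=SCHEMA):
    """Return a list of validation errors (an empty list means valid)."""
    errors = []
    for key, expected_type in schema.items():
        if key not in document:
            errors.append(f"missing key: {key}")
        elif not isinstance(document[key], expected_type):
            errors.append(f"{key}: expected {expected_type.__name__}")
    return errors

# A manifest with a wrongly-typed field fails validation before deployment.
manifest = json.loads('{"name": "web", "replicas": "three"}')
print(validate(manifest))
```

Catching a type mismatch like this at validation time is far cheaper than discovering it after a manifest has been applied to a cluster.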

Software Engineering, Systems

Don’t get confused with these two similar but different patterns, and know which one to use when.

Photo by José Pablo Domínguez on Unsplash

This difference is important not just to generic Software Engineers, but also to Data Engineers, and it is the basis for understanding event-driven architectures for data pipelines.

Let’s look at both of them individually before we eventually list out the differences.

Observer Pattern

“The observer pattern is a software design pattern in which an object, called the subject, maintains a list of its dependents, called observers, and notifies them automatically of any state changes, usually by calling one of their methods.” — Wikipedia definition [1]
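The definition above maps almost line for line onto code. A minimal Python sketch (class names are illustrative): the subject keeps a list of observers and calls each one’s `update` method on a state change.

```python
# Minimal observer pattern: the subject maintains a list of dependents
# and notifies each of them automatically when its state changes.
class Subject:
    def __init__(self):
        self._observers = []
        self.state = None

    def attach(self, observer):
        self._observers.append(observer)

    def set_state(self, state):
        self.state = state
        for observer in self._observers:   # notify every dependent
            observer.update(state)

class RecordingObserver:
    """Toy observer that just records the states it was notified about."""
    def __init__(self):
        self.seen = []

    def update(self, state):
        self.seen.append(state)

subject = Subject()
watcher = RecordingObserver()
subject.attach(watcher)
subject.set_state("ready")
assert watcher.seen == ["ready"]
```

Note that the subject holds direct references to its observers, which is one of the points of contrast with pub-sub.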


When do I need to use MapReduce? How can I translate my jobs to Map, Combiner, and Reducer?

Photo by Brooke Lark on Unsplash

MapReduce is a programming technique for manipulating large data sets, whereas Hadoop MapReduce is a specific implementation of this programming technique.

Following is how the process looks in general:

Map(s) (for individual chunk of input) ->
- sorting individual map outputs ->
Combiner(s) (for each individual map output) ->
- shuffle and partition for distribution to reducers ->
- sorting individual reducer input ->
Reducer(s) (for sorted data of group of partitions)
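The pipeline above can be simulated single-process with a word count. This is only a sketch of the stages: in Hadoop, the map, combine, shuffle, and reduce phases run distributed across many nodes.

```python
from collections import Counter
from itertools import chain

def mapper(chunk):
    """Map: emit (word, 1) pairs for one chunk of input."""
    return [(word, 1) for word in chunk.split()]

def combiner(pairs):
    """Combine: locally pre-aggregate one mapper's output."""
    counts = Counter()
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

def reducer(word, counts):
    """Reduce: sum all counts shuffled to this key."""
    return word, sum(counts)

chunks = ["big data big", "data data"]
combined = [combiner(mapper(c)) for c in chunks]     # map + combine per chunk

# Shuffle/partition: group values for the same key across combiner outputs.
shuffled = {}
for word, n in chain.from_iterable(combined):
    shuffled.setdefault(word, []).append(n)

result = dict(reducer(w, ns) for w, ns in shuffled.items())
print(result)   # word counts across all chunks
```

The combiner is the interesting bit: it shrinks each mapper’s output before the shuffle, which is where the network cost lives in a real cluster.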

Hadoop’s MapReduce In General

Hadoop MapReduce is a framework to write applications that process…

Let’s uncover the practical details of Pandas’ Series, DataFrame, and Panel

Photo by Stan Y on Unsplash

Note to the Readers: Paying attention to comments in examples would be more helpful than going through the theory itself.

Pandas is a column-oriented data analysis API. It’s a great tool for handling and analyzing input data, and many ML frameworks support pandas data structures as inputs.

Pandas Data Structures

Refer to Intro to Data Structures in the pandas docs.

The primary data structures in pandas are implemented as two classes: DataFrame and Series.
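A minimal example of the two classes (note that the third structure mentioned earlier, Panel, has since been removed from pandas):

```python
import pandas as pd

# Series: a one-dimensional labeled array (a single column).
population = pd.Series([8.4, 3.9], index=["NYC", "LA"])

# DataFrame: a table of named columns, where each column is a Series.
cities = pd.DataFrame({
    "population_millions": population,
    "state": pd.Series(["NY", "CA"], index=["NYC", "LA"]),
})

# Columns come back as Series; individual cells are selected by label.
print(cities.loc["NYC", "state"])
print(cities["population_millions"].sum())
```

Selecting a column of a DataFrame yields a Series, which is why the two classes are best learned together.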


To fully utilize the power of shell scripting (and programming), one needs to master regular expressions. Certain commands and utilities commonly used in scripts, such as grep, expr, sed, and awk, use REs.

In this article, we are going to talk about regular expressions. Below is a table of contents to give you a gist of what will be covered and to help with navigation:

· What is Regex?
· Regex Metacharacters
· How a Regex Engine works internally?
· Character Sets (or Classes): [ ]
· Word Sets (Alternation): |
· The Dot
· Anchors
· Repetition (?, *…
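As a quick preview of those topics, here is how a few of them look with Python’s `re` module:

```python
import re

# A quick tour of a few metacharacters from the list above.
text = "cat bat rat Cat"

# Character set [ ]: 'c' or 'b', followed by 'at'.
assert re.findall(r"[cb]at", text) == ["cat", "bat"]

# Alternation |: either whole alternative matches.
assert re.findall(r"cat|rat", text) == ["cat", "rat"]

# The dot: any single character before 'at'.
assert re.findall(r".at", text) == ["cat", "bat", "rat", "Cat"]

# Anchor ^: the match must start at the beginning of the string.
assert re.match(r"^cat", text) is not None

# Repetition ?: the preceding character is optional.
assert re.fullmatch(r"colou?r", "color") is not None
```

The same metacharacters carry over to grep, sed, and awk, with minor dialect differences.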

Data processing with a general-purpose distributed data processing engine.

Photo by Scott Webb on Unsplash

Apache Spark, written in Scala, is a general-purpose distributed data processing engine. In other words: load big data, do computations on it in a distributed way, and then store it.

Spark provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

To run Spark, you can either spin up your own cluster or use…

Munish Goyal

Designing and building large scale applications/APIs, ambitious data models, and workflows! https://goyalmunish.github.io
