Kubernetes (K8s) is an open-source system for automating deployment, scaling, and management of containerized applications. It is a container orchestration tool, but it is not limited to that: it also provides storage orchestration, service discovery, load balancing, automated rollouts, self-healing, secret and configuration management, and horizontal scaling, and it lets you declare the desired cluster state.
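The declarative model mentioned above can be sketched with a minimal Deployment manifest. This is an illustrative example, not from the article; the names and image are made up:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web
spec:
  replicas: 3                 # desired state: K8s keeps three pods running (self-healing)
  selector:
    matchLabels: { app: web }
  template:
    metadata:
      labels: { app: web }
    spec:
      containers:
        - name: web
          image: nginx:1.25
          ports:
            - containerPort: 80
```

You declare *what* you want (three replicas of this container); the control plane continuously reconciles the cluster toward that state.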
The biggest use case for K8s is that it encourages microservice-based architecture by allowing microservices to be scaled independently.
As you scale up K8s and bring in more services, the complexity grows dramatically.
A few prominent issues that I…
Writing and debugging Jsonnet code is not straightforward, but let’s see what we have at our disposal.
As a data scientist and a software engineer, I have been using YAML files to define jobs that train machine learning models. Recently, I switched to Jsonnet, a lazy data templating language by Google, to DRY up (Don’t Repeat Yourself) the configuration code. The most important benefit was being able to reuse templates, and hence less maintenance.
Here is how anyone with knowledge of a programming language (such as Python) can get up to speed with Jsonnet.
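As a taste of the DRY-up idea, here is a minimal Jsonnet sketch; the job fields (image, resources, etc.) are hypothetical, not taken from the article:

```jsonnet
// A shared base template; concrete jobs override only what differs.
local baseJob = {
  image: 'trainer:latest',
  restartPolicy: 'Never',
  resources: { cpu: 2, memory: '4Gi' },
};

{
  // `+:` merges into the inherited object instead of replacing it.
  'train-small.json': baseJob { resources+: { gpus: 1 } },
  'train-large.json': baseJob { resources+: { cpu: 8, memory: '32Gi', gpus: 4 } },
}
```

Running `jsonnet -m . jobs.jsonnet` on a file like this materializes one JSON manifest per key, all derived from a single template.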
Instead of TDD (Test-Driven Development), which forces you to think about tests first, one can practice TPD (Test-Paralleled Development). In TPD, you start with code development, but as you develop, you check the correctness of the code by writing and executing tests (instead of running the code directly or poking at it in the console).
· The unittest module: Unit Testing Framework
∘ The unittest Basics
∘ Running Tests using the unittest Module
· The unittest.mock module: Mock Object Library
∘ Mock Class: Mocking objects and/or attributes
∘ The MagicMock Class
∘ Patching imports with patch
∘ Mock Helpers
The unittest module: Unit Testing Framework
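A minimal sketch of the unittest workflow, with a made-up function under test:

```python
import unittest

def add(a, b):
    """Toy function under test."""
    return a + b

class TestAdd(unittest.TestCase):
    def test_add_integers(self):
        self.assertEqual(add(2, 3), 5)

    def test_add_strings(self):
        # `+` concatenates strings, so add() works on them too.
        self.assertEqual(add("foo", "bar"), "foobar")

if __name__ == "__main__":
    unittest.main(exit=False)   # discover and run the TestCase above
```

Saving this as a file and running it executes both tests and reports the results; this is the "execute tests instead of the console" loop of TPD.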
Orchestrate parallel jobs on K8s with the container-native workflow engine.
Argo Workflows is an open-source container-native workflow engine for orchestrating parallel jobs on K8s. Argo Workflows is implemented as a K8s CRD (Custom Resource Definition). As a result, Argo workflows can be managed using kubectl and natively integrate with other K8s services such as volumes, secrets, and RBAC. Each step in an Argo workflow is defined as a container.
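A minimal hello-world Workflow sketch; the names and image are illustrative, not from the article:

```yaml
apiVersion: argoproj.io/v1alpha1
kind: Workflow
metadata:
  generateName: hello-        # K8s appends a random suffix per run
spec:
  entrypoint: say-hello
  templates:
    - name: say-hello
      container:              # each step is just a container
        image: alpine:3.18
        command: [echo, "hello from Argo"]
```

Because a Workflow is a CRD, `kubectl create -f workflow.yaml` submits it like any other K8s object.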
JSON Schema and OpenAPI can seem similar but have different use cases.
To begin, how do JSON Schema and OpenAPI differ? In contrast to JSON Schema, an OpenAPI document is a definition for an entire API, not just data models. One might compare JSON Schema with the OpenAPI data model.
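For instance, a JSON Schema describes only the shape of one piece of data; this hedged sketch (the `user`-like model is made up) would sit under `components.schemas` in an OpenAPI document, which additionally defines paths, operations, and responses:

```json
{
  "$schema": "https://json-schema.org/draft/2020-12/schema",
  "type": "object",
  "properties": {
    "id":   { "type": "integer" },
    "name": { "type": "string" }
  },
  "required": ["id"]
}
```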
There are a plethora of use cases, but let me explain why I use it:
Enter the world of Kubernetes and you’ll find yourself surrounded by object manifests defined as either YAML or JSON. But having to maintain thousands of such manifests can be a nightmare if your code is…
Don’t get confused with these two similar but different patterns, and know which one to use when.
This difference matters not just to generic software engineers but also to data engineers, and it is the basis for understanding event-driven architectures for data pipelines.
Let’s look at each of them individually before we eventually list out the differences.
“The observer pattern is a software design pattern in which an object, called the subject, maintains a list of its dependents, called observers, and notifies them automatically of any state changes, usually by calling one of their methods.” — Wikipedia definition 
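The definition above maps directly onto code. A minimal sketch in Python; class and method names are illustrative, not from the article:

```python
class Subject:
    """Maintains a list of observers and notifies them on state changes."""
    def __init__(self):
        self._observers = []
        self._state = None

    def attach(self, observer):
        self._observers.append(observer)

    def detach(self, observer):
        self._observers.remove(observer)

    @property
    def state(self):
        return self._state

    @state.setter
    def state(self, value):
        self._state = value
        self._notify()          # automatic notification on every change

    def _notify(self):
        for observer in self._observers:
            observer.update(self)   # "calling one of their methods"

class LoggingObserver:
    """A dependent that records every state it is notified about."""
    def __init__(self):
        self.seen = []

    def update(self, subject):
        self.seen.append(subject.state)

subject = Subject()
observer = LoggingObserver()
subject.attach(observer)
subject.state = 1
subject.state = 2
```

Note that the subject pushes notifications to observers it holds direct references to; there is no intermediary, which is exactly where it differs from pub-sub.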
When do I need to use MapReduce? How can I translate my jobs to Map, Combiner, and Reducer?
MapReduce is a programming technique for manipulating large data sets, whereas Hadoop MapReduce is a specific implementation of this programming technique.
Following is how the process looks in general:
Map(s) (for each individual chunk of input) ->
- sorting of individual map outputs ->
Combiner(s) (for each individual map output) ->
- shuffle and partition for distribution to reducers ->
- sorting of individual reducer inputs ->
Reducer(s) (for the sorted data of a group of partitions)
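The flow above can be simulated in plain Python (not Hadoop) with the classic word count; function names are illustrative:

```python
from collections import defaultdict
from itertools import groupby
from operator import itemgetter

def mapper(chunk):
    # Map: emit (word, 1) for every word in one input chunk.
    return [(word, 1) for word in chunk.split()]

def combiner(pairs):
    # Combine: pre-aggregate one mapper's output to reduce shuffle traffic.
    counts = defaultdict(int)
    for word, n in pairs:
        counts[word] += n
    return list(counts.items())

def reducer(key, values):
    # Reduce: sum the counts for a single key.
    return key, sum(values)

chunks = ["the quick brown fox", "the lazy dog", "the fox"]
combined = [pair for c in chunks for pair in combiner(mapper(c))]
combined.sort(key=itemgetter(0))   # shuffle/sort: group intermediate pairs by key
result = dict(reducer(k, [v for _, v in grp])
              for k, grp in groupby(combined, key=itemgetter(0)))
# result -> {'brown': 1, 'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

In a real Hadoop job, the mappers, combiners, and reducers run on different machines and the framework handles the sort, shuffle, and partitioning between them.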
Hadoop MapReduce is a framework to write applications that process…
Let’s uncover the practical details of Pandas’ Series, DataFrame, and Panel
Note to the Readers: Paying attention to comments in examples would be more helpful than going through the theory itself.
Pandas is a column-oriented data analysis API. It’s a great tool for handling and analyzing input data, and many ML frameworks support pandas data structures as inputs.
Refer to Intro to Data Structures in the pandas docs.
The primary data structures in pandas are implemented as two classes: DataFrame and Series.
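A minimal sketch of the two structures (the column names and numbers are made up for illustration):

```python
import pandas as pd

# Series: a one-dimensional labeled array.
population = pd.Series([38, 5, 67], index=["Canada", "Norway", "UK"])

# DataFrame: a two-dimensional labeled table; each column is itself a Series.
df = pd.DataFrame({
    "country": ["Canada", "Norway", "UK"],
    "population_m": [38, 5, 67],
})

# Selecting a single column gives a Series back.
col = df["population_m"]
```

Note the column-oriented design: a DataFrame is essentially a dict of aligned Series sharing one index.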
To fully utilize the power of shell scripting (and programming), one needs to master Regular Expressions. Certain commands and utilities commonly used in scripts, such as awk, use REs.
In this article, we are going to talk about Regular Expressions. Below is a “Table of Contents” to give you the gist of what will be covered and to help with navigation:
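As a quick taste, here is a made-up pattern exercising anchors, character classes, and quantifiers, shown via Python’s re module; the same extended-RE syntax carries over to grep -E, sed -E, and awk:

```python
import re

# Anchored, loose IPv4-shaped pattern: three "digits-then-dot" groups, then digits.
ip_re = re.compile(r'^([0-9]{1,3}\.){3}[0-9]{1,3}$')

assert ip_re.match("192.168.1.10")
assert not ip_re.match("host 10.0.0.1")   # ^ and $ force a whole-line match
assert not ip_re.match("999.1.1")         # only three octets, so no match
```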
Data processing with a general-purpose distributed data processing engine.
Apache Spark, written in Scala, is a general-purpose distributed data processing engine. Or in other words: load big data, do computations on it in a distributed way, and then store it.
Spark provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.
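The "load, compute in a distributed way, store" loop looks roughly like this word-count sketch, assuming a local pyspark installation (the file paths are made up, and this needs a Spark runtime to actually execute):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.master("local[*]").appName("wordcount").getOrCreate()

counts = (spark.sparkContext
          .textFile("input.txt")                 # load big data
          .flatMap(lambda line: line.split())    # distributed computation
          .map(lambda word: (word, 1))
          .reduceByKey(lambda a, b: a + b))

counts.saveAsTextFile("counts")                  # store the result
spark.stop()
```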
To run Spark, you can either spin up your own cluster or use…