Navigating the Data Landscape

Big data and artificial intelligence are reshaping the data world.

Analytics has come a long way from using a simple database and a spreadsheet to help drive important organizational decisions. Today, the systems and skills needed for data analysis are more sophisticated, which opens exciting opportunities.

What does that landscape look like, and how can we make decisions that open up new opportunities?

Driving Innovation

Artificial intelligence opens entirely new opportunities for organizations.

Smart Logistics

Data from operational systems and sensors allows you to understand how you interact with customers and suppliers in order to improve operational efficiency in the supply chain, warehouse, and delivery channels.

Customer Relationships

Information from customer relationship software, support systems, web and mobile, email/SMS, and in-person interaction can be combined to understand how you deliver your products and services. This allows you to improve quality of service, make informed decisions about product features, improve customer experience, and reduce churn.

Recommendations, Personalization, and New Products

Big data allows you to integrate data from many sources in order to build rich customer profiles. Such profiles allow you to make unique recommendations and personalize your offerings. This enables better experiences and higher customer conversions.

Using data about how customers interact with your services allows you to identify new opportunities and ways to improve your existing offerings.

Public Health

The ability to aggregate intelligence about risks to public health allows governments and public organizations to respond quickly. Pipelines that combine information about outbreaks from multiple sources can help inform timely action while keeping the public and others accurately informed.

Building Data-Driven Organizations

Providing value through analytics requires you to have data, enrich it with other information, and make it broadly available to the stakeholders who need it. This requires strong teams working together.

No organization becomes a "Data Science," "Machine Learning," or "Artificial Intelligence" powerhouse overnight. Rather, the capabilities are built gradually, line upon line, with more advanced capabilities building upon earlier successes.

Data is Multi-disciplinary

The effective practice of Data Science is about building teams and culture.

Software Development

Software engineers automate processes, create interfaces to make information available, and build entirely new systems.

Data Engineering

Data engineers have a deep understanding of the sources of information in an organization and create the systems required to transform data into intelligence. Working with data scientists and software engineers, they also create the infrastructure to transform insight into action.

Data Science

Data scientists help organizations ask important questions using data.

Business Analysts

Business analysts have deep domain knowledge and help to frame data insights within the larger picture of an organization's operations and mission.

Data Growth

The amount of data produced each year has increased exponentially. Per IDC, global data generation is on track to reach 180 zettabytes by 2025.

Technology to the Rescue

As the amount of data available has increased, new technologies have emerged that allow us to mine and analyze the information.

Storage

Data needs a place to live.

Compute

Compute provides the brains of Big Data.

Streaming

Streaming allows insight in real time.

Big Data Timeline

Storage

  • Google File System. In October 2003, Google published a paper describing large-scale, redundant, distributed storage of files.
  • HDFS (2006). Work started on a large-scale, redundant, distributed storage system as part of the Hadoop project.
  • OpenStack Swift (2009). Swift was started to provide a self-hosted alternative to Amazon S3.
  • Ceph (2010). Created to provide a robust, unified storage model, including support for the S3 API.
  • MinIO (2014). A lightweight implementation of the S3 API for development, testing, and small-scale cluster deployments (see the sketch after this list).
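
All of the systems above converge on the S3 API as a common object-storage interface. As a minimal sketch of what that looks like in practice, the following Python snippet uses boto3 to write and read an object against a locally running MinIO server; the endpoint URL, credentials, and bucket name are assumptions for illustration, not values from the text.

    # Minimal sketch: talking to an S3-compatible store (e.g., a local MinIO
    # server) through boto3. Endpoint, credentials, and bucket are assumed.
    import boto3

    s3 = boto3.client(
        "s3",
        endpoint_url="http://localhost:9000",        # assumed local MinIO endpoint
        aws_access_key_id="minioadmin",              # assumed development credentials
        aws_secret_access_key="minioadmin",
    )

    s3.create_bucket(Bucket="demo-bucket")                      # create a bucket
    s3.put_object(Bucket="demo-bucket", Key="hello.txt",
                  Body=b"Hello, object storage!")               # write an object
    obj = s3.get_object(Bucket="demo-bucket", Key="hello.txt")  # read it back
    print(obj["Body"].read().decode())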

Compute

  • MapReduce. In 2004, Google published a second white paper describing a new programming and execution model for processing information.
  • Hadoop/YARN. In 2006, the Hadoop project started work on an open-source implementation of Google's MapReduce.
  • Apache Spark. In 2009, Matei Zaharia started Apache Spark to address some of the processing limitations of Hadoop MapReduce.
  • Docker. In March 2013, Docker was released, providing a toolset to create containerized apps to run on Linux.
  • Kubernetes. In June 2014, Google released Kubernetes, an "orchestration engine" used to manage Linux containers at scale.
  • Spark on Kubernetes. As of Spark 2.4.4, Apache Spark offered beta support for running on Kubernetes clusters, with full support expected in Spark 3.0 (see the sketch after this list).
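
To make the last item concrete, here is a minimal sketch of starting a PySpark session in client mode against a Kubernetes cluster. The API-server URL, container image, and namespace are placeholder assumptions, not values from the text.

    # Minimal sketch: a PySpark session submitted to a Kubernetes cluster in
    # client mode. The master URL, container image, and namespace below are
    # placeholder assumptions.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("spark-on-k8s-demo")
        .master("k8s://https://kubernetes.example.com:6443")                   # assumed API server
        .config("spark.kubernetes.container.image", "myrepo/spark-py:latest")  # assumed image
        .config("spark.kubernetes.namespace", "data-jobs")                     # assumed namespace
        .config("spark.executor.instances", "2")
        .getOrCreate()
    )

    # A trivial distributed computation to confirm the executors are reachable.
    print(spark.range(1_000_000).selectExpr("sum(id)").collect())
    spark.stop()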

Open Source Data Geography

Data technologies can be grouped into distinct, inter-related regions of interoperability that together enable more sophisticated capabilities.

Infrastructure/Cloud

Modern data systems are powered by cluster computing, made possible by virtualization, containers, and the cloud.

Runtime

Building on the elastic capabilities of the cloud, databases (MySQL, PostgreSQL, Oracle), NoSQL stores (Cassandra, MongoDB, Neo4j), search engines (Elasticsearch), distributed storage (HDFS, Ceph, MinIO), and computational engines (Spark, Dask) work together to power data workloads and provide interfaces for asking questions.
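
As one concrete illustration of these pieces working together, the sketch below uses a Spark session to read Parquet event data from an S3-compatible object store (such as MinIO or Ceph) and join it with a reference table pulled from PostgreSQL over JDBC. The endpoints, credentials, paths, table, and column names are assumptions for illustration.

    # Minimal sketch: a computational engine (Spark) combining distributed
    # storage (an S3-compatible store) with a relational database (PostgreSQL
    # via JDBC). Endpoints, credentials, paths, and names are assumed.
    from pyspark.sql import SparkSession

    spark = (
        SparkSession.builder
        .appName("runtime-demo")
        .config("spark.hadoop.fs.s3a.endpoint", "http://localhost:9000")  # assumed object-store endpoint
        .config("spark.hadoop.fs.s3a.access.key", "minioadmin")           # assumed credentials
        .config("spark.hadoop.fs.s3a.secret.key", "minioadmin")
        .config("spark.hadoop.fs.s3a.path.style.access", "true")
        .getOrCreate()
    )

    # Event data stored as Parquet in object storage.
    events = spark.read.parquet("s3a://demo-bucket/events/")               # assumed path

    # Customer reference data served by PostgreSQL.
    customers = (
        spark.read.format("jdbc")
        .option("url", "jdbc:postgresql://localhost:5432/analytics")       # assumed database
        .option("dbtable", "customers")                                    # assumed table
        .option("user", "analyst")
        .option("password", "secret")
        .load()
    )

    # Join the two sources and count events per customer segment.
    (events.join(customers, "customer_id")
           .groupBy("segment")
           .count()
           .show())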

Interface

Through the use of business intelligence tools (Apache Superset, Tableau), analytic environments (Jupyter, Zeppelin), and custom software, we access the data provided by the runtime to gain insight and take action.
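
At the interface layer, an analyst might reach that same runtime from a notebook. The minimal sketch below queries PostgreSQL with pandas and SQLAlchemy and prints a small summary; the connection string, table, and column names are assumptions for illustration.

    # Minimal sketch: exploring runtime data from an analytic environment such
    # as a Jupyter notebook, using pandas and SQLAlchemy. The connection
    # string, table, and column names are assumed.
    import pandas as pd
    from sqlalchemy import create_engine

    engine = create_engine("postgresql://analyst:secret@localhost:5432/analytics")  # assumed DSN

    # Pull a modest aggregate into the notebook rather than the raw table.
    orders_by_region = pd.read_sql(
        """
        SELECT region, COUNT(*) AS orders, SUM(total) AS revenue
        FROM orders
        GROUP BY region
        ORDER BY revenue DESC
        """,
        engine,
    )

    print(orders_by_region.head())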