Building a serverless data quality and analysis framework with Deequ and AWS Glue

With ever-increasing amounts of data at their disposal, large organizations struggle to cope with not only the volume but also the quality of the data they manage. Indeed, alongside volume and velocity, veracity is an equally critical issue in data analysis, often seen as a precondition to analyzing data and guaranteeing its value.

Harness the power of your data with AWS Analytics

2020 has reminded us of the need to be agile in the face of constant and sudden change. Every customer I’ve spoken to this year has had to do things differently because of the pandemic. Some are focusing on driving greater efficiency in their operations and others are experiencing a massive amount of growth.

Bringing machine learning to more builders through databases and analytics services

Machine learning (ML) is becoming more mainstream, but even with the increasing adoption, it’s still in its infancy. For ML to have the broad impact that we think it can have, it has to get easier to do and easier to apply. We launched Amazon SageMaker in 2017 to remove the challenges from each stage of the ML process, making it radically easier and faster for everyday developers and data scientists to build, train, and deploy ML models.

The New Business Models (and Jobs) in Blockchain

From finance to smart cities, distributed ledger technology is beginning to deliver on its vaunted potential in several key sectors.

Given the Bitcoin price craze in the face of the morose economy during the Covid-19 pandemic, one may assume that the distributed ledger technology (DLT)/blockchain bubble is ready to burst once again. However, new developments justify paying close attention to this sector.

Building a scalable streaming data processor with Amazon Kinesis Data Streams on AWS Fargate

Data is ubiquitous in businesses today, and the volume and speed of incoming data are constantly increasing. To derive insights from data, it’s essential to deliver it to a data lake or a data store and analyze it. Real-time or near-real-time data delivery can be cost prohibitive, so an efficient architecture is key for processing and becomes more essential as data volume and velocity grow.

Orchestrating analytics jobs by running Amazon EMR Notebooks programmatically

Amazon EMR is a big data service offered by AWS to run Apache Spark and other open-source applications on AWS in a cost-effective manner. Amazon EMR Notebooks is a managed environment based on Jupyter Notebook that allows data scientists, analysts, and developers to prepare and visualize data, collaborate with peers, build applications, and perform interactive analysis using EMR clusters.

Optimizing Spark applications with workload partitioning in AWS Glue

AWS Glue provides a serverless environment to prepare (extract and transform) and load large datasets from a variety of sources for analytics and data processing with Apache Spark ETL jobs. This post discusses a new AWS Glue Spark runtime optimization that helps developers of Apache Spark applications and ETL jobs, big data architects, data engineers, and business analysts automatically scale their data processing and batch jobs running on AWS Glue.
