5 Versatile Cloud Plugins for Data Scientists


This article was written by a guest author, Najia Gul. Najia is a writer and a software engineer. She loves telling stories that are backed by data. Najia has worked with tech companies across the globe to craft and create content on AI and machine learning.

Dataiku orchestrates the entire AI lifecycle, from data access and preparation to model deployment. But the platform is also extensible through plugins, which let you extend the power of Dataiku with your own datasets, recipes, and processors!

A plugin is a reusable component that extends the scope and functionality of an application's native features. With plugins, you can add new features and upgrade your application's capabilities.

Dataiku has plenty of built-in cloud plugins available in the plugins store. Using these, you can connect to new data sources, package an algorithm as a visual interface, call REST APIs, integrate your favorite product, and much more. You can also implement your own custom components and package them as a plugin to share with others. 

There are four ways to access these cloud plugins:

  1. Use a built-in plugin from Dataiku’s plugins store.
  2. Create a custom plugin within Dataiku.
  3. Create a custom plugin and upload it as a ZIP file.
  4. Use third-party plugins from a Git repository.

The most useful plugins for a data scientist depend on their specific use case. However, some cloud plugins are particularly beneficial for working with data because of their usefulness, relatability, uniqueness, and impact. 

In this article, we’ll highlight five of the top cloud plugins that are available in the Dataiku plugin store and explore how their versatility and usability make handling your data easier.


1. API Connect

Whether you are a data scientist or a developer, chances are you'll be working with APIs quite often. While Dataiku has a comprehensive set of native capabilities for building, deploying, and managing APIs, sometimes you might want to call a custom REST API from within your project and use the response as a dataset.

If you want to pull data from other applications into Dataiku or make calls to a REST API, the API Connect plugin is your friend. It's versatile and lets you connect to any service that exposes a REST API.

API Connect lets you connect to and communicate with applications. If the target application’s API requires authentication, the API Connect plugin supports two modes: basic access and tokens. With basic access, you authenticate using your username and password. In the case of tokens, you can use an API key or a bearer token and insert it either in query parameters or the headers of each API call.
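To make the authentication modes above concrete, here is a minimal Python sketch of what each one looks like at the HTTP level. It is not the plugin's own code or configuration: the function names, the `X-API-Key` header, the `api_key` parameter name, and the example URL are all assumptions chosen for illustration, and the name a real service expects will differ.

```python
import base64
import urllib.parse
import urllib.request


def basic_auth_request(url, username, password):
    """Basic access: username and password travel in the Authorization header."""
    credentials = f"{username}:{password}".encode("utf-8")
    encoded = base64.b64encode(credentials).decode("ascii")
    return urllib.request.Request(url, headers={"Authorization": f"Basic {encoded}"})


def bearer_token_request(url, token):
    """Token mode: a bearer token sent in the Authorization header."""
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def api_key_header_request(url, api_key, header_name="X-API-Key"):
    """Token mode: an API key sent in a header (the header name is hypothetical)."""
    return urllib.request.Request(url, headers={header_name: api_key})


def api_key_query_request(url, api_key, param_name="api_key"):
    """Token mode: an API key appended to the query parameters of the URL."""
    parts = urllib.parse.urlsplit(url)
    query = urllib.parse.parse_qsl(parts.query, keep_blank_values=True)
    query.append((param_name, api_key))
    new_url = urllib.parse.urlunsplit(
        parts._replace(query=urllib.parse.urlencode(query))
    )
    return urllib.request.Request(new_url)


# Hypothetical endpoint; the requests are only built here, never sent.
request = api_key_query_request("https://api.example.com/v1/items?page=2", "abc123")
print(request.full_url)
```

Each helper only builds a request object, so the difference between the modes is easy to inspect before anything goes over the network. Keys placed in query parameters tend to end up in server logs, which is one reason to prefer headers when the target API allows both.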

Suppose you’re working to collect data about the public response to a topic or product. You might choose to consult a social media channel, like Twitter, and collect data to perform a sentiment analysis. In this scenario, you can use the API Connect plugin to access remote services, such as the Twitter API, and pull relevant data you can later analyze.

2. Google Drive and Google Sheets

As a data scientist, you’ll sometimes have to work with enormous datasets. If your data is stored in Google Drive, you can connect it to Dataiku with a single plugin. The Google Drive plugin provides a connector to read and write data from Google Drive. You can connect to Google Drive to review, write, and even store information about your data.

Sometimes gigabytes or even terabytes of data reside in a Google Drive account. Bringing all of this data into the workspace at once can exhaust available memory. To avoid out-of-memory errors and slowdowns, you can divide the dataset into subsets and load one subset at a time. The Google Drive plugin supports this workflow by making read and write operations convenient.
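The idea of loading one subset at a time can be sketched in plain Python. This is a generic illustration, not the Google Drive plugin's API: the helper names `iter_subsets` and `column_total` are hypothetical, and an in-memory CSV stands in for a file read from Drive.

```python
import csv
import io
from itertools import islice


def iter_subsets(rows, subset_size):
    """Yield successive lists of at most `subset_size` rows from any iterable."""
    if subset_size < 1:
        raise ValueError("subset_size must be at least 1")
    iterator = iter(rows)
    while True:
        subset = list(islice(iterator, subset_size))
        if not subset:
            return
        yield subset


def column_total(csv_file, column, subset_size=1000):
    """Sum a numeric column while holding only one subset of rows in memory."""
    reader = csv.DictReader(csv_file)
    total = 0.0
    for subset in iter_subsets(reader, subset_size):
        total += sum(float(row[column]) for row in subset)
    return total


# A small in-memory CSV stands in for a much larger file.
sample = io.StringIO("id,amount\n1,10.5\n2,4.5\n3,5.0\n")
print(column_total(sample, "amount", subset_size=2))  # prints 20.0
```

Because the reader is consumed lazily, memory use is bounded by `subset_size` rather than by the size of the file. The same pattern applies to any aggregation that can be computed incrementally.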

To access your Google Drive documents, you’ll need to create a new project on the Google Cloud Platform (GCP). From there, you can either create a service account and share your documents with it, or access the documents directly using Google’s single sign-on.

You can also use the Google Sheets plugin as a seamless way to interact with Google Sheets, a tool commonly used by data scientists. If you’re working with a spreadsheet that’s frequently updated by collaborators, uploading the updated version of the data every time you work can become inconvenient. With the Google Sheets plugin, you can connect to the data source and pull the latest version of the spreadsheet at any time. 

The Google Sheets plugin also requires that you have a GCP account and access to a service account. Once connected, you can load data that resides in Google Sheets and configure your dataset in Dataiku. If you are running Dataiku in GCP, Dataiku natively supports Google BigQuery, Google Dataproc, Cloud SQL, and Cloud Storage, as well as Google's vision and natural language services.

3. Amazon Rekognition Plugin

The Amazon Rekognition plugin lets you use the Amazon Rekognition API within Dataiku. If you are running Dataiku on an Amazon EC2 instance, Amazon Rekognition is supported natively. Either way, the integration lets you add sophisticated computer vision tasks, such as object detection, to your Dataiku flow.

With the Amazon Rekognition plugin, you can detect objects in images, obtain labels, and draw bounding boxes. You can also detect text and unsafe content in images. To get started, you’ll need to set up an IAM user profile in your AWS account with full admin rights. 

Say, for example, you’re working on a computer vision project with a dataset of cat and dog images in JPG or PNG format. You can create an Amazon Rekognition recipe that detects the cats and dogs and draws a bounding box around each one. (Recipes are instructions that create new datasets by performing transformations on existing datasets.) The API takes care of preprocessing the image, detecting objects, and post-processing the bounding boxes without you needing to write a single line of code.
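The plugin handles the drawing for you, but it can help to see what a bounding box looks like in the underlying data. Amazon Rekognition's label detection returns each box as ratios of the image size (`Left`, `Top`, `Width`, `Height`), so drawing it means converting to pixels. The sketch below assumes a response shaped like that of the `DetectLabels` operation. The sample response is hand-written for illustration, the helper names are hypothetical, and the columns the Dataiku plugin actually outputs may be organized differently.

```python
def to_pixel_box(bounding_box, image_width, image_height):
    """Convert a ratio-based box into (left, top, right, bottom) pixel coordinates."""
    left = round(bounding_box["Left"] * image_width)
    top = round(bounding_box["Top"] * image_height)
    right = round((bounding_box["Left"] + bounding_box["Width"]) * image_width)
    bottom = round((bounding_box["Top"] + bounding_box["Height"]) * image_height)
    return (left, top, right, bottom)


def boxes_for_labels(response, wanted_labels, image_width, image_height):
    """Collect (label name, pixel box) pairs for the labels of interest."""
    boxes = []
    for label in response.get("Labels", []):
        if label["Name"] not in wanted_labels:
            continue
        for instance in label.get("Instances", []):
            box = to_pixel_box(instance["BoundingBox"], image_width, image_height)
            boxes.append((label["Name"], box))
    return boxes


# Hand-written sample, not real API output.
sample_response = {
    "Labels": [
        {
            "Name": "Dog",
            "Confidence": 98.1,
            "Instances": [
                {
                    "BoundingBox": {"Width": 0.5, "Height": 0.25, "Left": 0.25, "Top": 0.5},
                    "Confidence": 98.1,
                }
            ],
        },
        # Some labels describe the whole image and carry no instances.
        {"Name": "Animal", "Confidence": 99.0, "Instances": []},
    ]
}

print(boxes_for_labels(sample_response, {"Cat", "Dog"}, 800, 600))
# prints [('Dog', (200, 300, 600, 450))]
```

The resulting tuples follow the (left, top, right, bottom) convention that most imaging libraries accept for drawing rectangles.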

When running on AWS, Dataiku features native support for additional AWS services, including S3, Redshift, Glue, Athena, and computer vision and text analytics with Amazon Comprehend and Comprehend Medical. You can build, deploy, and manage SageMaker models and endpoints collaboratively from Dataiku.

4. Jira

If you use Jira for organizing and keeping track of projects, the Jira plugin is a useful addition to your toolbox. It brings all your data residing in Jira inside Dataiku and provides a connector and recipes to read tickets, organizations, issues, sprints, and backlogs from Jira Core, Jira Software, and Jira Service Desk. Together, the Jira plugin and Dataiku give you lots of flexibility in how you use your data.

For instance, say you want to access backlogs from the Service Desk in Jira to analyze how optimizing resource allocation could improve the operational efficiency of an organization. You can create a recipe in Dataiku, select Service Desk as the Jira service, and pass the list of IDs that need to be processed as a parameter. The Jira plugin makes it easier for you to work with data imports. Data import plugins are especially helpful when you want to connect Dataiku to an application that doesn’t have native read/write support.   

5. Crowlingo Multilingual NLP

The amount of digital data continues to grow exponentially. The Crowlingo Multilingual NLP plugin brings Crowlingo’s NLP APIs to your fingertips with a direct pipeline for your data. It saves you a ton of work by creating and fine-tuning NLP models directly from Dataiku. You can find insights and relationships in text data in more than 100 languages without any translation or additional model training.

Suppose you’re classifying multilingual text extracted from a social media application to detect hate speech. You want to ensure that the content is safe and unbiased for the users. Using the Crowlingo Multilingual NLP plugin, you can detect toxic or sensitive comments in more than 100 languages without needing additional data. You don’t need to create a custom NLP pipeline for preprocessing the text. Without the plugin, you would need to build a solution for detecting hate speech in multiple languages from scratch. The NLP plugin reduces your workload, letting you focus on other important tasks.

Plugins Are Powerful

Plugins are a great way to upgrade the functionality of an application without changing the code. Dataiku offers a catalog of plugins and connectors that covers tasks ranging from translation and transcription to more data-forward work, like creating graphs and charts or refining ML models, all to help you collect and process your data.

Any user can view the plugin store and list of installed plugins and use any plugins installed on the Dataiku instance. To install plugins, you’ll need system administrator privileges. Users with the “Develop plugins” permission can contribute to the development of plugins that can be packaged, shared, and reused by others.
