databricks - Use the Databricks API the Ruby way
Description
This Rubygem gives you access to the Databricks REST API, the simple Ruby way.
Requirements
databricks only needs Ruby to run.
Install
Via gem
$ gem install databricks
If using bundler, add this to your Gemfile:
gem 'databricks'
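Then run:
bundle install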
Usage
The API is organized around a hierarchy of resources mirroring the official Databricks REST API documentation.
It is accessed using the Databricks.api method, which takes both the host to connect to and an API token.
For example, to list the root path of an instance's DBFS storage:
require 'databricks'
databricks = Databricks.api('https://my_databricks_instance.my_domain.com', '123456789abcdef123456789abcdef')
databricks.dbfs.list('/').each do |file|
puts "Found DBFS file: #{file.path}"
end
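The resource hierarchy mirrors the REST endpoints: databricks.dbfs.list('/') corresponds to the official GET /api/2.0/dbfs/list call. As an illustrative sketch only (this is not the gem's internals, just the equivalent raw request against the documented REST API, reusing the hypothetical host and token from above):
require 'net/http'
require 'json'
require 'uri'

# Equivalent raw REST call to databricks.dbfs.list('/'): GET 2.0/dbfs/list
uri = URI('https://my_databricks_instance.my_domain.com/api/2.0/dbfs/list?path=/')
request = Net::HTTP::Get.new(uri)
# Databricks API tokens are sent as Bearer tokens
request['Authorization'] = 'Bearer 123456789abcdef123456789abcdef'
response = Net::HTTP.start(uri.host, uri.port, use_ssl: true) { |http| http.request(request) }
# The response contains a 'files' array of { path, is_dir, file_size } entries
JSON.parse(response.body)['files'].each { |file| puts "Found DBFS file: #{file['path']}" }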
Here is a code snippet showing the most common uses of the API.
require 'databricks'
databricks = Databricks.api('https://my_databricks_instance.my_domain.com', '123456789abcdef123456789abcdef')
# ===== DBFS
databricks.dbfs.list('/').each do |file|
puts "Found DBFS file: #{file.path} (size: #{file.file_size})"
puts 'It is a directory' if file.is_dir
end
databricks.dbfs.put('/dbfs_path/to/file.txt', 'local_file.txt')
# read returns the raw API response; its 'data' field is Base64-encoded
puts databricks.dbfs.read('/dbfs_path/to/file.txt')['data']
databricks.dbfs.delete('/dbfs_path/to/file.txt')
# ===== Clusters
databricks.clusters.list.each do |cluster|
puts "Found cluster named #{cluster.cluster_name} with id #{cluster.cluster_id} using Spark #{cluster.spark_version} in state #{cluster.state}"
end
cluster = databricks.clusters.get('my-cluster-id')
new_cluster = databricks.clusters.create(
cluster_name: 'my-test-cluster',
spark_version: '7.1.x-scala2.12',
node_type_id: 'Standard_DS3_v2',
driver_node_type_id: 'Standard_DS3_v2',
num_workers: 1,
creator_user_name: 'me@my_domain.com'
)
new_cluster.edit(num_workers: 2)
new_cluster.delete
# ===== Jobs
databricks.jobs.list.each do |job|
puts "Found job #{job.name} with id #{job.job_id}"
end
new_job = databricks.jobs.create(
name: 'My new job',
new_cluster: {
spark_version: '7.3.x-scala2.12',
node_type_id: 'r3.xlarge',
num_workers: 10
},
libraries: [
{
jar: 'dbfs:/my-jar.jar'
}
],
timeout_seconds: 3600,
spark_jar_task: {
main_class_name: 'com.databricks.ComputeModels'
}
)
puts "Job created with id #{new_job.job_id}"
new_job.reset(
new_cluster: {
spark_version: '7.3.x-scala2.12',
node_type_id: 'r3.xlarge',
num_workers: 10
},
libraries: [
{
jar: 'dbfs:/my-jar.jar'
}
],
timeout_seconds: 3600,
spark_jar_task: {
main_class_name: 'com.databricks.ComputeModels'
}
)
new_job.delete
# Get a job from its job_id
found_job = databricks.jobs.get(666)
# ===== Instance pools
databricks.instance_pools.list.each do |instance_pool|
puts "Found instance pool named #{instance_pool.instance_pool_name} with id #{instance_pool.instance_pool_id} and max capacity #{instance_pool.max_capacity}"
end
instance_pool = databricks.instance_pools.get('my-instance-pool-id')
new_instance_pool = databricks.instance_pools.create(
instance_pool_name: 'my-pool',
node_type_id: 'i3.xlarge',
min_idle_instances: 10
)
new_instance_pool.edit(min_idle_instances: 5)
new_instance_pool.delete
# Get an instance pool from its instance_pool_id
found_pool = databricks.instance_pools.get('my-pool-id')
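Since the properties given to create and edit appear to be forwarded as-is to the corresponding REST endpoints, other attributes documented in the official Databricks API should work the same way. Here is a hedged sketch (untested; instance_pool_id comes from the official clusters/create documentation, not from this gem's examples) that creates a cluster backed by an instance pool:
# Hypothetical sketch: run a cluster on an instance pool.
# Assumes the gem forwards instance_pool_id verbatim to the clusters/create
# REST endpoint, as it does for the other properties shown above.
pool = databricks.instance_pools.create(
  instance_pool_name: 'shared-pool',
  node_type_id: 'i3.xlarge',
  min_idle_instances: 2
)
pooled_cluster = databricks.clusters.create(
  cluster_name: 'pooled-cluster',
  spark_version: '7.3.x-scala2.12',
  instance_pool_id: pool.instance_pool_id,
  num_workers: 2
)
puts "Created cluster #{pooled_cluster.cluster_id} on pool #{pool.instance_pool_id}"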
Change log
Please see CHANGELOG for more information on what has changed recently.
Testing
Automated tests are written using RSpec.
To execute them, first install development dependencies:
bundle install
Then execute rspec:
bundle exec rspec
Contributing
Any contribution is welcome:
- Fork the GitHub project and create pull requests.
- Report bugs by creating tickets.
- Suggest improvements and new features by creating tickets.
Credits
License
The BSD License. Please see the LICENSE file for more information.