During these uncertain times, as most of us are staying inside, it’s a great time to try and learn new things. Last week we noticed a few posts on LinkedIn by professionals trying out Databricks and sharing their first impressions. That sparked an idea to write a short list of 20 rules as a helper for everyone who decides to use Databricks in production. We are two data engineers (Robertas Sys and me, Eimantas Jazonis) who have spent years developing and improving our production-level data science platform on Azure. Our current setup is built mainly on Databricks, so we feel we have gathered enough experience to share with you all.
Why Databricks?
Let’s start by saying that Databricks is awesome. For those who don’t know - Databricks is a product developed by the original Apache Spark creators. It’s a Jupyter-style collaborative environment with full Apache Spark support. There is a lot to say about it - you can read up on it on their official website here, or you can attend the recently announced yearly Spark + AI Summit 2020, which will be held online free of charge (thank you, no thank you, Covid-19).
However, fine product or not, there are things to keep in mind, especially when going into production. And that’s where this rule set comes into play. We gathered 20 rules to live by when running Databricks in production on Azure, and we hope that after reading this your trip will be smooth sailing.
The Rules
Rule 1 - Minimum virtual network and subnet size
When setting up a Databricks resource, if you want to deploy it into your own virtual network, the CIDR range needs to be at least /26 (you also need two ranges - one for the public subnet and one for the private subnet).
Rule 2 - You should consider using a separate workspace per team/unit
Databricks uses a shared data store called DBFS, and you cannot control user access to it. This means that every user who has access to the workspace can also view and modify the stored files. This is obviously not the right way to handle sensitive data.
Rule 3 - Store secrets in key-vault backed scopes, not in code
Look, everyone knows that storing plaintext passwords is not a great idea. Yet everyone does it at some point. When going to production you should use Azure Key Vault, which is a very convenient and easy-to-use Azure product, and back your Databricks secret scopes with it. To separate permissions, it’s good practice to create a key vault for every role.
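Once a key-vault backed scope is in place, secrets are read through dbutils. A minimal sketch - the scope and key names below are made up, use your own:
# Read a secret from an Azure Key Vault-backed secret scope.
jdbc_password = dbutils.secrets.get(scope="data-platform-kv", key="sql-dw-password")
# Databricks redacts the value if you try to print it, which is exactly the point.
print(jdbc_password)  # [REDACTED]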
Rule 4 - CI/CD is not straightforward (but when was it?)
In our current setup we use GitHub with Azure DevOps integration. Although not fully intuitive, Databricks supports branching, committing and creating pull requests straight from the notebook UI. Using the Databricks API we can copy files from /dev and put them in the /prod location from which our daily pipeline runs.
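To give an idea of what that copy step can look like, here is a rough sketch using the Workspace API (the workspace URL, secret names and notebook paths are illustrative, not our actual setup):
import requests

HOST = "https://<your-region>.azuredatabricks.net"                  # illustrative workspace URL
TOKEN = dbutils.secrets.get(scope="devops", key="databricks-pat")   # see rule 3
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

def promote_notebook(src_path, dst_path):
    """Copy a notebook from the dev folder to the prod folder."""
    # Export the notebook source from /dev (content comes back base64 encoded).
    exported = requests.get(
        f"{HOST}/api/2.0/workspace/export",
        headers=HEADERS,
        params={"path": src_path, "format": "SOURCE"},
    ).json()

    # Import it into /prod, overwriting the previous version.
    requests.post(
        f"{HOST}/api/2.0/workspace/import",
        headers=HEADERS,
        json={
            "path": dst_path,
            "format": "SOURCE",
            "language": "PYTHON",
            "content": exported["content"],
            "overwrite": True,
        },
    ).raise_for_status()

promote_notebook("/dev/daily_pipeline", "/prod/daily_pipeline")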
Rule 5 - Learn to use functional programming paradigms
You can live without it, but to make your code sparkling clean (SPARKling - see what we did there?) you should incorporate it into your codebase. Here’s a very useful post on how to use custom PySpark dataframe transformations. A small code snippet from the linked post:
#### This:
final_df = df.transform(with_function) \
    .transform(with_another_function)
#### Looks way better than this:
partial_df = with_function(df)
final_df = with_another_function(partial_df)
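To make the snippet concrete: the chained functions are just plain functions that take a dataframe and return a dataframe. A hypothetical sketch (with_function and with_another_function here are made up; it assumes a runtime where DataFrame.transform is available, i.e. Spark 3.0+ - on older versions the linked post shows how to add it with a small monkey patch):
from pyspark.sql import DataFrame
from pyspark.sql import functions as F

def with_function(df: DataFrame) -> DataFrame:
    # Add a load date column.
    return df.withColumn("load_date", F.current_date())

def with_another_function(df: DataFrame) -> DataFrame:
    # Tag the rows with their source system.
    return df.withColumn("source_system", F.lit("databricks"))

df = spark.range(5)   # any existing dataframe would do
final_df = df.transform(with_function) \
    .transform(with_another_function)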
Rule 6 - Avoid using multiple languages
We learned this the hard way. You need to put in some work to manage one language and all its dependencies (helpers, readers, writers, libraries, etc.). Imagine doing this for three languages at once (R, Python, Scala). Moreover, in our opinion, avoid R altogether. It’s a great statistical language, but using it with Spark is difficult and not worth your effort.
Rule 7 - Avoid using the native API directly (spark.read, df.write)
By creating wrapper libraries we can encapsulate connection logic while maintaining metadata, logging, tracing and user access. In our production environment we implemented the following structure:
Our main reader/writer class works against an interface which holds the storage-specific logic. This way we can achieve all the features mentioned in this rule.
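A rough sketch of what such a wrapper can look like (class and method names are illustrative, not our actual library):
from abc import ABC, abstractmethod
from pyspark.sql import DataFrame

class StorageConnector(ABC):
    """Interface holding the storage-specific read/write logic."""

    @abstractmethod
    def read(self, path: str) -> DataFrame:
        ...

    @abstractmethod
    def write(self, df: DataFrame, path: str) -> None:
        ...

class DeltaConnector(StorageConnector):
    def read(self, path: str) -> DataFrame:
        return spark.read.format("delta").load(path)

    def write(self, df: DataFrame, path: str) -> None:
        df.write.format("delta").mode("overwrite").save(path)

class DataIO:
    """Main reader/writer class - wraps a connector and adds logging/metadata hooks."""

    def __init__(self, connector: StorageConnector):
        self.connector = connector

    def read(self, path: str) -> DataFrame:
        print(f"[DataIO] reading {path}")    # replace with real logging/tracing
        return self.connector.read(path)

    def write(self, df: DataFrame, path: str) -> None:
        print(f"[DataIO] writing {path}")    # replace with real logging/metadata
        self.connector.write(df, path)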
Rule 8 - Try to avoid Pandas
There is no hard reason why you can’t use Pandas in Databricks. The question is: why would you? Anything you might want to do with your data you can do with Spark and its distributed computing power. We’re not trying to start a flame war in the comment section - it’s fine to use Pandas for small exploration tasks, but when it comes to big data, stick to Spark.
Rule 9 - Use job clusters for production pipelines to reduce costs up to 50%
If the number 50% seems to come from our imagination - it doesn’t. Just a few weeks back we walked through our pipeline and managed to shave more than $4,000 per month off our pipeline run cost. Now that’s impressive! One of the main cost reduction methods was migrating our daily pipeline from interactive to job clusters. Spinning up multiple parallel clusters with hundreds of CPUs in total can seem counterintuitive, but at the end of the day the numbers speak for themselves.
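For reference, this is roughly what scheduling a notebook on a job cluster through the Jobs API looks like (the cluster spec, cron schedule and paths below are illustrative, not our production values):
import requests

HOST = "https://<your-region>.azuredatabricks.net"             # illustrative workspace URL
HEADERS = {"Authorization": "Bearer <personal-access-token>"}  # keep the token in a secret scope

job_spec = {
    "name": "daily-pipeline",
    "new_cluster": {                         # job cluster: created per run, terminated afterwards
        "spark_version": "6.4.x-scala2.11",  # pick the runtime your code is tested against
        "node_type_id": "Standard_DS3_v2",
        "num_workers": 8,
    },
    "notebook_task": {"notebook_path": "/prod/daily_pipeline"},
    "schedule": {"quartz_cron_expression": "0 0 5 * * ?", "timezone_id": "UTC"},
}

requests.post(f"{HOST}/api/2.0/jobs/create", headers=HEADERS, json=job_spec).raise_for_status()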
Rule 10 - Don’t use init scripts on clusters
You know how it goes - someone was a bit tired and pushed a new version of xgboost on a Friday night, and on Monday your pipeline is failing at every step. Or the time it takes to start a new cluster has doubled overnight. One possible thing to blame: a conflict between the cluster runtime version and a library. It is always a better idea to install your libraries via dbutils inside the notebook, as it saves time and effort in the long run (it also speeds up your cluster start time significantly).
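A notebook-scoped install can be as simple as this (the library and version are just an example - the point is to pin the version explicitly; newer runtimes also offer %pip install as an alternative):
# Pin the version so an upstream release can't break the pipeline overnight.
dbutils.library.installPyPI("xgboost", version="0.90")
dbutils.library.restartPython()   # make the freshly installed library importable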
Rule 11 - Keep data as close as possible to reduce latency
The rule is a bit self-explanatory - pick a data storage region as close to your Databricks workspace as possible to reduce latency.
Rule 12 - Use delta format
If you haven’t heard about the Delta format, you can read about it here or here (my blog post). It’s a technology built on top of the Parquet file format, adding ACID transactions, performance improvements and support for Delta tables. Use it, it’s awesome!
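Switching to it is mostly a matter of changing the format string (the paths below are illustrative):
# Write a dataframe as Delta and read it back.
df.write.format("delta").mode("overwrite").save("/mnt/datalake/silver/sales")
sales = spark.read.format("delta").load("/mnt/datalake/silver/sales")

# ACID transactions also give you time travel for free:
first_version = (
    spark.read.format("delta")
    .option("versionAsOf", 0)
    .load("/mnt/datalake/silver/sales")
)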
Rule 13 - Cluster runtime version matters
Once, we had a massive pipeline failure across all notebooks just because the newest library version was not supported by the cluster runtime version (the runtime version is also tied to a specific Python version). Be wary, and try to manage your libraries inside the notebooks (back to rule #10).
Rule 14 - Keep in mind data format caveats
There are some ins and outs for every data format that can be used in a data lake. For example, the Delta format persists its cache at the cluster level, which means you have to restart the cluster after each major dataframe overwrite. The ORC format, on the other hand, doesn’t support ACID transactions, so if you face a failure while saving the dataframe, the data will be corrupted and you will have to reload it completely.
Rule 15 - Avoid “push-down” calculations as much as possible
While working with an RDBMS like Azure Synapse Analytics, try to avoid push-down calculations as much as possible. Spark can handle most of the transformations much faster. For large datasets it is worth loading the data from the RDBMS to the data lake before performing any data manipulation.
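In practice that can mean a plain JDBC extract followed by the transformations in Spark (connection details, secret names and paths here are illustrative):
# Pull the raw table out of the database once...
jdbc_url = "jdbc:sqlserver://<server>.database.windows.net:1433;database=<db>"

raw = (
    spark.read.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.sales")
    .option("user", dbutils.secrets.get(scope="data-platform-kv", key="sql-user"))
    .option("password", dbutils.secrets.get(scope="data-platform-kv", key="sql-password"))
    .load()
)

# ...land it in the lake and do the heavy lifting in Spark, not in the database.
raw.write.format("delta").mode("overwrite").save("/mnt/datalake/raw/sales")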
Rule 16 - Notebooks are not Python files
For better or for worse, Databricks notebooks are not Python files and you will have to learn to deal with it. No, you can’t structure your helper libraries as Python modules - learn to work with objects and classes instead. No, you can’t pass a notebook to pycco for automated documentation, but you can still use plain pydoc via the help() function. Here is one small workaround we implemented for our data scientists to get the pythonic feeling of modules:
# Notebook that is used for importing helper functions
class DataHelper():
    def get_max_date(self):
        pass

# ----- Next cell -----
dh = DataHelper()
print("DataHelper object already initialized - use dh.[method_name]")
print("For help use help(dh)")
Easy and super simple workaround.
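In the consuming notebook you then pull those definitions into scope with the %run magic, something like this (the helper notebook path is illustrative):
# First cell of the consuming notebook (Databricks cell magic, shown here as a comment):
# %run /Shared/helpers/data_helper

# After that cell runs, the pre-initialized object is available:
max_date = dh.get_max_date()
help(dh)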
Rule 17 - Change the broadcast timeout for resource-heavy transformations
While working with large dataframes, if the job fails with a not very descriptive error, try changing the broadcast timeout value. This happens when Spark tries to do a broadcast hash join and one of the DataFrames is very large, so sending it takes a long time and the join times out.
spark.conf.set("spark.sql.broadcastTimeout", [NEW VALUE, FOR EXAMPLE 3600])
Rule 18 - Spark UI and Ganglia are your friends
You can monitor Spark job execution and cluster resources in the Spark UI and Ganglia tabs. Use them - they can save you loads of time while trying to figure out what’s happening.
Rule 19 - Don’t be surprised by occasional internal Databricks errors and slowdowns
It happens. Don’t panic. You can always create a critical support ticket if you fancy support calls at 4 AM :)
Rule 20 - People from BI department can still use Databricks (Spark SQL)
With the help of Delta tables and SQL support, even people with little to no coding skills can still use Databricks. For that purpose we created an automatic Delta table generator - whenever a new dataframe is saved to the data lake, we automatically create a Delta table on top of it. That way users can still query the data and make transformations using SQL.
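The core of that idea is a single SQL statement run right after the write - a sketch with an illustrative table name and path:
# Register a Delta table on top of a data lake path so SQL users can query it by name.
def register_delta_table(table_name, path):
    spark.sql(f"""
        CREATE TABLE IF NOT EXISTS {table_name}
        USING DELTA
        LOCATION '{path}'
    """)

register_delta_table("sales_silver", "/mnt/datalake/silver/sales")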
End word
That’s it for this time! We hope that you find this list interesting and useful in the future. Thank you for reading and see you in the next one!
Best - R and E.