How can I read an Azure Blob Storage file directly from an Azure Databricks notebook?

Time:07-06

When I run the code below locally, it works, but when I run it inside Azure Databricks it hangs indefinitely. I know that the endpoint and SAS token are correct because the same code works locally; it fails only when I run it directly from an Azure Databricks notebook. Any ideas?

import com.azure.storage.blob.BlobClientBuilder
import java.io.InputStream

val input: InputStream = new BlobClientBuilder()
      .endpoint(s"https://<storage-account>.blob.core.windows.net")
      .sasToken("<sas-token>")
      .containerName("<container-name>")
      .blobName("<blob-name>")
      .buildClient()
      .openInputStream()

CodePudding user response:

Check whether Secure transfer is enabled. In the Azure portal, open the storage account, go to Settings -> Configuration, and find the Secure transfer setting. If it is not enabled, enable it. Secure transfer protects your storage account by only allowing requests that are made over a secure connection.

Are there any alternative options available?

Yes, there are alternative ways to read an Azure Blob Storage file directly from an Azure Databricks notebook.
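
One such alternative, which this answer does not spell out, is to let Spark read the blob through the legacy WASB driver with the SAS token set in the Spark configuration. The sketch below is a minimal illustration under that assumption, not a confirmed fix for the hang. `WasbsSas`, `sasConfKey`, and `blobUrl` are hypothetical helper names, and `spark` is the session that a Databricks notebook provides.

```scala
// Sketch only: the object and method names here are hypothetical helpers.
object WasbsSas {
  /** Hadoop configuration key under which the WASB driver looks up a container-level SAS token. */
  def sasConfKey(account: String, container: String): String =
    s"fs.azure.sas.$container.$account.blob.core.windows.net"

  /** wasbs:// URL for a blob path inside a container (a leading slash is dropped). */
  def blobUrl(account: String, container: String, blobPath: String): String =
    s"wasbs://$container@$account.blob.core.windows.net/${blobPath.stripPrefix("/")}"
}

// In a notebook cell (needs a Spark session, so it is shown as comments here):
// spark.conf.set(WasbsSas.sasConfKey("<storage-account>", "<container-name>"), "<sas-token>")
// val df = spark.read.text(WasbsSas.blobUrl("<storage-account>", "<container-name>", "<blob-name>"))
```

Because this path goes through Spark's own storage connector rather than the Azure SDK, it does not depend on which version of azure-core is on the cluster classpath.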

CodePudding user response:

I solved this by using shaded JARs (https://maven.apache.org/plugins/maven-shade-plugin/) in my app. This example helped me walk through the setup: https://github.com/anuchandy/azure-sdk-in-data-bricks. See below for an updated example. I can now prefix my import with the shaded package prefix that I configured in the shade plugin in my POM, so my code in Databricks knows exactly which dependency to use when reading from Blob Storage.

import <MY.GROUP.ID>.shaded.com.azure.storage.blob.BlobClientBuilder
import java.io.InputStream

val input: InputStream = new BlobClientBuilder()
      .endpoint(s"https://<storage-account>.blob.core.windows.net")
      .sasToken("<sas-token>")
      .containerName("<container-name>")
      .blobName("<blob-name>")
      .buildClient()
      .openInputStream()
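
To confirm which JAR a class is actually loaded from on the cluster, a small check like the following may help. This is a sketch that uses only standard JVM reflection; `ClassOrigin` is a hypothetical helper name and is not part of the Azure SDK or Databricks.

```scala
// Sketch only: ClassOrigin is a hypothetical helper.
object ClassOrigin {
  /** Returns the location (JAR or directory) a class was loaded from, if the JVM reports one. */
  def of(cls: Class[_]): Option[String] =
    for {
      domain   <- Option(cls.getProtectionDomain)
      source   <- Option(domain.getCodeSource) // null for classes from the bootstrap loader
      location <- Option(source.getLocation)
    } yield location.toString
}

// In a notebook cell, after the shaded import above:
// println(ClassOrigin.of(classOf[BlobClientBuilder]))
```

If the printed location is your shaded JAR rather than a JAR bundled with the Databricks runtime, the relocation is being picked up.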

Azure Blob Storage Dependency:

<dependency>
    <groupId>com.azure</groupId>
    <artifactId>azure-storage-blob</artifactId>
    <version>12.14.0</version>
</dependency>

Maven Shade Plugin:

    <plugin>
        <groupId>org.apache.maven.plugins</groupId>
        <artifactId>maven-shade-plugin</artifactId>
        <version>3.2.4</version>
        <executions>
            <execution>
                <phase>package</phase>
                <goals>
                    <goal>shade</goal>
                </goals>
                <configuration>
                    <minimizeJar>true</minimizeJar>
                    <filters>
                        <filter>
                            <artifact>*:*</artifact>
                            <excludes>
                                <exclude>META-INF/*.SF</exclude>
                                <exclude>META-INF/*.DSA</exclude>
                                <exclude>META-INF/*.RSA</exclude>
                            </excludes>
                        </filter>
                    </filters>
                    <transformers>
                        <transformer implementation="org.apache.maven.plugins.shade.resource.ManifestResourceTransformer">
                            <mainClass><MY.MAIN.CLASS></mainClass>
                        </transformer>
                        <!--Transforms META-INF/services (essential for azure-core relocation)-->
                        <transformer implementation="org.apache.maven.plugins.shade.resource.ServicesResourceTransformer"/>
                    </transformers>
                    <relocations>
                        <relocation>
                            <pattern>com.fasterxml.jackson</pattern>
                            <shadedPattern>${project.groupId}.shaded.com.fasterxml.jackson</shadedPattern>
                        </relocation>
                        <!--In Databricks 10.2 you may also need to relocate the Reactor and Netty classes-->
                        <relocation>
                            <pattern>io.netty</pattern>
                            <shadedPattern>${project.groupId}.shaded.io.netty</shadedPattern>
                        </relocation>
                        <relocation>
                            <pattern>reactor</pattern>
                            <shadedPattern>${project.groupId}.shaded.reactor</shadedPattern>
                        </relocation>
                        <relocation>
                            <!--Databricks brings its own version of azure-core which may be incompatible with blob storage version. Relocate azure-core so we don't collide with it-->
                            <pattern>com.azure</pattern>
                            <shadedPattern>${project.groupId}.shaded.com.azure</shadedPattern>
                        </relocation>
                    </relocations>
                </configuration>
            </execution>
        </executions>
    </plugin>
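
After building, it can be worth checking that the relocations were actually applied to the JAR. The sketch below lists class entries that still sit under the original package paths. `ShadeCheck` and its methods are hypothetical helpers, and the JAR path in the usage comment is a placeholder.

```scala
import java.util.jar.JarFile
import scala.collection.mutable.ListBuffer

// Sketch only: ShadeCheck is a hypothetical helper.
object ShadeCheck {
  /** Class entries that still start with one of the original (unrelocated) path prefixes. */
  def unrelocated(entryNames: Seq[String], prefixes: Seq[String]): Seq[String] =
    entryNames.filter(n => n.endsWith(".class") && prefixes.exists(p => n.startsWith(p)))

  /** Names of all entries in the JAR at the given path. */
  def entriesOf(jarPath: String): Seq[String] = {
    val jar = new JarFile(jarPath)
    try {
      val names = ListBuffer.empty[String]
      val entries = jar.entries()
      while (entries.hasMoreElements) names += entries.nextElement().getName
      names.toList
    } finally jar.close()
  }
}

// Usage (the path is a placeholder):
// val prefixes = Seq("com/azure/", "com/fasterxml/jackson/", "io/netty/", "reactor/")
// ShadeCheck.unrelocated(ShadeCheck.entriesOf("<path-to-shaded-jar>"), prefixes).foreach(println)
```

An empty result suggests that every class under those packages was moved beneath the shaded prefix.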