Last Updated: November 16, 2017

The HDFS connector provides access to, and sharing of, data on an HDFS storage system. The connector is available as an add-on subscription to organizations with a Globus Standard subscription; please contact us for pricing.

This document describes how to install an endpoint and the HDFS connector needed to access the storage system. A system administrator should perform the installation. Once it is complete, users can use the endpoint to access the HDFS storage via Globus to transfer, share, and publish data on the system.

Prerequisites

A functional Globus Connect Server installation is required for installation and use of the HDFS connector. The server can be hosted on any machine that can connect to the HDFS storage system to be used. The Globus Connect Server Installation Guide provides detailed documentation on the steps for installing and configuring a server endpoint.

Supported Linux Distributions

The HDFS DSI is available for the following Linux distributions:

  • RHEL 7

  • CentOS 7

Supported Globus Connect Server versions

The HDFS DSI should be used with the latest version of GCS.

DSI Build

1) Download and untar Hadoop. Latest tested version is 2.8.1. Hadoop does not need special configuration for building the DSI. Make sure the untarred Hadoop is readable by the DSI build process. Record the install location as HADOOP_ROOT.

Example:

# wget http://apache.claz.org/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz
# export HADOOP_ROOT=/usr/local
# tar -zxvf hadoop-2.8.1.tar.gz -C $HADOOP_ROOT

2) Install the Java SDK and JRE. Latest tested version is 1.8.0. Record the path to the Java SDK C include files as JVM_INCLUDE_DIR. Record the path to the Java SDK C libraries as JVM_LIB_DIR.

Example:

# yum install -y jre java-1.8.0-openjdk-devel
# export JAVA_HOME=/usr/lib/jvm/jre
# echo "export JAVA_HOME=${JAVA_HOME}" > /etc/profile.d/java.sh
# export JVM_INCLUDE_DIR=/usr/lib/jvm/java/include
# export JVM_LIB_DIR=/usr/lib/jvm/java/lib

3) Install the build dependencies: gcc, git, autoconf, automake, libtool, cmake, gcc-c++, and openssl-devel (Note: this list is RHEL 7 specific).

Example:

# yum install -y gcc git autoconf automake libtool cmake gcc-c++ openssl-devel

4) Install the Globus Connect Server repo, the EPEL repo, the yum-plugin-priorities package, and the globus-gridftp-server-devel package.

Example:

# wget https://downloads.globus.org/toolkit/globus-connect-server/globus-connect-server-repo-latest.noarch.rpm
# rpm --import https://downloads.globus.org/toolkit/gt6/stable/repo/rpm/RPM-GPG-KEY-Globus
# yum install -y globus-connect-server-repo-latest.noarch.rpm
# curl -LOs https://dl.fedoraproject.org/pub/epel/epel-release-latest-7.noarch.rpm
# yum -y install epel-release-latest-7.noarch.rpm
# yum -y install yum-plugin-priorities
# yum install -y globus-gridftp-server-devel

5) Download the HDFS DSI source and build it. Latest tested version is 539bb90. Be sure to set the HDFS_INCLUDE_DIR, HDFS_LIB_DIR, JVM_INCLUDE_DIR, and JVM_LIB_DIR environment variables appropriately before running cmake.

Example:

# export HADOOP_ROOT=/usr/local
# export HDFS_INCLUDE_DIR=${HADOOP_ROOT}/hadoop-2.8.1/include
# export HDFS_LIB_DIR=${HADOOP_ROOT}/hadoop-2.8.1/lib/native
# export JVM_INCLUDE_DIR=/usr/lib/jvm/java/include
# export JVM_LIB_DIR=/usr/lib/jvm/java/lib

# git clone https://github.com/opensciencegrid/gridftp-hdfs.git
# cd gridftp-hdfs
# cmake CMakeLists.txt
# make
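
A build that fails partway through is often caused by one of the four variables pointing at the wrong directory. The following is a minimal sketch of a pre-build check, not part of the DSI itself. The function names check_dir_has and check_build_env are hypothetical, and the header names hdfs.h and jni.h are assumptions about the usual Hadoop and JDK layouts.

```shell
#!/bin/bash
# Sketch only: verify the build variables before running cmake.
# Assumption: hdfs.h lives in HDFS_INCLUDE_DIR and jni.h in JVM_INCLUDE_DIR.

# check_dir_has NAME DIR [FILE]
# Succeeds if DIR exists and, when FILE is given, contains FILE.
check_dir_has() {
    local name="$1" dir="$2" file="${3:-}"
    if [ -z "$dir" ] || [ ! -d "$dir" ]; then
        echo "$name: not a directory: '$dir'" >&2
        return 1
    fi
    if [ -n "$file" ] && [ ! -e "$dir/$file" ]; then
        echo "$name: $file not found in $dir" >&2
        return 1
    fi
    echo "$name: ok"
}

# check_build_env: run all four checks; non-zero if any of them failed.
check_build_env() {
    local rc=0
    check_dir_has HDFS_INCLUDE_DIR "${HDFS_INCLUDE_DIR:-}" hdfs.h || rc=1
    check_dir_has HDFS_LIB_DIR "${HDFS_LIB_DIR:-}" || rc=1
    check_dir_has JVM_INCLUDE_DIR "${JVM_INCLUDE_DIR:-}" jni.h || rc=1
    check_dir_has JVM_LIB_DIR "${JVM_LIB_DIR:-}" || rc=1
    return $rc
}
```

After exporting the variables, run check_build_env and continue with cmake only if it returns zero.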

6) The HDFS lib file (and symlink aliases) will now be located in the root of the build directory.

Example:

# ls -lah libglobus_gridftp_server_hdfs.*
lrwxrwxrwx 1 root root  34 Sep 20 15:30 libglobus_gridftp_server_hdfs.so -> libglobus_gridftp_server_hdfs.so.0
lrwxrwxrwx 1 root root  38 Sep 20 15:30 libglobus_gridftp_server_hdfs.so.0 -> libglobus_gridftp_server_hdfs.so.0.0.1
-rwxr-xr-x 1 root root 96K Sep 20 15:30 libglobus_gridftp_server_hdfs.so.0.0.1

DSI Installation and Configuration

The GridFTP service using the HDFS DSI must be installed on a preconfigured HDFS client node. Steps 1-2 below describe the minimal configuration for the HDFS client node. The HDFS client, name, and data nodes must all have access to the same user account information, including group membership. The DSI need not be built on the same system on which it is installed. The instructions below detail what is needed to install the DSI. If the build system and the install system are the same, some of the instructions below repeat those given in the DSI Build section.

1) Download and untar Hadoop. Latest tested version is 2.8.1. Make sure the untarred Hadoop is readable by the GridFTP server process. Record the install location as HADOOP_ROOT.

Example:

# wget http://apache.claz.org/hadoop/common/hadoop-2.8.1/hadoop-2.8.1.tar.gz
# export HADOOP_ROOT=/usr/local
# tar -zxvf hadoop-2.8.1.tar.gz -C $HADOOP_ROOT

2) Record the network location of the HDFS Name Node Service as defaultFS in $HADOOP_ROOT/hadoop-2.8.1/etc/hadoop/core-site.xml. The HDFS DSI only supports a single HDFS filesystem.

Example $HADOOP_ROOT/hadoop-2.8.1/etc/hadoop/core-site.xml config:

<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://${HDFS_NAMENODE}:${NAMENODE_PORT}/</value>
</property>
</configuration>

${HDFS_NAMENODE} and ${NAMENODE_PORT} should be replaced with the proper values as appropriate for your HDFS system.
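
The file can also be generated from the shell so that the two placeholders are substituted consistently. This is a minimal sketch: write_core_site is a hypothetical helper, and the host name and port in the usage note are placeholder values, not values from this guide. Note that it overwrites the target file.

```shell
#!/bin/bash
# Sketch only: render a minimal core-site.xml for a single NameNode.
# write_core_site is a hypothetical helper name.

# write_core_site NAMENODE_HOST NAMENODE_PORT OUTPUT_FILE
write_core_site() {
    local host="$1" port="$2" out="$3"
    # Unquoted here-document delimiter, so ${host} and ${port} are expanded.
    cat > "$out" <<EOF
<configuration>
<property>
<name>fs.defaultFS</name>
<value>hdfs://${host}:${port}/</value>
</property>
</configuration>
EOF
}
```

For example, write_core_site namenode.example.org 8020 "$HADOOP_ROOT/hadoop-2.8.1/etc/hadoop/core-site.xml" writes the configuration shown above with the placeholders filled in.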

3) Install the Java runtime (JRE). Latest tested version is 1.8.0. Record the path to the JRE as JAVA_HOME.

Example:

# yum install -y jre
# export JAVA_HOME=/usr/lib/jvm/jre
# echo "export JAVA_HOME=${JAVA_HOME}" > /etc/profile.d/java.sh

4) Install and configure GCSv4 as described in the Globus Connect Server Installation Guide. Before proceeding, confirm that your base GCS installation is functional; see the Basic Endpoint Functionality Test section at the end of this document.

5) Install the HDFS DSI libraries into their final location (for example, /usr/local/lib64). These library files are the ones discussed in Step 6 of the DSI Build section. Record the DSI location as DSI_ROOT.

Example:

# ls -lah ./gridftp-hdfs/libglobus_gridftp_server_hdfs.*
lrwxrwxrwx 1 root root  34 Sep 20 15:30 libglobus_gridftp_server_hdfs.so -> libglobus_gridftp_server_hdfs.so.0
lrwxrwxrwx 1 root root  38 Sep 20 15:30 libglobus_gridftp_server_hdfs.so.0 -> libglobus_gridftp_server_hdfs.so.0.0.1
-rwxr-xr-x 1 root root 96K Sep 20 15:30 libglobus_gridftp_server_hdfs.so.0.0.1

# cp -P ./gridftp-hdfs/libglobus_gridftp_server_hdfs.* /usr/local/lib64/

6) Add the JAVA C libraries, HADOOP libraries and DSI to the trusted library path. Do this by adding the appropriate paths to /etc/ld.so.conf.d/gridftp-hdfs-x86_64.conf then running ldconfig.

Example:

# export JAVA_HOME=/usr/lib/jvm/jre
# export HADOOP_ROOT=/usr/local
# export DSI_ROOT=/usr/local/lib64

# echo $JAVA_HOME/lib/amd64/server > /etc/ld.so.conf.d/gridftp-hdfs-x86_64.conf
# echo $HADOOP_ROOT/hadoop-2.8.1/lib/native/ >> /etc/ld.so.conf.d/gridftp-hdfs-x86_64.conf
# echo $DSI_ROOT >> /etc/ld.so.conf.d/gridftp-hdfs-x86_64.conf

# ldconfig
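
Once ldconfig has run, the linker cache can be inspected with ldconfig -p to confirm that the new paths were picked up. The sketch below is one way to do that. The helper name missing_libs is hypothetical, and the three library names it looks for are assumptions based on the directories added above.

```shell
#!/bin/bash
# Sketch only: report expected libraries that are absent from the linker cache.
# Assumption: the JVM, Hadoop and DSI libraries are named as listed below.

# missing_libs: reads `ldconfig -p` output on stdin and prints the name of
# each expected library that does not appear; returns non-zero if any is missing.
missing_libs() {
    local cache lib rc=0
    cache="$(cat)"
    for lib in libjvm.so libhdfs.so libglobus_gridftp_server_hdfs.so; do
        case "$cache" in
            *"$lib"*) ;;                # found somewhere in the cache listing
            *) echo "$lib"; rc=1 ;;     # not listed: report it
        esac
    done
    return $rc
}
```

Run it as ldconfig -p | missing_libs; no output and a zero exit status mean all three names were found in the cache.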

7) Configure a systemd drop-in for the globus-gridftp-server service to set the environment variables needed for the DSI to function.

Example:

# export JAVA_HOME=/usr/lib/jvm/jre
# export HADOOP_ROOT=/usr/local
# export DSI_ROOT=/usr/local/lib64

# mkdir -p /etc/systemd/system/globus-gridftp-server.service.d/

# echo '[Service]' > /etc/systemd/system/globus-gridftp-server.service.d/override.conf

# echo 'Environment="LD_LIBRARY_PATH='$JAVA_HOME'/lib/amd64/server:'$HADOOP_ROOT'/hadoop-2.8.1/lib/native/:'$DSI_ROOT'"' >> /etc/systemd/system/globus-gridftp-server.service.d/override.conf

# echo 'Environment="LIBHDFS_OPTS=-Xmx64m"' >> /etc/systemd/system/globus-gridftp-server.service.d/override.conf

# echo 'Environment="LD_PRELOAD=libjsig.so"' >> /etc/systemd/system/globus-gridftp-server.service.d/override.conf

# echo 'Environment="GRIDFTP_HDFS_CHECKSUMS=MD5"' >> /etc/systemd/system/globus-gridftp-server.service.d/override.conf

# echo 'Environment="JAVA_HOME='$JAVA_HOME'"' >> /etc/systemd/system/globus-gridftp-server.service.d/override.conf
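
The same drop-in can be written in one step with a here-document, which avoids the nested quoting in the echo commands. This is a sketch of an equivalent approach, not a required procedure. The helper name write_dropin is hypothetical, and its arguments mirror the JAVA_HOME, HADOOP_ROOT, and DSI_ROOT variables used above.

```shell
#!/bin/bash
# Sketch only: write the systemd drop-in with a single here-document.
# write_dropin is a hypothetical helper name.

# write_dropin JAVA_HOME HADOOP_ROOT DSI_ROOT OUTPUT_FILE
write_dropin() {
    local java_home="$1" hadoop_root="$2" dsi_root="$3" out="$4"
    # Double quotes inside the here-document are written literally.
    cat > "$out" <<EOF
[Service]
Environment="LD_LIBRARY_PATH=${java_home}/lib/amd64/server:${hadoop_root}/hadoop-2.8.1/lib/native/:${dsi_root}"
Environment="LIBHDFS_OPTS=-Xmx64m"
Environment="LD_PRELOAD=libjsig.so"
Environment="GRIDFTP_HDFS_CHECKSUMS=MD5"
Environment="JAVA_HOME=${java_home}"
EOF
}
```

For example, write_dropin "$JAVA_HOME" "$HADOOP_ROOT" "$DSI_ROOT" /etc/systemd/system/globus-gridftp-server.service.d/override.conf produces the same file as the echo commands. After systemctl daemon-reload, systemctl show globus-gridftp-server.service -p Environment lists the values systemd will apply.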

8) Create a helper launch script for GridFTP:

# export HADOOP_ROOT=/usr/local

# echo '#!/bin/bash' > /usr/sbin/globus-gridftp-server-hdfs
# echo "export CLASSPATH=$(${HADOOP_ROOT}/hadoop-2.8.1/bin/hadoop classpath --glob)" >> /usr/sbin/globus-gridftp-server-hdfs
# echo 'exec /usr/sbin/globus-gridftp-server "$@"' >> /usr/sbin/globus-gridftp-server-hdfs

# chmod +x /usr/sbin/globus-gridftp-server-hdfs

9) Create a file /etc/gridftp.d/gridftp-hdfs that contains only the following:

$GLOBUS_THREAD_MODEL pthread
blocksize 1048576
load_dsi_module hdfs
exec /usr/sbin/globus-gridftp-server-hdfs

10) Restart the GridFTP service.

# systemctl daemon-reload
# systemctl restart globus-gridftp-server.service

11) Create the /cksums directory in HDFS.

Example:

# export HADOOP_ROOT=/usr/local

# ${HADOOP_ROOT}/hadoop-2.8.1/bin/hdfs dfs -mkdir /cksums
# ${HADOOP_ROOT}/hadoop-2.8.1/bin/hdfs dfs -chown root /cksums
# ${HADOOP_ROOT}/hadoop-2.8.1/bin/hdfs dfs -chmod 700 /cksums
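
To confirm the result, the directory entry can be listed with hdfs dfs -ls -d /cksums and its mode and owner compared with the values set above. The sketch below assumes the usual listing layout, in which the first column is the permission string and the third is the owner. The helper name cksums_listing_ok is hypothetical.

```shell
#!/bin/bash
# Sketch only: check one line of `hdfs dfs -ls -d /cksums` output.
# Assumption: columns are permissions, replication, owner, group, and so on.

# cksums_listing_ok: reads one listing line on stdin; succeeds only if it
# describes a directory with mode 700 that is owned by root.
cksums_listing_ok() {
    local perms repl owner rest
    read -r perms repl owner rest || true
    [ "$perms" = "drwx------" ] && [ "$owner" = "root" ]
}
```

Run it as ${HADOOP_ROOT}/hadoop-2.8.1/bin/hdfs dfs -ls -d /cksums | cksums_listing_ok; a non-zero exit status means the mode or owner differs from what the chown and chmod steps were meant to set.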

Debugging Tips

The HDFS DSI writes events to the GridFTP log file. See the GridFTP documentation for details on configuring logging. To capture more verbose logging for the HDFS DSI, create a file /etc/gridftp.d/hdfs-debug with the following contents:

log_level ALL
log_single /var/log/gridftp.log
$GLOBUS_GRIDFTP_SERVER_DEBUG "ALL,/var/log/gridftp.log"

Basic Endpoint Functionality Test

After completing the installation, run some basic transfer tests against your endpoint to confirm that it is working. The Globus documentation describes a process for basic endpoint functionality testing.

