Filesystems

Interface

enum arrow::fs::FileType

FileSystem entry type.

Values:

enumerator NotFound

Entry is not found.

enumerator Unknown

Entry exists but its type is unknown.

This can designate a special file such as a Unix socket or character device, or Windows NUL / CON / …

enumerator File

Entry is a regular file.

enumerator Directory

Entry is a directory.

struct arrow::fs::FileInfo : public arrow::util::EqualityComparable<FileInfo>

FileSystem entry info.

Public Functions

FileType type() const

The file type.

const std::string &path() const

The full file path in the filesystem.

std::string base_name() const

The file base name (the component after the last directory separator).

int64_t size() const

The size in bytes, if available.

Only regular files are guaranteed to have a size.

std::string extension() const

The file extension (excluding the dot).

TimePoint mtime() const

The time of last modification, if available.

struct ByPath

Function object implementing less-than comparison and hashing by path, to support sorting infos, using them as keys, and other interactions with the STL.

struct arrow::fs::FileSelector

File selector for filesystem APIs.

Public Members

std::string base_dir

The directory in which to select files.

If the path exists but doesn’t point to a directory, this should be an error.

bool allow_not_found

The behavior if base_dir isn’t found in the filesystem.

If false, an error is returned. If true, an empty selection is returned.

bool recursive

Whether to recurse into subdirectories.

int32_t max_recursion

The maximum number of subdirectories to recurse into.

class arrow::fs::FileSystem : public std::enable_shared_from_this<FileSystem>

Abstract file system API.

Subclassed by arrow::fs::HadoopFileSystem, arrow::fs::internal::MockFileSystem, arrow::fs::LocalFileSystem, arrow::fs::S3FileSystem, arrow::fs::SlowFileSystem, arrow::fs::SubTreeFileSystem, arrow::py::fs::PyFileSystem

Public Functions

Result<std::string> NormalizePath(std::string path)

Normalize path for the given filesystem.

The default implementation of this method is a no-op, but subclasses may allow normalizing irregular path forms (such as Windows local paths).

Result<FileInfo> GetFileInfo(const std::string &path) = 0

Get info for the given target.

Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).

Result<std::vector<FileInfo>> GetFileInfo(const std::vector<std::string> &paths)

Same, for many targets at once.

Result<std::vector<FileInfo>> GetFileInfo(const FileSelector &select) = 0

Same, according to a selector.

The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.

Status CreateDir(const std::string &path, bool recursive = true) = 0

Create a directory and subdirectories.

This function succeeds if the directory already exists.

Status DeleteDir(const std::string &path) = 0

Delete a directory and its contents, recursively.

Status DeleteDirContents(const std::string &path) = 0

Delete a directory’s contents, recursively.

Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed, see DeleteRootDirContents.

Status DeleteRootDirContents() = 0

EXPERIMENTAL: Delete the root directory’s contents, recursively.

Implementations may decide to raise an error if this operation is too dangerous.

Status DeleteFile(const std::string &path) = 0

Delete a file.

Status DeleteFiles(const std::vector<std::string> &paths)

Delete many files.

The default implementation issues individual delete operations in sequence.

Status Move(const std::string &src, const std::string &dest) = 0

Move / rename a file or directory.

If the destination exists:

  • if it is a non-empty directory, an error is returned

  • otherwise, if it has the same type as the source, it is replaced

  • otherwise, behavior is unspecified (implementation-dependent).

Status CopyFile(const std::string &src, const std::string &dest) = 0

Copy a file.

If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.

Result<std::shared_ptr<io::InputStream>> OpenInputStream(const std::string &path) = 0

Open an input stream for sequential reading.

Result<std::shared_ptr<io::InputStream>> OpenInputStream(const FileInfo &info)

Open an input stream for sequential reading.

This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).

Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const std::string &path) = 0

Open an input file for random access reading.

Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const FileInfo &info)

Open an input file for random access reading.

This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).

Result<std::shared_ptr<io::OutputStream>> OpenOutputStream(const std::string &path) = 0

Open an output stream for sequential writing.

If the target already exists, existing data is truncated.

Result<std::shared_ptr<io::OutputStream>> OpenAppendStream(const std::string &path) = 0

Open an output stream for appending.

If the target doesn’t exist, a new empty file is created.

High-level factory function

Result<std::shared_ptr<FileSystem>> FileSystemFromUri(const std::string &uri, std::string *out_path = NULLPTR)

Create a new FileSystem by URI.

Recognized schemes are “file”, “mock”, “hdfs” and “s3”.

Returns the FileSystem instance. If out_path is non-null, the path inside the filesystem is stored in it.

Result<std::shared_ptr<FileSystem>> FileSystemFromUriOrPath(const std::string &uri, std::string *out_path = NULLPTR)

Create a new FileSystem by URI.

Same as FileSystemFromUri, but in addition also recognize non-URIs and treat them as local filesystem paths. Only absolute local filesystem paths are allowed.

Concrete implementations

class arrow::fs::SubTreeFileSystem : public arrow::fs::FileSystem

A FileSystem implementation that delegates to another implementation after prepending a fixed base path.

This is useful to expose a logical view of a subtree of a filesystem, for example a directory in a LocalFileSystem.

This works on abstract paths, i.e. paths using forward slashes and a single root “/”. Windows paths are not guaranteed to work.

This makes no security guarantee. For example, symlinks may allow callers to “escape” the subtree and access other parts of the underlying filesystem.

Public Functions

Result<std::string> NormalizePath(std::string path) override

Normalize path for the given filesystem.

The default implementation of this method is a no-op, but subclasses may allow normalizing irregular path forms (such as Windows local paths).

Result<FileInfo> GetFileInfo(const std::string &path) override

Get info for the given target.

Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).

Result<std::vector<FileInfo>> GetFileInfo(const FileSelector &select) override

Same, according to a selector.

The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.

Status CreateDir(const std::string &path, bool recursive = true) override

Create a directory and subdirectories.

This function succeeds if the directory already exists.

Status DeleteDir(const std::string &path) override

Delete a directory and its contents, recursively.

Status DeleteDirContents(const std::string &path) override

Delete a directory’s contents, recursively.

Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed, see DeleteRootDirContents.

Status DeleteRootDirContents() override

EXPERIMENTAL: Delete the root directory’s contents, recursively.

Implementations may decide to raise an error if this operation is too dangerous.

Status DeleteFile(const std::string &path) override

Delete a file.

Status Move(const std::string &src, const std::string &dest) override

Move / rename a file or directory.

If the destination exists:

  • if it is a non-empty directory, an error is returned

  • otherwise, if it has the same type as the source, it is replaced

  • otherwise, behavior is unspecified (implementation-dependent).

Status CopyFile(const std::string &src, const std::string &dest) override

Copy a file.

If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.

Result<std::shared_ptr<io::InputStream>> OpenInputStream(const std::string &path) override

Open an input stream for sequential reading.

Result<std::shared_ptr<io::InputStream>> OpenInputStream(const FileInfo &info) override

Open an input stream for sequential reading.

This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).

Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const std::string &path) override

Open an input file for random access reading.

Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const FileInfo &info) override

Open an input file for random access reading.

This override assumes the given FileInfo validly represents the file’s characteristics, and may optimize access depending on them (for example avoid querying the file size or its existence).

Result<std::shared_ptr<io::OutputStream>> OpenOutputStream(const std::string &path) override

Open an output stream for sequential writing.

If the target already exists, existing data is truncated.

Result<std::shared_ptr<io::OutputStream>> OpenAppendStream(const std::string &path) override

Open an output stream for appending.

If the target doesn’t exist, a new empty file is created.

struct arrow::fs::LocalFileSystemOptions

Options for the LocalFileSystem implementation.

Public Members

bool use_mmap = false

Whether OpenInputStream and OpenInputFile return a mmap’ed file, or a regular one.

Public Static Functions

LocalFileSystemOptions Defaults()

Initialize with defaults.

class arrow::fs::LocalFileSystem : public arrow::fs::FileSystem

A FileSystem implementation accessing files on the local machine.

This class handles only /-separated paths. If desired, conversion from Windows backslash-separated paths should be done by the caller. Details such as symlinks are abstracted away (symlinks are always followed, except when deleting an entry).

Public Functions

Result<std::string> NormalizePath(std::string path) override

Normalize path for the given filesystem.

The default implementation of this method is a no-op, but subclasses may allow normalizing irregular path forms (such as Windows local paths).

Result<FileInfo> GetFileInfo(const std::string &path) override

Get info for the given target.

Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).

Result<std::vector<FileInfo>> GetFileInfo(const FileSelector &select) override

Same, according to a selector.

The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.

Status CreateDir(const std::string &path, bool recursive = true) override

Create a directory and subdirectories.

This function succeeds if the directory already exists.

Status DeleteDir(const std::string &path) override

Delete a directory and its contents, recursively.

Status DeleteDirContents(const std::string &path) override

Delete a directory’s contents, recursively.

Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed, see DeleteRootDirContents.

Status DeleteRootDirContents() override

EXPERIMENTAL: Delete the root directory’s contents, recursively.

Implementations may decide to raise an error if this operation is too dangerous.

Status DeleteFile(const std::string &path) override

Delete a file.

Status Move(const std::string &src, const std::string &dest) override

Move / rename a file or directory.

If the destination exists:

  • if it is a non-empty directory, an error is returned

  • otherwise, if it has the same type as the source, it is replaced

  • otherwise, behavior is unspecified (implementation-dependent).

Status CopyFile(const std::string &src, const std::string &dest) override

Copy a file.

If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.

Result<std::shared_ptr<io::InputStream>> OpenInputStream(const std::string &path) override

Open an input stream for sequential reading.

Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const std::string &path) override

Open an input file for random access reading.

Result<std::shared_ptr<io::OutputStream>> OpenOutputStream(const std::string &path) override

Open an output stream for sequential writing.

If the target already exists, existing data is truncated.

Result<std::shared_ptr<io::OutputStream>> OpenAppendStream(const std::string &path) override

Open an output stream for appending.

If the target doesn’t exist, a new empty file is created.

struct arrow::fs::S3Options

Options for the S3FileSystem implementation.

Public Functions

void ConfigureDefaultCredentials()

Configure with the default AWS credentials provider chain.

void ConfigureAnonymousCredentials()

Configure with anonymous credentials. This will only let you access public buckets.

void ConfigureAccessKey(const std::string &access_key, const std::string &secret_key, const std::string &session_token = "")

Configure with explicit access and secret key.

void ConfigureAssumeRoleCredentials(const std::string &role_arn, const std::string &session_name = "", const std::string &external_id = "", int load_frequency = 900, const std::shared_ptr<Aws::STS::STSClient> &stsClient = NULLPTR)

Configure with credentials from an assumed role.

Public Members

std::string region

AWS region to connect to.

If unset, the AWS SDK will choose a default value. The exact algorithm depends on the SDK version. Before 1.8, the default is hardcoded to “us-east-1”. Since 1.8, several heuristics are used to determine the region (environment variables, configuration profile, EC2 metadata server).

std::string endpoint_override

If non-empty, override region with a connect string such as “localhost:9000”.

std::string scheme = "https"

S3 connection transport, default “https”.

std::string role_arn

ARN of role to assume.

std::string session_name

Optional identifier for an assumed role session.

std::string external_id

Optional external identifier to pass to STS when assuming a role.

int load_frequency

Frequency (in seconds) to refresh temporary credentials from assumed role.

std::shared_ptr<Aws::Auth::AWSCredentialsProvider> credentials_provider

AWS credentials provider.

bool background_writes = true

Whether OutputStream writes will be issued in the background, without blocking.

Public Static Functions

S3Options Defaults()

Initialize with default credentials provider chain.

This is recommended if you use the standard AWS environment variables and/or configuration file.

S3Options Anonymous()

Initialize with anonymous credentials.

This will only let you access public buckets.

S3Options FromAccessKey(const std::string &access_key, const std::string &secret_key, const std::string &session_token = "")

Initialize with explicit access and secret key.

Optionally, a session token may also be provided for temporary credentials (from STS).

S3Options FromAssumeRole(const std::string &role_arn, const std::string &session_name = "", const std::string &external_id = "", int load_frequency = 900, const std::shared_ptr<Aws::STS::STSClient> &stsClient = NULLPTR)

Initialize from an assumed role.

class arrow::fs::S3FileSystem : public arrow::fs::FileSystem

S3-backed FileSystem implementation.

Some implementation notes:

  • buckets are special and the operations available on them may be limited or more expensive than desired.

Public Functions

S3Options options() const

Return the S3 options that were originally used to construct the filesystem.

std::string region() const

Return the actual region this filesystem connects to.

Result<FileInfo> GetFileInfo(const std::string &path) override

Get info for the given target.

Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).

Result<std::vector<FileInfo>> GetFileInfo(const FileSelector &select) override

Same, according to a selector.

The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.

Status CreateDir(const std::string &path, bool recursive = true) override

Create a directory and subdirectories.

This function succeeds if the directory already exists.

Status DeleteDir(const std::string &path) override

Delete a directory and its contents, recursively.

Status DeleteDirContents(const std::string &path) override

Delete a directory’s contents, recursively.

Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed, see DeleteRootDirContents.

Status DeleteRootDirContents() override

EXPERIMENTAL: Delete the root directory’s contents, recursively.

Implementations may decide to raise an error if this operation is too dangerous.

Status DeleteFile(const std::string &path) override

Delete a file.

Status Move(const std::string &src, const std::string &dest) override

Move / rename a file or directory.

If the destination exists:

  • if it is a non-empty directory, an error is returned

  • otherwise, if it has the same type as the source, it is replaced

  • otherwise, behavior is unspecified (implementation-dependent).

Status CopyFile(const std::string &src, const std::string &dest) override

Copy a file.

If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.

Result<std::shared_ptr<io::InputStream>> OpenInputStream(const std::string &path) override

Create a sequential input stream for reading from an S3 object.

NOTE: Reads from the stream will be synchronous and unbuffered. You may want to wrap the stream in a BufferedInputStream or use a custom readahead strategy to avoid idle waits.

Result<std::shared_ptr<io::InputStream>> OpenInputStream(const FileInfo &info) override

Create a sequential input stream for reading from an S3 object.

This override avoids a HEAD request by assuming the FileInfo contains correct information.

Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const std::string &path) override

Create a random access file for reading from an S3 object.

See OpenInputStream for performance notes.

Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const FileInfo &info) override

Create a random access file for reading from an S3 object.

This override avoids a HEAD request by assuming the FileInfo contains correct information.

Result<std::shared_ptr<io::OutputStream>> OpenOutputStream(const std::string &path) override

Create a sequential output stream for writing to an S3 object.

NOTE: Writes to the stream will be buffered. Depending on S3Options.background_writes, they can be synchronous or not. It is recommended to enable background_writes unless you prefer implementing your own background execution strategy.

Result<std::shared_ptr<io::OutputStream>> OpenAppendStream(const std::string &path) override

Open an output stream for appending.

If the target doesn’t exist, a new empty file is created.

Public Static Functions

Result<std::shared_ptr<S3FileSystem>> Make(const S3Options &options)

Create an S3FileSystem instance from the given options.

struct arrow::fs::HdfsOptions

Options for the HDFS implementation.

Public Members

io::HdfsConnectionConfig connection_config

HDFS connection configuration; contains host, port, and driver.

int32_t buffer_size = 0

Used by Hdfs OpenWritable Interface.

class arrow::fs::HadoopFileSystem : public arrow::fs::FileSystem

HDFS-backed FileSystem implementation.

Implementation notes:

  • This is a wrapper around arrow/io/hdfs, so that HDFS can be accessed through the FileSystem API.

Public Functions

Result<FileInfo> GetFileInfo(const std::string &path) override

Get info for the given target.

Any symlink is automatically dereferenced, recursively. A nonexistent or unreachable file returns an Ok status and has a FileType of value NotFound. An error status indicates a truly exceptional condition (low-level I/O error, etc.).

Result<std::vector<FileInfo>> GetFileInfo(const FileSelector &select) override

Same, according to a selector.

The selector’s base directory will not be part of the results, even if it exists. If it doesn’t exist, see FileSelector::allow_not_found.

Status CreateDir(const std::string &path, bool recursive = true) override

Create a directory and subdirectories.

This function succeeds if the directory already exists.

Status DeleteDir(const std::string &path) override

Delete a directory and its contents, recursively.

Status DeleteDirContents(const std::string &path) override

Delete a directory’s contents, recursively.

Like DeleteDir, but doesn’t delete the directory itself. Passing an empty path (“” or “/”) is disallowed, see DeleteRootDirContents.

Status DeleteRootDirContents() override

EXPERIMENTAL: Delete the root directory’s contents, recursively.

Implementations may decide to raise an error if this operation is too dangerous.

Status DeleteFile(const std::string &path) override

Delete a file.

Status Move(const std::string &src, const std::string &dest) override

Move / rename a file or directory.

If the destination exists:

  • if it is a non-empty directory, an error is returned

  • otherwise, if it has the same type as the source, it is replaced

  • otherwise, behavior is unspecified (implementation-dependent).

Status CopyFile(const std::string &src, const std::string &dest) override

Copy a file.

If the destination exists and is a directory, an error is returned. Otherwise, it is replaced.

Result<std::shared_ptr<io::InputStream>> OpenInputStream(const std::string &path) override

Open an input stream for sequential reading.

Result<std::shared_ptr<io::RandomAccessFile>> OpenInputFile(const std::string &path) override

Open an input file for random access reading.

Result<std::shared_ptr<io::OutputStream>> OpenOutputStream(const std::string &path) override

Open an output stream for sequential writing.

If the target already exists, existing data is truncated.

Result<std::shared_ptr<io::OutputStream>> OpenAppendStream(const std::string &path) override

Open an output stream for appending.

If the target doesn’t exist, a new empty file is created.

Public Static Functions

Result<std::shared_ptr<HadoopFileSystem>> Make(const HdfsOptions &options)

Create a HadoopFileSystem instance from the given options.