Snapshots on large indices fail on some shards when master election occurs #35229

@clandry94

Elasticsearch version (bin/elasticsearch --version): 6.4.1

Plugins installed: [analysis-kuromoji, analysis-icu, repository-gcs]

JVM version (java -version): 10.0.2

OS version (uname -a if on a Unix-like system): Linux 4.14.56+

Description of the problem including expected versus actual behavior:
Expected: Creating a snapshot of a large index (> 8 TB) with a high number of shards (> 800) on a cluster with dedicated master nodes completes successfully even if a master election occurs mid-snapshot.

Actual: If a master election occurs during the snapshot, it fails on some shards with IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: FileAlreadyExistsException, and the snapshot ends in the PARTIAL state.

I think this issue may already be known and handled for the case where an election occurs during the finalizeSnapshot step, as seen here: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java#L550. This looks like a similar bug. Perhaps the new master also attempts to snapshot the shard, and the result of the failed attempt overwrites the result of the successful one?
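To make the suspected failure mode concrete, here is a toy model of a blob store that refuses to overwrite an existing blob. It is not Elasticsearch or GCS client code; `ToyBlobStore` and `snapshot_shard_file` are hypothetical names, and the assumption that the repository writes index files with a "must not already exist" precondition is inferred only from the "Precondition Failed" text in the error.

```python
class ToyBlobStore:
    """In-memory stand-in for a repository that never overwrites blobs."""

    def __init__(self):
        self._blobs = {}

    def write_blob(self, name, data):
        # Mimics a "create only if absent" precondition on the write.
        if name in self._blobs:
            raise FileExistsError(name)
        self._blobs[name] = data


def snapshot_shard_file(store, blob_name, data):
    """Return the shard-level outcome of uploading one index file."""
    try:
        store.write_blob(blob_name, data)
        return "SUCCESS"
    except FileExistsError:
        return "FAILED"


store = ToyBlobStore()
blob = "indices/Y0qyoa00TuaVx0iYLpVbXw/178/__1oc"

# First attempt uploads the file; a second attempt for the same shard
# (for example, one restarted after a master change) hits the existing blob.
first = snapshot_shard_file(store, blob, b"segment bytes")
second = snapshot_shard_file(store, blob, b"segment bytes")
print(first, second)
```

If something like this is happening, the shard data from the first attempt may be intact in the repository even though the shard is reported as failed.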

Steps to reproduce:

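  1. Create a cluster with dedicated master nodes
  2. Create a large index at or above the sizes described above (> 8 TB and > 800 shards) and register a GCS snapshot repository
  3. Take a snapshot of the index to that repository (snapshots at this size take between 30 minutes and 2 hours); the shard failures appear when a master election occurs while the snapshot is running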
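The repository and snapshot steps can be sketched as the two REST requests below. This only builds the requests and does not send them; the repository name, bucket name, snapshot name, and index name are placeholders, not values from the affected cluster.

```python
import json


def repository_request(repo, bucket):
    """Request that registers a GCS snapshot repository."""
    body = {"type": "gcs", "settings": {"bucket": bucket}}
    return "PUT", f"/_snapshot/{repo}", body


def snapshot_request(repo, snapshot, index):
    """Request that starts a snapshot of one index without blocking."""
    path = f"/_snapshot/{repo}/{snapshot}?wait_for_completion=false"
    return "PUT", path, {"indices": index}


# Placeholder names, chosen only for illustration.
requests_to_send = [
    repository_request("my_gcs_repo", "my-snapshot-bucket"),
    snapshot_request("my_gcs_repo", "snapshot-1", "my_large_index"),
]

for method, path, body in requests_to_send:
    print(method, path)
    print(json.dumps(body, indent=2))
```

With the snapshot running, a master election then has to occur before it finishes; how to trigger one is left out here because the report does not say how it happened.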

Provide logs (if relevant):
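Snapshot status from _cat/snapshots/<repo name> (columns: id, status, start_epoch, start_time, end_epoch, end_time, duration, indices, successful_shards, failed_shards, total_shards):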
2018-10-16t19-57-07 PARTIAL 1539719828 19:57:08 1539722642 20:44:02 46.8m 70 2716 4 2720
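For readers unfamiliar with the headerless `_cat/snapshots` output, the line above can be decoded with a small parser. This is a sketch that assumes the default column order (id, status, start_epoch, start_time, end_epoch, end_time, duration, indices, successful_shards, failed_shards, total_shards); `CatSnapshotRow` and `parse_cat_snapshot_line` are names made up for this example.

```python
from typing import NamedTuple


class CatSnapshotRow(NamedTuple):
    id: str
    status: str
    start_epoch: int
    start_time: str
    end_epoch: int
    end_time: str
    duration: str
    indices: int
    successful_shards: int
    failed_shards: int
    total_shards: int


def parse_cat_snapshot_line(line):
    """Split one whitespace-separated _cat/snapshots row into named fields."""
    parts = line.split()
    if len(parts) != 11:
        raise ValueError(f"expected 11 columns, got {len(parts)}")
    return CatSnapshotRow(
        parts[0], parts[1], int(parts[2]), parts[3], int(parts[4]),
        parts[5], parts[6], int(parts[7]), int(parts[8]),
        int(parts[9]), int(parts[10]),
    )


row = parse_cat_snapshot_line(
    "2018-10-16t19-57-07 PARTIAL 1539719828 19:57:08 "
    "1539722642 20:44:02 46.8m 70 2716 4 2720"
)
print(row.status, row.failed_shards, "of", row.total_shards, "shards failed")
```

Read this way, the snapshot covered 70 indices and 2720 shards, of which 4 failed, which is why the overall state is PARTIAL rather than FAILED.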

Excerpt of the shard failures reported when viewing the snapshot status through the API:

{
  "index" : "<index_name>",
  "index_uuid" : "<index_name>",
  "shard_id" : 178,
  "reason" : "IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: FileAlreadyExistsException[indices/Y0qyoa00TuaVx0iYLpVbXw/178/__1oc: Precondition Failed]; ",
  "node_id" : "lkwyayn5QVGh8okWvRIJpg",
  "status" : "INTERNAL_SERVER_ERROR"
},
{
  "index" : "<index_name>",
  "index_uuid" : "<index_name>",
  "shard_id" : 45,
  "reason" : "IndexShardSnapshotFailedException[com.google.cloud.storage.StorageException: Error writing request body to server]; nested: StorageException[Error writing request body to server]; nested: IOException[Error writing request body to server]; ",
  "node_id" : "xlJM3LeIS2SlitmN71R6fA",
  "status" : "INTERNAL_SERVER_ERROR"
}
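With many failed shards it can help to group the failures by their innermost nested exception. The sketch below does that for the two `reason` strings shown above; the regular expression assumes the `nested: Name[...]` layout seen in these messages, and `innermost_exception` is a helper invented for this example.

```python
import re
from collections import Counter

# Matches the exception name in each "nested: SomeException[" segment.
NESTED = re.compile(r"nested: (\w+)\[")


def innermost_exception(reason):
    """Return the last nested exception name, or the outer one if none."""
    names = NESTED.findall(reason)
    if names:
        return names[-1]
    return reason.split("[", 1)[0]


reasons = [
    "IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; "
    "nested: FileAlreadyExistsException[indices/Y0qyoa00TuaVx0iYLpVbXw/178/__1oc: "
    "Precondition Failed]; ",
    "IndexShardSnapshotFailedException[com.google.cloud.storage.StorageException: "
    "Error writing request body to server]; nested: StorageException[Error writing "
    "request body to server]; nested: IOException[Error writing request body to server]; ",
]

counts = Counter(innermost_exception(r) for r in reasons)
for name, count in counts.items():
    print(name, count)
```

In this excerpt the two shards fail for different innermost reasons: one on an already existing blob and one on an I/O error while uploading.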
