Elasticsearch version (bin/elasticsearch --version): 6.4.1
Plugins installed: [analysis-kuromoji, analysis-icu, repository-gcs]
JVM version (java -version): 10.0.2
OS version (uname -a if on a Unix-like system): Linux 4.14.56+
Description of the problem including expected versus actual behavior:
Expected: Creating a snapshot of a large index (>8TB) with a high number of shards (>800) on a cluster that has dedicated master nodes completes successfully even if a master election occurs mid-snapshot.
Actual: Creating a snapshot of a large index (>8TB) with a high number of shards (>800) on a cluster that has dedicated master nodes fails on some shards with IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: FileAlreadyExistsException if a master election occurs during the snapshot.
I think this issue may be known and handled in the case where an election occurs during the finalizeSnapshot step, as seen here: https://github.com/elastic/elasticsearch/blob/master/server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java#L550. This looks like a similar bug. Perhaps the new master also attempts to snapshot the shard, overwriting the successful snapshot with the failed one?
Steps to reproduce:
Please include a minimal but complete recreation of the problem, including
(e.g.) index creation, mappings, settings, query, etc. The easier you make it for
us to reproduce, the more likely it is that somebody will take the time to look at it.
- Create a cluster with dedicated master nodes
- Register a GCS snapshot repository and create a large index matching the specs above (>8TB index size and >800 shards)
- Perform a snapshot (snapshots at this size will take between 30m and 2 hours)
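The steps above can be sketched in console form; the repository, bucket, and index names here are placeholders, and the shard count is just chosen to match the >800-shard scenario:

```
PUT _snapshot/my_gcs_repo
{
  "type": "gcs",
  "settings": {
    "bucket": "my-snapshot-bucket"
  }
}

PUT my_large_index
{
  "settings": {
    "number_of_shards": 900,
    "number_of_replicas": 1
  }
}

PUT _snapshot/my_gcs_repo/my_snapshot?wait_for_completion=false
{
  "indices": "my_large_index"
}
```

A master election then needs to occur while the snapshot is running (e.g. by restarting the current master) to hit the failure.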
Provide logs (if relevant):
Snapshot status in _cat/snapshots/<repo name>
id                  status  start_epoch start_time end_epoch  end_time duration indices successful_shards failed_shards total_shards
2018-10-16t19-57-07 PARTIAL 1539719828  19:57:08   1539722642 20:44:02 46.8m    70      2716              4             2720
Snippet of errors seen in logs from viewing snapshot status in API
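These failures were retrieved via the snapshot status API; the call looks roughly like this (repo name is a placeholder, snapshot name taken from the _cat output above):

```
GET _snapshot/<repo name>/2018-10-16t19-57-07/_status
```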
[
  {
    "index" : "<index_name>",
    "index_uuid" : "<index_name>",
    "shard_id" : 178,
    "reason" : "IndexShardSnapshotFailedException[Failed to perform snapshot (index files)]; nested: FileAlreadyExistsException[indices/Y0qyoa00TuaVx0iYLpVbXw/178/__1oc: Precondition Failed]; ",
    "node_id" : "lkwyayn5QVGh8okWvRIJpg",
    "status" : "INTERNAL_SERVER_ERROR"
  },
  {
    "index" : "<index_name>",
    "index_uuid" : "<index_name>",
    "shard_id" : 45,
    "reason" : "IndexShardSnapshotFailedException[com.google.cloud.storage.StorageException: Error writing request body to server]; nested: StorageException[Error writing request body to server]; nested: IOException[Error writing request body to server]; ",
    "node_id" : "xlJM3LeIS2SlitmN71R6fA",
    "status" : "INTERNAL_SERVER_ERROR"
  }
]