ERROR when writing file to S3 bucket from EMRFS enabled Spark cluster

ERROR :

18/03/02 01:42:17 INFO RetryInvocationHandler: Exception while invoking ConsistencyCheckerS3FileSystem.mkdirs over null. Retrying after sleeping for 10000ms. com.amazon.ws.emr.hadoop.fs.consistency.exception.ConsistencyException: Directory ‘bucket/folder/_temporary’ present in the metadata but not s3 at com.amazon.ws.emr.hadoop.fs.consistency.ConsistencyCheckerS3FileSystem.getFileStatus(ConsistencyCheckerS3FileSystem.java:506)

 

Root cause :

Consistency errors like this usually come from one of the following:

  • Files or directories were deleted manually from the S3 console, so the EMRFS metadata still lists them.
  • Retry logic in Spark and Hadoop.
  • A process failed while creating a file on S3, but the entry had already been written to DynamoDB.
  • When Hadoop restarts the process, the entry is already present in DynamoDB, so it throws the consistency error.

Solution :

Clean up the EMRFS metadata in DynamoDB, then re-run your Spark job.

Follow the steps below to clean up and restore the metadata for the intended directory in the S3 bucket.

 

The first step deletes the metadata entries for all the objects in the path. emrfs delete uses a hash function to find the records, so it may also delete unwanted entries. That is why the import and sync steps follow it.

Delete all the metadata

emrfs delete s3://<bucket>/path

Import into DynamoDB the metadata for the objects that are physically present in S3

emrfs import s3://<bucket>/path 

Sync the metadata with the data in S3

emrfs sync s3://<bucket>/path 

After all these operations, check whether a particular object is present in both S3 and the metadata

emrfs diff s3://<bucket>/path 
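
The four steps above can be chained in a small wrapper. This is only a minimal sketch, assuming it runs on the cluster's master node where the emrfs command is available. The function name emrfs_repair and the DRY_RUN variable are hypothetical names introduced for illustration and are not part of EMRFS. By default it only prints the commands so they can be reviewed first.

```shell
#!/usr/bin/env bash
# Sketch: rebuild the EMRFS metadata for one S3 path.
# Assumption: the `emrfs` CLI is on PATH (e.g. on the EMR master node).
# `emrfs_repair` and `DRY_RUN` are illustrative names, not EMRFS features.

emrfs_repair() {
  local path="$1"
  local step
  # Same order as the manual steps: delete, import, sync, then diff to verify.
  for step in delete import sync diff; do
    echo "+ emrfs $step $path"
    # DRY_RUN defaults to 1, so nothing is executed unless DRY_RUN=0 is set.
    if [ "${DRY_RUN:-1}" != "1" ]; then
      # Stop at the first failing step instead of continuing blindly.
      emrfs "$step" "$path" || return 1
    fi
  done
}
```

To actually execute the commands rather than print them, call it with DRY_RUN=0, for example: DRY_RUN=0 emrfs_repair s3://<bucket>/path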