A colleague at work mentioned seeing this entry in some security logs:
User: arn:aws:sts::123456789012:assumed-role/eks-cluster-efs/1111111111111111111 is not authorized to perform: elasticfilesystem:DeleteAccessPoint on the specified resource
I thought “Oh, that’s simple, it must be that the IAM policy we applied to the assumed role for the EFS driver was missing that permission”. It’s never that simple.
When I checked the logs for the efs-csi-controller pods, specifically the csi-provisioner container, I saw lots of blocks like this:
I0210 08:15:21.944620 1 event.go:389] "Event occurred" object="pvc-aaaaaaaa-1111-2222-3333-bbbbbbbbbbbb" fieldPath="" kind="PersistentVolume" apiVersion="v1" type="Warning" reason="VolumeFailedDelete" message="rpc error: code = Unauthenticated desc = Access Denied. Please ensure you have the right AWS permissions: Access denied"
E0210 08:15:21.944609 1 controller.go:1025] error syncing volume "pvc-aaaaaaaa-1111-2222-3333-bbbbbbbbbbbb": rpc error: code = Unauthenticated desc = Access Denied. Please ensure you have the right AWS permissions: Access denied
I0210 08:15:21.944598 1 controller.go:1007] "Retrying syncing volume" key="pvc-aaaaaaaa-1111-2222-3333-bbbbbbbbbbbb" failures=9
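(For reference, I pulled those with something along these lines; I'm assuming here that the driver is installed in kube-system, which is where it usually lives, so adjust the namespace if yours differs.)
kubectl -n kube-system logs deployment/efs-csi-controller -c csi-provisioner --tail=50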
But wait… I don’t have a PVC with that key. And then it hit me. Several months ago we had a stuck PVC, and we ended up having to delete it in a really weird way, which included deleting the directory at the EFS level directly… I don’t even recall the details now, but I remember it being quite painful.
Anyway, to resolve the above issue, what you need to do is find the IAM policy attached to the EFS CSI driver's role, and look for this block:
{
  "Effect": "Allow",
  "Action": "elasticfilesystem:DeleteAccessPoint",
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:ResourceTag/efs.csi.aws.com/cluster": "true"
    }
  }
}
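(If you're not quite sure which managed policy that block lives in, the role name from the log line at the top is enough to track it down. Something like this should do it; the policy ARN and version here are placeholders, not the real names:)
aws iam list-attached-role-policies --role-name eks-cluster-efs
aws iam get-policy --policy-arn arn:aws:iam::123456789012:policy/EFSCSIDriverPolicy --query Policy.DefaultVersionId --output text
aws iam get-policy-version --policy-arn arn:aws:iam::123456789012:policy/EFSCSIDriverPolicy --version-id v3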
And then remove the condition, so it looks like this:
{
  "Effect": "Allow",
  "Action": "elasticfilesystem:DeleteAccessPoint",
  "Resource": "*"
}
Apply this new policy, restart the efs-csi-controller deployment (or just delete all the pods that are in the deployment)… give it 2 minutes, and then re-apply the previous IAM policy (with the Condition block). Tada. All gone.
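In CLI terms, that dance is roughly the following. The policy ARN is a placeholder again, I'm assuming the controller is in kube-system, and bear in mind IAM only keeps five versions of a managed policy, so you may have to delete an old version before you can push a new one:
# push the relaxed policy (no Condition block) as the new default version
aws iam create-policy-version --policy-arn arn:aws:iam::123456789012:policy/EFSCSIDriverPolicy --policy-document file://efs-policy-no-condition.json --set-as-default
# bounce the controller so it retries the stuck delete
kubectl -n kube-system rollout restart deployment efs-csi-controller
# ...wait a couple of minutes, then push the original policy (with the Condition) back the same way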
So what was happening? The cluster somehow still remembered there had been a volume with that PVC ID, but when it tried to delete it on the EFS side, things were no longer as the driver had left them: the underlying path had gone as part of that manual cleanup, and the access point it was trying to delete didn't carry the aws:ResourceTag/efs.csi.aws.com/cluster = true tag, so the Condition in the policy meant it couldn't delete it.
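If you want to sanity-check that theory against your own file system, the tags on each access point are easy enough to eyeball (the file system ID here is a placeholder):
aws efs describe-access-points --file-system-id fs-0123456789abcdef0 --query 'AccessPoints[].{Id:AccessPointId,Tags:Tags}'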
I’d never have found it without stumbling across issue #1722 in the EFS CSI Driver GitHub repository, which linked to issue #522, and specifically this comment which said to delete the condition… even though the rest of the context isn’t quite right.
Featured image is “Magnifying glass” by “Michael Pedersen” on Flickr and is released under a CC-BY license.