[Featured image: a magnifying glass focused on the letter H on a page of letters in alphabetical order, with a red background]

EFS-CSI-Driver and the curse of the missing (deleted) PVC in #Kubernetes

A colleague at work mentioned spotting this message in some security logs:

User: arn:aws:sts::123456789012:assumed-role/eks-cluster-efs/1111111111111111111 is not authorized to perform: elasticfilesystem:DeleteAccessPoint on the specified resource

I thought “Oh, that’s simple, it must be that the IAM policy we applied to the assumed role for the EFS driver was missing that permission”. It’s never that simple.

When I checked the logs for the efs-csi-controller pods, specifically the csi-provisioner container, I saw lots of blocks like this:

I0210 08:15:21.944620       1 event.go:389] "Event occurred" object="pvc-aaaaaaaa-1111-2222-3333-bbbbbbbbbbbb" fieldPath="" kind="PersistentVolume" apiVersion="v1" type="Warning" reason="VolumeFailedDelete" message="rpc error: code = Unauthenticated desc = Access Denied. Please ensure you have the right AWS permissions: Access denied"
E0210 08:15:21.944609       1 controller.go:1025] error syncing volume "pvc-aaaaaaaa-1111-2222-3333-bbbbbbbbbbbb": rpc error: code = Unauthenticated desc = Access Denied. Please ensure you have the right AWS permissions: Access denied
I0210 08:15:21.944598       1 controller.go:1007] "Retrying syncing volume" key="pvc-aaaaaaaa-1111-2222-3333-bbbbbbbbbbbb" failures=9
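
If you want to pull the same logs yourself, something along these lines works (assuming the driver is installed in kube-system, which is where both the Helm chart and the EKS add-on put it by default):

# Tail the provisioner's logs via the controller deployment
kubectl logs --namespace kube-system deployment/efs-csi-controller --container csi-provisioner --tail=50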

But wait… I don’t have a PVC with that key. And then it hit me. Several months ago we had a stuck PVC, and we ended up having to delete it in a really weird way, which included deleting the directory at the EFS level directly… I don’t even recall the details now, but I remember it being quite painful.

Anyway, to resolve the issue, find the IAM policy attached to the EFS driver’s role and look for this block:

{
  "Effect": "Allow",
  "Action": "elasticfilesystem:DeleteAccessPoint",
  "Resource": "*",
  "Condition": {
    "StringEquals": {
      "aws:ResourceTag/efs.csi.aws.com/cluster": "true"
    }
  }
}
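
If you’re not sure which policy that is, the AWS CLI can track it down; the role name below comes from the security log at the top, and the policy ARN is a placeholder you’d fill in from the first command’s output:

# List the policies attached to the role from the security log
aws iam list-attached-role-policies --role-name eks-cluster-efs

# Find which version of the policy is currently active
aws iam get-policy --policy-arn <policy-arn> --query Policy.DefaultVersionId

# Dump that version so you can hunt for the block above
aws iam get-policy-version --policy-arn <policy-arn> --version-id <version-id>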

And then remove the Condition block, so it looks like this:

{
  "Effect": "Allow",
  "Action": "elasticfilesystem:DeleteAccessPoint",
  "Resource": "*"
}

Apply this new policy, restart the efs-csi-controller deployment (or just delete all the pods in the deployment)… give it 2 minutes, and then re-apply the previous IAM policy (with the Condition block). Tada. All gone.
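
In command form, that sequence looks roughly like this (the policy ARN and file names are placeholders, and kube-system is again an assumption):

# Push the edited document (without the Condition) as the new default version
aws iam create-policy-version --policy-arn <policy-arn> --policy-document file://policy-without-condition.json --set-as-default

# Bounce the controller so the provisioner retries the stuck deletes
kubectl rollout restart --namespace kube-system deployment/efs-csi-controller
kubectl rollout status --namespace kube-system deployment/efs-csi-controller

# Once the orphaned volumes are gone, put the original policy back
aws iam create-policy-version --policy-arn <policy-arn> --policy-document file://policy-with-condition.json --set-as-default

One gotcha: IAM only keeps five versions of a managed policy, so if you hit the limit you’ll need aws iam delete-policy-version on an old version first.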

So what was happening? The cluster still had a PersistentVolume record for that ID, but the access point it pointed at had already gone, removed when the underlying path was deleted during that earlier clean-up. With no access point carrying the tag (aws:ResourceTag/efs.csi.aws.com/cluster = true), the Condition in the IAM policy could never match, so AWS returned Access Denied instead of letting the driver complete the delete.
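
You can check this from both sides. For dynamically provisioned EFS volumes, the PV’s volumeHandle is the filesystem ID and the access point ID joined by ::, so you can pull the access point ID out of the orphaned PV and see whether it still exists (the IDs below are made up):

# Grab the volumeHandle from the orphaned PersistentVolume
kubectl get pv pvc-aaaaaaaa-1111-2222-3333-bbbbbbbbbbbb -o jsonpath='{.spec.csi.volumeHandle}'
# Example output: fs-0123456789abcdef0::fsap-0123456789abcdef0

# List the access points (and their tags) that actually exist on that filesystem
aws efs describe-access-points --file-system-id fs-0123456789abcdef0

If the fsap-… ID from the PV doesn’t show up in that list (or shows up without the efs.csi.aws.com/cluster tag), you’ve found your culprit.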

I’d never have found it without stumbling on issue #1722 in the EFS CSI Driver GitHub repository, which linked to issue #522, and specifically this comment which said to delete the condition… even though the rest of the context isn’t quite right.

Featured image is “Magnifying glass” by “Michael Pedersen” on Flickr and is released under a CC-BY license.

