How to deploy a fine-tuned GPT-2 model using AWS SageMaker

Nick Valiotti
9 min read · Nov 1, 2022


Just a while ago, during the Linq conference, we presented a tweet generator styled after Elon Musk. To create it, we took a ready-made GPT-2 Medium model and fine-tuned it on our own dataset to capture his style. Fine-tuning requires a lot of resources, so not everyone can do a similar job on their PC; however, the challenge is easily solved with a platform like Google Colab, and we trained our model on Kaggle. After that, another task arises: the model has to be deployed to create a fully functional service.

Of course, there is more than one way to do this. Today, we’ll show you how to deploy a GPT-2 model to achieve real-time inference using Amazon SageMaker.

1. Open your AWS account.

2. Since we want to deploy a fine-tuned model, we need to upload all required files into the cloud. Use Search to find S3. Choose the first service.

3. Click on Create bucket.

4. Then we need to configure the bucket: specify its name and AWS region, adjust access permissions, etc. It’s enough to specify the details for the General configuration tab only.

5. Having set the details, click on Create bucket.

6. An Amazon S3 page will open where you will see something like this:

7. Then you will need to upload the model files. Click on the name of the created bucket and, in the opened window, select Upload.

8. Drag and drop the archive with the model files to the area or click on Add files. If needed, you can build a hierarchy within the bucket by adding folders using the Add folders option. It’s worth noting that the model files must be in an archive with the tar.gz extension.
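If your model files are not archived yet, you can package them with Python’s standard tarfile module. This is a minimal sketch; the model_files/ folder and the archive name are placeholders for your own files:

import tarfile

# pack the fine-tuned model files (weights, config, tokenizer) into a tar.gz archive;
# arcname="." keeps the files at the archive root, which SageMaker expects
with tarfile.open('gpt2-medium-musk.tar.gz', 'w:gz') as tar:
    tar.add('model_files/', arcname='.')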

9. Click on Upload and wait until the uploading process is complete.
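If you prefer to upload from a script rather than through the console, boto3 can do the same job. A small sketch, assuming the bucket name used later in this article:

import boto3

# upload the archive with the model files to the bucket created above
s3 = boto3.client('s3')
s3.upload_file('gpt2-medium-musk.tar.gz', 'my-bucket-for-gpt2', 'gpt2-medium-musk.tar.gz')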

10. When uploading is done, you can start deploying the model. Search for SageMaker and choose the first service.

11. To work with this tool, you must first set up a SageMaker Domain, so click on Get Started in the New to SageMaker? pop-up banner.

12. For a simplified single-user configuration, select Quick setup and click on Set up SageMaker Domain.

13. Specify the username and the IAM role. You can also create a new role and specify which S3 buckets the user will have access to. As an example, we will allow access to all buckets.

14. Click on Submit.

15. You will have to wait a bit until SageMaker Domain and the user are set up.

16. After it’s done, you will see the freshly created user in the list, and you will be able to launch Studio by clicking on Launch app. SageMaker Studio is an IDE for working with Jupyter notebooks in the AWS cloud.

17. Now is another spare moment to enjoy your coffee/tea.

18. Finally, we get into SageMaker Studio. Browsing among the tabs on the left, you can:

  • see your active repository where your notebooks and files will be stored;
  • see launched instances and applications, as well as Kernel and Terminal Sessions;
  • work with Git repositories;
  • manage SageMaker resources;
  • install extensions for Jupyter notebooks.

19. Let’s concisely introduce SageMaker JumpStart. This service offers pretrained, open-source models for a wide range of problems. You can train and fine-tune these models before deployment. JumpStart also provides solution templates to set up infrastructure for most common cases and executable notebooks for machine learning with SageMaker.

20. Despite the out-of-the-box solutions, to deploy our fine-tuned GPT-2 model, we will create a new notebook to specify everything we need. For this, click on + in the light-blue rectangle in the toolbar on top of the left panel. A Launcher tab will pop up. You will need to scroll down to the Notebooks and compute resources section and select Notebook Python 3 there.

21. Here goes another pause while the notebook’s kernel gets ready to work.

22. Finally, we can write the code.

23. It should be noted that you can designate the instance for a particular notebook. For example, you can always switch to a different instance if your model needs more resources. But remember that you will have to pay accordingly.

24. When using a notebook, you pay for its running time at the rate of the chosen instance type.

25. For a simple deployment of our model, we can use a ready-made snippet from Hugging Face: on the model page, click on Deploy and select Amazon SageMaker. Then choose the task (in our case, Text Generation) and the configuration (AWS in this example), and copy the code to the notebook.

26. Since we use our own fine-tuned model, not one from the Hugging Face repository, we have to make a few modifications to the code. Comment out the line with the ‘HF_MODEL_ID’ key in the hub dictionary and add a model_data key to HuggingFaceModel, where you specify the path to the archive with the model files.

# Hub Model configuration. https://huggingface.co/models
hub = {
    # 'HF_MODEL_ID': 'gpt2-medium',
    'HF_TASK': 'text-generation'
}

# create Hugging Face Model Class
huggingface_model = HuggingFaceModel(
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    env=hub,
    role=role,
    model_data='s3://my-bucket-for-gpt2/gpt2-medium-musk.tar.gz',
)

27. In the deploy method of huggingface_model, you can specify which instance to deploy the model on by indicating it in instance_type. Many instance types may be unavailable because your account lacks the needed quotas; you will have to request them from the AWS support team. In that case, an error like this will appear:

28. If the model has been created and deployed successfully (it takes some time, though), you can use the predict method.
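For reference, the deploy and test calls might look like the sketch below. The ml.m5.xlarge instance type is only an assumption; pick one your quotas allow:

# deploy the model to a real-time endpoint (the instance type is an example)
predictor = huggingface_model.deploy(
    initial_instance_count=1,
    instance_type='ml.m5.xlarge'
)

# quick test from the notebook
print(predictor.predict({"inputs": "Weed is"}))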

29. To call the endpoint from outside AWS, you must create an access key.

  • Use Search to find IAM. Choose the first service.
  • In the opened window, choose the Users tab and select the username you use now.
  • Go to the Security credentials tab and click on Create access key.
  • Copy the Access key ID and Secret access key and save them to a safe place.

30. Next, you need to find the name of the endpoint of the model. In Studio, go to the SageMaker resources tab in the left panel and open the Endpoints resource. Double-click on the name of your endpoint. A tab with details will open, where you can copy its name from.
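If you prefer to look the endpoint name up programmatically, the SageMaker boto3 client can list the endpoints in your account. A small sketch, assuming the same region used elsewhere in this article:

import boto3

# list the SageMaker endpoints in the account and their statuses
sm = boto3.client('sagemaker', region_name='us-east-1')
for ep in sm.list_endpoints()['Endpoints']:
    print(ep['EndpointName'], ep['EndpointStatus'])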

31. Now, let’s write the code to access the model from outside.

import boto3
import json
import time

endpoint_name = '<my_endpoint_name>'
aws_access_key_id = '<my_aws_access_key_id>'
aws_secret_access_key = '<my_aws_secret_access_key>'

sagemaker_runtime = boto3.client(
    "sagemaker-runtime",
    region_name='us-east-1',
    aws_access_key_id=aws_access_key_id,
    aws_secret_access_key=aws_secret_access_key
)

data = {"inputs": "Weed is"}

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    ContentType='application/json',
    Body=json.dumps(data, ensure_ascii=False).encode('utf8')
)

print(response['Body'].read().decode('utf-8'))

So, let’s test it:

32. Please note that if you follow the steps above, your model will use the default settings for generation. To add custom logic for model loading, pre- and post-processing of data, and prediction, you can create an inference.py file in Studio, stored alongside the notebook, and override the needed methods there. You can read more about this here.

To run this script during your model’s deployment, add one more parameter to HuggingFaceModel:

huggingface_model = HuggingFaceModel(
    transformers_version='4.17.0',
    pytorch_version='1.10.2',
    py_version='py38',
    env=hub,
    role=role,
    model_data='s3://my-bucket-for-gpt2/gpt2-medium-musk.tar.gz',
    entry_point='inference.py'
)

Of course, this modification does not apply to already created endpoints; you will have to redeploy the model.

Here is an example of the inference.py file that you can use to deploy a GPT-2 model:

import json
import torch
from transformers import GPT2Config, GPT2Tokenizer, GPT2LMHeadModel


def model_fn(model_dir):
    # load the fine-tuned model and its tokenizer from the unpacked archive
    configuration = GPT2Config.from_pretrained(model_dir, output_hidden_states=False)
    tokenizer = GPT2Tokenizer.from_pretrained(
        model_dir,
        bos_token='<|sos|>',
        eos_token='<|eos|>',
        pad_token='<|pad|>'
    )
    model = GPT2LMHeadModel.from_pretrained(model_dir, config=configuration)
    model.resize_token_embeddings(len(tokenizer))
    model.eval()
    return (model, tokenizer)


def input_fn(request_body, request_content_type):
    # deserialize the request body (JSON is expected)
    if request_content_type == "application/json":
        request = json.loads(request_body)
    else:
        request = request_body
    return request


def predict_fn(data, model_tokenizer):
    model, tokenizer = model_tokenizer
    inputs = data.pop("inputs", data)
    # fall back to a default generation length if the request does not set one
    max_length = data.pop("max_length", 50)
    input_ids = torch.tensor(tokenizer.encode(f'<|sos|>{inputs}')).unsqueeze(0)
    outputs = model.generate(
        input_ids,
        max_length=max_length,
        bos_token_id=tokenizer.bos_token_id,
        pad_token_id=tokenizer.pad_token_id,
        eos_token_id=tokenizer.eos_token_id,
        do_sample=True,
        top_k=0,
        top_p=0.95,
        no_repeat_ngram_size=4
    )
    decoded_output = tokenizer.decode(outputs[0])
    return {"decoded_output": decoded_output}
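With such a handler in place, a request can also pass the generation length that predict_fn reads above (the value here is only an example):

# example request; max_length is the optional key handled by predict_fn
predictor.predict({"inputs": "Weed is", "max_length": 60})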

33. Having finished all your work with the notebook in Studio, you must shut down all the resources you used. Unfortunately, simply closing Studio’s window does not stop them, so you have to do this manually; otherwise, you will keep paying for them. You can shut down everything you don’t need in Studio itself by choosing the Running Terminals and Kernels tab on the left.

Having shut down the notebook, you can double-check that all resources are halted on the Amazon SageMaker page. For this, you will have to click on the username and check your KernelGateway app’s status. The status should be set to Deleted.

34. When you don’t need the deployed model anymore, you will need to remove the endpoint. If you haven’t shut down the resources the notebook uses in Studio, you can do so right there by writing the following line:

predictor.delete_endpoint()

Another option to remove the endpoint is to go to the Amazon SageMaker page. There, on the left panel, you will need to choose the Inference tab; then, in the drop-down menu, select Endpoints, find the corresponding endpoint on the right, and click on Actions and Delete.

You can also delete the created models by going to Inference→Models and manage the endpoint configurations by going to Inference→Endpoint Configurations.
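The same cleanup can also be done from code with boto3, if you prefer. A sketch; the resource names are placeholders you would replace with your own:

import boto3

sm = boto3.client('sagemaker', region_name='us-east-1')

# delete the endpoint, its configuration, and the registered model
sm.delete_endpoint(EndpointName='<my_endpoint_name>')
sm.delete_endpoint_config(EndpointConfigName='<my_endpoint_config_name>')
sm.delete_model(ModelName='<my_model_name>')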

So, we have walked you step by step through the process of deploying a fine-tuned GPT-2 model for real-time inference using Amazon SageMaker. It is worth noting that there are multiple deployment scenarios, each with its advantages: an asynchronous method, batch processing, a cold start, and so on. The choice depends on the goal of the task. You can read more about other deployment methods using Amazon SageMaker here.

Written by Nick Valiotti
PhD, CEO & Founder of Valiotti Analytics, tennis enthusiast