Building an Instagram Robot

David Yu
6 min read · Apr 22, 2020

Table of Contents

  1. Data Ingestion
  2. Recurrent Neural Network
  3. GPT-2
  4. Automated Posting
  5. Web App

Data Ingestion

I started this project with a vague goal: build a Python program to simulate user activity on my personal Instagram account, starting with programmatically posting images with captions. I had worked with OpenAI’s GPT-2 model before, which works on a monkey-see, monkey-do method of generating text (for the caption) based on its training data. With that in mind, I knew the first step was to ingest a decent-sized training dataset from Instagram. What I had no idea about was how to make GPT-2’s caption appear actually relevant to the photo in the post.

Unfortunately for my project, Instagram’s API features have been greatly limited. My solution was a Selenium-based web scraper that pulls real Instagram posts from the “Explore” tab. This data was used to train the caption-generating portion of the bot, so the output tends to be biased toward it. I assume Instagram tailors every user’s Explore tab to their browsing behavior, but since I don’t use Instagram, I consider my Explore tab to be close to the default. For some reason, this includes a lot of mycology and Harry Styles. I don’t know why. As a result, the final captions tend to feature both.
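The post doesn’t show the scraper itself, so the sketch below is a minimal stand-in for the Selenium-based approach. The CSS selector, scroll logic, and URL handling are illustrative assumptions rather than the actual code, and Instagram’s markup changes often enough that any real selector rots quickly.

```python
# Minimal sketch of a Selenium-based Explore-tab scraper.
# Selector and scroll logic are illustrative assumptions.
import time

EXPLORE_URL = "https://www.instagram.com/explore/"

def post_shortcode(href):
    """Pull the shortcode out of a post link like .../p/ABC123/."""
    parts = [p for p in href.split("/") if p]
    return parts[-1] if parts else ""

def scrape_explore(max_posts=50, pause=2.0):
    # Imports kept local so the helper above works without Selenium installed.
    from selenium import webdriver
    from selenium.webdriver.common.by import By

    driver = webdriver.Chrome()
    driver.get(EXPLORE_URL)
    shortcodes = set()
    while len(shortcodes) < max_posts:
        # Explore thumbnails are anchors linking to /p/<shortcode>/
        for a in driver.find_elements(By.CSS_SELECTOR, "a[href*='/p/']"):
            shortcodes.add(post_shortcode(a.get_attribute("href")))
        # Scroll to the bottom to trigger infinite scroll, then wait
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight);")
        time.sleep(pause)
    driver.quit()
    return sorted(shortcodes)[:max_posts]
```

From each post page you would then pull the image URL and the metadata (caption, timestamp, username, comments) the same way.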

The web scraper in action, this time on specific tags

The initial plan was to upload the scraped data to my personal S3 bucket, but I found that 10,000 scraped posts translated to roughly 1.3 GB of photos and roughly 25 MB of related metadata (captions, timestamps, usernames, a selection of comments, etc.). This was easily handled by my local computer.

Recurrent Neural Network

Through some persistent Googling I found this article, which uses a recurrent neural network to auto-generate captions for photos. I’ve seen that basic coding walkthrough in multiple articles around the internet, so I’m unsure who to credit for the initial work, but that link is the one I used. I should also note that I am not a data scientist; I often had to treat each model as a black box, which limited how efficient the surrounding code could be.

This network is trained on the Flickr8k dataset, which is composed of 8,000 photos, each with five captions describing what is in the photo. Unfortunately this dataset is no longer hosted by the University of Illinois, but it is not difficult to Google around and find it elsewhere. I trained the model on my own PC (super fun getting the proper versions of TensorFlow and its related libraries lined up and installed) with an Nvidia RTX 2070 Super GPU, which took less than half an hour.

This is the portion of the bot responsible for identifying what is in the photo it’s captioning, so its accuracy is skewed by the underlying Flickr training data. It tends to do well identifying people (their gender by hair length?), dogs, and motorcycles, but it’s somewhat random for anything more abstract or anything with text inside the photo itself, since these weren’t present in the training data. Part of the fun for me is trying to guess why the model labeled something the way it did when the label doesn’t make sense at face value.

This model spits out a one-sentence caption attempting to describe what’s in the photo. Its accuracy on actual Instagram posts is dubious at best, so I did some simple NLP work to parse out the nouns from the caption, which are then fed into GPT-2 as keywords for the actual caption generation.
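The parsing code isn’t shown in the post; below is a toy stand-in for that step, which filters the RNN’s caption down to likely nouns with a stopword list. A real implementation would use a proper POS tagger (e.g. NLTK or spaCy); the word list here is an illustrative assumption, not the actual code.

```python
# Toy noun extraction: strip function words plus the verbs and adjectives
# a Flickr-trained captioner commonly emits, leaving probable nouns to
# feed into GPT-2 as keywords. The word list is an illustrative
# assumption; a real version would use a POS tagger instead.
STOPWORDS = {
    "a", "an", "the", "in", "on", "of", "and", "with", "at", "to",
    "is", "are", "his", "her", "two", "through",
    "riding", "wearing", "standing", "sitting", "running", "jumping",
    "red", "blue", "white", "black", "brown", "young", "small", "large",
}

def extract_keywords(caption):
    """Return the words in an RNN caption that are probably nouns."""
    words = caption.lower().replace(".", "").split()
    return [w for w in words if w not in STOPWORDS]
```

So a caption like “a man in a red shirt is riding a motorcycle” reduces to the keywords `man`, `shirt`, `motorcycle`.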

GPT-2

OpenAI’s GPT-2 is remarkably good at generating text tuned to whatever corpus you fine-tune it on. In this case, I fine-tuned it on the 10,000 real captions I scraped from Instagram’s Explore tab. As mentioned, that corpus is heavily skewed toward Harry Styles and mycology.

This portion of the project would not have been possible without the work of Max Woolf, whose gpt-2-simple library I used extensively. Vanilla GPT-2 can take a keyword as a jumping-off point for text generation, but that keyword is literally used as the first word of the generated text, which, given the RNN’s nouns as input, would have greatly limited the resulting captions.

Max Woolf’s script for encoding training data before feeding it into GPT-2 was the key to linking GPT-2 and the RNN. It automatically parses the training data to tag keywords, so that at generation time GPT-2 can be fed the RNN’s nouns in a format it recognizes, somewhat akin to regex pattern matching.

I again trained this model on my own hardware, which took roughly 45 minutes to an hour.
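With gpt-2-simple, the fine-tune/generate cycle looks roughly like the sketch below. The file names and the keyword-delimiter format are assumptions (Max Woolf’s keyword-encoding script defines its own tagging format, simplified here); the `gpt_2_simple` calls themselves are the library’s real API.

```python
# Rough shape of the fine-tune / generate cycle with gpt-2-simple.
# File names and the delimiter format are assumptions.

def format_training_example(keywords, caption):
    """Pair keywords with a caption in a delimiter format GPT-2 can learn
    to complete (a simplified stand-in for Max Woolf's encoding script)."""
    return "<|startoftext|>" + " ".join(keywords) + " ~ " + caption + "<|endoftext|>"

def finetune_and_generate(keywords, steps=1000):
    import gpt_2_simple as gpt2  # local import: heavy TensorFlow dependency

    sess = gpt2.start_tf_sess()
    # 'captions_encoded.txt' would hold one formatted example per line
    gpt2.finetune(sess, "captions_encoded.txt", model_name="124M", steps=steps)
    # Prefix generation with the RNN's nouns so the caption stays on topic
    return gpt2.generate(
        sess,
        prefix="<|startoftext|>" + " ".join(keywords) + " ~ ",
        truncate="<|endoftext|>",
        return_as_list=True,
    )[0]
```

Because the model saw keywords paired with captions during fine-tuning, prefixing generation with new keywords nudges it toward a relevant caption without forcing the keyword to be the caption’s first word.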

Automated Posting

The final piece of the puzzle was to string the models together and post to Instagram (again, super fun getting all of the dependent libraries installed in compatible versions. Just want to emphasize how fun that was). Using Selenium’s ChromeDriver, I simulated a mobile phone on my desktop to go through the posting flow. Unfortunately, I ran into a large problem getting the actual text of the caption into Instagram’s text input box.

An automated post

GPT-2 is good enough that it includes emojis in its generated output. I highly valued this, as emojis are a key feature of Instagram captions, but ChromeDriver’s send_keys method only supports the literal keys on your keyboard, so it can’t send emojis. I tried a lot of things here. I tried using JavaScript to inject the text into the input field. I tried simulating Firefox instead (its driver can’t simulate mobile devices). The final, janky solution I’m using is to literally copy and paste the text in (through Python), simulating a CTRL-C, CTRL-V. For anyone following along at home, this means that if I click on something else while this portion of the program is running, it breaks. But if I don’t, it works. So I don’t.
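The root of the problem is that ChromeDriver only accepts characters in Unicode’s Basic Multilingual Plane, which excludes emoji. A sketch of the clipboard workaround, assuming `pyperclip` as the clipboard library (the actual library used isn’t named in the post, and the element handling is illustrative):

```python
# Sketch of the clipboard workaround for ChromeDriver's emoji problem.
# send_keys only accepts Basic Multilingual Plane characters, and emoji
# live above U+FFFF. pyperclip is an assumed choice of clipboard library.

def has_non_bmp(text):
    """True if the text contains characters send_keys would reject."""
    return any(ord(ch) > 0xFFFF for ch in text)

def paste_caption(driver, caption_box, caption):
    from selenium.webdriver.common.keys import Keys
    import pyperclip

    pyperclip.copy(caption)    # put the full caption (emoji included)
    caption_box.click()        # on the system clipboard, then paste it
    caption_box.send_keys(Keys.CONTROL, "v")
    # Fragile by design: anything else touching the clipboard or window
    # focus while this runs will break the paste.
```

This is why clicking elsewhere mid-run breaks the post: the paste goes to whatever has focus, and the clipboard is shared with everything else on the machine.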

Web App

It was a really cool feeling to complete a first post, but I figured I’d take it one step further and build a web app to let other people try it out too. This meant reconfiguring the data ingestion pipeline to allow for a single post upload. There’s a decent amount of detail I could go into here, but I’m getting tired of writing this post, so I’ll keep it simple.

Step one was to wrap the model in a Flask app I could run locally, along with a front end. Anyone who uses it will probably be able to tell I’m a back-end person working on a front end for the first time. There’s a lot of copy-pasted HTML, CSS, JavaScript functions, etc., and I had to learn about client-side and server-side validation, input sanitization, and so on. It’s messy, but it works.

The web app. It’s running on a non-GPU EC2 instance, so it’s a little slow…

Step two was to wrap it in a more production-ready Gunicorn app, then package it and its dependencies into a Docker container. Lesson learned here about forgetting to use a Python virtual environment.

Step three (or four, or whatever) was to create another Docker container running NGINX as a reverse proxy, then use Docker Compose to run both containers together, again locally. Once that was off the ground, I deployed the entire package to an EC2 instance on my AWS account. I was initially planning to deploy to an ECS cluster with a load balancer in place, but I got lazy and also wanted to keep my expenses down (avoiding running multiple not-free EC2 instances at once). Combined with a domain name I purchased, et voilà: a web app version for anyone to try.
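The two-container setup can be described by a Compose file along these lines. Service names, ports, and build contexts are assumptions, not the actual configuration:

```yaml
# Illustrative docker-compose.yml: the Gunicorn app behind an NGINX
# reverse proxy. Names, ports, and build contexts are assumptions.
version: "3"
services:
  app:
    build: ./app            # Dockerfile that runs: gunicorn app:app
    expose:
      - "8000"              # reachable by nginx, not by the host
  nginx:
    build: ./nginx          # Dockerfile that copies in the proxy config
    ports:
      - "80:80"             # the only publicly exposed port
    depends_on:
      - app
```

NGINX terminates the incoming HTTP connections and forwards them to Gunicorn on the internal network, so only port 80 is exposed on the EC2 instance.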

I may still get around to adding SSL encryption for the app, or putting Google AdSense in place so I can hopefully recoup the costs of running the server, but idk.

You can follow the account generating posts at @notdavidyu. The web app is available while I still feel like paying for it at www.thegram9000.com
