Machine learning scholar adventure: Chapter 6

Generative caption: A photo of a robot hand drawing, digital art
Credit: OpenAI Dall-E2

What progress did I make?

Welcome to chapter 6 of the machine learning scholar adventure. I’ve been hella busy levelling up my machine learning skills since the last edition and time has flashed by for me thanks to all the fun I’ve been having! I’m also happy to report that I made my 100th tweet based on the #100DaysOfMLCode challenge 🔥 Although there were periods where I didn’t participate and days where I did some ML but forgot to tweet, it still feels good to have started and completed the challenge. My 99th and 100th challenge tweet:

Day 100 ~ Finally at 100 and can confirm that evolving my #MachineLearning skills is now a habit 🔥 Today, I did more of the hypothesis testing, unsupervised learning and intro to sql modules on @datacamp 🦾#100DaysOfMLCode #100DaysOfCode #CodeNewbie #AI #DataScience
— Azuremis (@azuremis) July 8, 2022

Challenge accepted. Challenge completed.

For anyone learning ML or any other skill for that matter, I would highly recommend doing a 100DaysofX challenge where X is the skill you want to develop. Having a written record like this really helped maintain my enthusiasm, enabled me to appreciate the small wins along the way and also helped me become more consistent. I’m sure it can do the same for you! Given all the benefits mentioned, I’m going to keep going as it’s now become an enjoyable habit! On that note, here’s what I’ve been up to recently:

Machine learning

Datacamp: Data scientist with python

I’m now 86% of the way through the track and can’t wait to finish! My ability to work with data has definitely improved and this exposure to various methodologies and tools has really broadened my awareness of the data science stack. The certificates for the courses I’ve completed recently are linked below:

I also finished the data literacy fundamentals track which gave me a high level overview of related specialties like data engineering and cloud computing. In hindsight, I wish I’d done the entirety of this before beginning the data scientist with python track as it provides basic context which helps bridge the gap to more advanced tracks.

I’m immensely grateful to the FB Flames Foundation for accepting me on their data analytics program which gives me 1 year’s access to Datacamp Premium for free (normally £250/year). Thanks to this, I’m able to work through tracks more gradually leaving more time and energy to focus on projects where truly accelerated learning happens! The program is currently on waitlist and is super popular so I encourage you to apply now to get a chance at this amazing opportunity.

OpenAI’s GPT3

I made my first mini-project with GPT3. Inspired by attending some parts of the Deep learning labs GPT3 online hackathon, I made a prototype summarisation tool for legal terms and conditions. Why? Well I don’t know about you but I hardly ever read them for online services…so I thought it would be great if GPT3 could do the reading and explain them to me in simple terms. I used streamlit to serve the model-generated summary. The code is available here and if you have an OpenAI GPT3 key (it’s free), you can also interact with it. See the readme in the github repo for how to get a key.

A screenshot of a form asking for a legal statement and OpenAI API key — Make it simple abeg

This project allowed me to explore the fundamental capabilities of GPT3 and I also learned how useful streamlit is for showcasing data-centric applications. I’m now working on a more advanced project which involves fine-tuning GPT3 on a customised data set. I can’t say any more for now because I’d rather build then tell 😉
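For anyone curious about the shape of a tool like this, here is a minimal sketch of the summarisation step. It is not the code from my repo: the prompt wording, the `MAX_CHARS` cap and the function names are all assumptions made for illustration, and the model call is passed in as a plain function so the example runs without an API key or any particular client library.

```python
from typing import Callable

MAX_CHARS = 6000  # hypothetical cap so the prompt stays within a model's context limit


def build_summary_prompt(legal_text: str, max_chars: int = MAX_CHARS) -> str:
    """Wrap legal text in a plain-English summarisation instruction."""
    cleaned = " ".join(legal_text.split())  # collapse newlines and repeated spaces
    if len(cleaned) > max_chars:
        cleaned = cleaned[:max_chars]  # naive truncation; chunking would be better
    return (
        "Summarise the following terms and conditions in simple language "
        "that a non-lawyer can understand:\n\n"
        f"{cleaned}\n\nSummary:"
    )


def summarise(legal_text: str, complete: Callable[[str], str]) -> str:
    """Send the prompt to any text-completion function and tidy the reply."""
    if not legal_text.strip():
        raise ValueError("No legal text provided")
    return complete(build_summary_prompt(legal_text)).strip()


def fake_complete(prompt: str) -> str:
    """Stand-in for a real model call (assumption: returns raw completion text)."""
    return " You agree to let the service store your data. "


print(summarise("The User hereby grants\nthe Provider...", fake_complete))
```

In a real app, `fake_complete` would be swapped for a function that calls the model, and a front end such as streamlit would collect the text and display the returned summary.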

Kaggle

Titanic

Last time, I managed to get a score of 0.78468 which corresponded to top 15% in the rankings. With further refinements, my most recent score is 0.78947 which is in the top 8% of rankings. After tuning an xgboost model with grid search cross-validation, I then ensembled all the tuned models together with a voting classifier. You can view my notebook here. I also tidied up the notebook as I want to make my work understandable for myself and future collaborators.
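The tune-then-ensemble pattern can be sketched roughly as below. This is not my actual notebook: the data is synthetic, the parameter grids are made up, soft voting is an assumption, and scikit-learn's `GradientBoostingClassifier` stands in for xgboost so the example only needs scikit-learn.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import (
    GradientBoostingClassifier,
    RandomForestClassifier,
    VotingClassifier,
)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the Titanic features (not the real competition data).
X, y = make_classification(
    n_samples=400, n_features=8, n_informative=5, random_state=0
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Hypothetical candidate models with small illustrative grids.
candidates = {
    "gb": (GradientBoostingClassifier(random_state=0),
           {"n_estimators": [50, 100], "max_depth": [2, 3]}),
    "rf": (RandomForestClassifier(random_state=0),
           {"n_estimators": [50, 100], "max_depth": [3, None]}),
    "lr": (LogisticRegression(max_iter=1000),
           {"C": [0.1, 1.0, 10.0]}),
}

# Tune each model with 5-fold grid search cross-validation.
tuned = []
for name, (model, grid) in candidates.items():
    search = GridSearchCV(model, grid, cv=5, scoring="accuracy")
    search.fit(X_train, y_train)
    tuned.append((name, search.best_estimator_))
    print(name, search.best_params_, round(search.best_score_, 3))

# Soft voting averages the predicted class probabilities of the tuned models.
ensemble = VotingClassifier(estimators=tuned, voting="soft")
ensemble.fit(X_train, y_train)
test_accuracy = ensemble.score(X_test, y_test)
print("ensemble accuracy:", round(test_accuracy, 3))
```

Note that `VotingClassifier` refits copies of the tuned estimators on the training data, so the held-out split gives a fair check before submitting.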

Interestingly, there are perfect scores for this competition but I found out this was because people had submitted the actual answers to the test set. Given this and the abundance of other competition opportunities, it made no sense to invest more time in this competition when I’d already reached an acceptable performance level so I left the Titanic to rest.

Phrase matching

Things are getting sweeter in the phrase to phrase matching competition. I finished exploring Jeremy’s starter notebook, which helped me get a private score of 0.8076 and a public score of 0.7972. I started working through the iteration notebook but paused so I could focus on more immediate concerns. Whilst the competition finished in June, it’s still possible to make late submissions. Although no points or ranks will be awarded, the value of the learning experience is immeasurable.

For example, just from attempting to reproduce Jeremy’s insights I learned that code competitions on kaggle don’t allow internet access. This means that training & inference must occur in separate notebooks. Finding out how to do this was non-trivial as there weren’t super clear examples I could adapt or perhaps I need to get better at searching the platform 🤔 Nonetheless, I remained tenacious and found a notebook which showed me how to save a model. I also had to find out how to reload and use my pretrained model. This search could have been more efficient if I’d stepped back, rested, progressed on another task then come back with fresh eyes.
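To show the general save-then-reload idea, here is a minimal sketch using only the standard library. It is an illustration under assumptions, not the code from my notebooks: `MeanBaseline` is a toy model, the file name is made up, and a temporary folder stands in for the place a training notebook would write its output. A transformer model would normally use its own library's save and load methods instead of `pickle`.

```python
import pickle
import tempfile
from pathlib import Path


class MeanBaseline:
    """Toy model (hypothetical): predicts the mean of the training targets."""

    def fit(self, targets):
        self.mean_ = sum(targets) / len(targets)
        return self

    def predict(self, n_rows):
        return [self.mean_] * n_rows


def save_model(model, path: Path) -> None:
    """Serialise a fitted model to disk."""
    with path.open("wb") as f:
        pickle.dump(model, f)


def load_model(path: Path):
    """Load a previously saved model; only unpickle files you trust."""
    with path.open("rb") as f:
        return pickle.load(f)


# "Training notebook": fit the model and write it to an output folder.
model_dir = Path(tempfile.mkdtemp())
model_path = model_dir / "model.pkl"
save_model(MeanBaseline().fit([0.0, 0.5, 1.0]), model_path)

# "Inference notebook": no training and no internet, just load and predict.
restored = load_model(model_path)
predictions = restored.predict(3)
print(predictions)
```

One catch with `pickle` is that the class definition has to be available wherever the model is reloaded, so in two genuinely separate notebooks `MeanBaseline` would need to be defined or imported in both.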

With this knowledge, the pipeline started taking shape with separate training and inference notebooks. Whilst I’m not sure the code I’ve got is the best way to do it, I was happy that I got something working and I know more iteration will yield dividends. I still need to figure out how to manage model updating and how to clear and organise the kaggle folder. I will share the notebooks when I’ve refined them further.

I also came up with a framework for approaching competitions. This will help me track my development time, experiments, etc. much better. The stages I identified are:

Exploration – exploring the data, making my first submission with base model
Experimentation – trying different models, hyperparameter tuning, ensembling
Resolution – identifying and refining candidate models further, choosing top model
Presentation – tidying up the notebook, adding comments, references, crediting sources
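One lightweight way to track time and experiments against the four stages above is a small log like the sketch below. The class and field names are hypothetical, the sample entries are made-up placeholders, and it assumes a higher score is better.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum
from typing import List, Optional


class Stage(Enum):
    EXPLORATION = "exploration"
    EXPERIMENTATION = "experimentation"
    RESOLUTION = "resolution"
    PRESENTATION = "presentation"


@dataclass
class Entry:
    stage: Stage
    note: str
    minutes: int
    score: Optional[float] = None  # leaderboard or validation score, if any


@dataclass
class CompetitionLog:
    entries: List[Entry] = field(default_factory=list)

    def record(self, stage, note, minutes, score=None):
        self.entries.append(Entry(stage, note, minutes, score))

    def minutes_per_stage(self):
        """Total time spent in each stage that has at least one entry."""
        totals = defaultdict(int)
        for entry in self.entries:
            totals[entry.stage] += entry.minutes
        return dict(totals)

    def best_entry(self):
        """Highest-scoring entry, or None if nothing has been scored yet."""
        scored = [e for e in self.entries if e.score is not None]
        return max(scored, key=lambda e: e.score) if scored else None


# Placeholder entries purely for illustration.
log = CompetitionLog()
log.record(Stage.EXPLORATION, "base model submission", minutes=90, score=0.70)
log.record(Stage.EXPERIMENTATION, "tuned model", minutes=120, score=0.75)
log.record(Stage.EXPERIMENTATION, "ensemble", minutes=60, score=0.78)
log.record(Stage.PRESENTATION, "tidy notebook", minutes=45)

print(log.minutes_per_stage())
print(log.best_entry().note)
```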

Inspiration

Hugging face

My enthusiasm for Hugging Face continues to grow! I attended their superb How to teach open-source ML tools workshop. I learned more about making helpful model cards, how impactful demos can be and how the Hugging Face for classrooms initiative can be used to help others get into machine learning. This really helped me clarify the direction of a future project I wish to start for the community. I’ll keep you posted 🤗

Two Minute Papers

My favourite research development recently is from the Google AI research team. This incredible merger of OpenAI’s GPT3, computer vision & reinforcement learning based robotics is the first compelling example of a robot butler I’ve ever seen. Google’s butler can:

Be given a problem in natural language e.g. I’ve spilled my drink. What should I do?
Work out the high level solution e.g. The spill should be cleaned
Identify the individual steps for this to happen e.g. Find cleaning products, go to the spill, wipe the spill
Check to see if steps can be followed e.g. look at itself to see if it has usable arms, check the surroundings for required products
Execute the required steps e.g. picking the products, cleaning the spill

Given the rapid progress from paper to paper, it’s only a matter of time until more complicated tasks will be possible e.g. making my favourite omelette given all the required ingredients 🙏🏾

This is beyond lit

Lex Fridman Podcast

When I’ve wanted to chill out but still get some AI intel, I’ve been listening to the excellent Lex Fridman podcast. I highly recommend the following:

Episode 299 ~ Demis Hassabis: DeepMind – AI, super intelligence & the future of humanity
Episode 215 ~ Wojciech Zaremba: OpenAI Codex, GPT-3, robotics, and the future of AI

Miscellaneous

Below are resources I have found useful for refreshing and extending my knowledge in various ways

Articles

Twitter threads

Videos

Product

Machine learning flashcards by Chris Albon

What am I exploring now?

Datacamp’s data scientist with python career track
Radek’s meta learning book
Chip’s introduction to machine learning interviews book

What did I learn from the challenges I’ve conquered?

Consistency beats intensity so optimise for consistency (it’s a marathon not a sprint)
Speaking out loud when learning helps massively with comprehension (it’s scientifically proven)
Rely on community to keep myself engaged (recently got my first accountability buddy and my first mentee)

My boss sent me this a month ago when I told her I was frustrated with my output. Reminder. pic.twitter.com/7z3g5Syac9
— InfoSteph (@StephandSec) June 8, 2022

Ride the wave, just ride the wave

What are my next steps?

Projects
- Finish finetuned GPT3 endeavour
Kaggle
- Progress further with phrase matching competition
- Start exploring the amex competition
Numerai
- Complete working through the numerai starter pack
Open source
- Make more open source contributions

Thanks for reading my latest update 🤖 Want to connect? Or see more frequent updates of my journey? Then feel free to reach out and follow me on the platform of your choice 😄

Trolley problem solved:pic.twitter.com/SZKF9ngMvP
— hardmaru (@hardmaru) April 15, 2022

In other news, the trolley problem has been solved. Huzzah 😂