Want better coffee for your store?

Coffee Lover

Nowadays the selection of coffee is growing quickly in many countries, and coffee consumption is growing with it! With so many coffees to choose from, it is hard to judge the quality of the beans. The Coffee Quality Institute (CQI) helps by measuring the quality of beans from around the world. That is why we can use data from the Coffee Quality Institute’s review pages, scraped in January 2018 and shared in this GitHub repository.

So I chose this data set to explore some of the details behind third wave coffee.


“truly a way of appreciating a quality product”, Matt Milletto of Water Avenue Coffee

“The term ‘third wave coffee’ is a label given to coffee businesses opened between 2000 and now and which share a similar mission statement or goal; to deliver high quality coffee”, Wikipedia

So it is a new way to enjoy coffee!

We come to this data with three questions:

- Which country or region is the best source of beans for roasting?
- Does high acidity lead to a high score?
- Can we predict the cup score, as CQI does, from some of this data?

Let’s start with some statistics on coffee by country.

Mexico sends a lot of coffee for review.

Ethiopian coffee scores very well in these statistics.

But for the mean score, we can see that the USA does better.

Ethiopia has the best cup scores, but if we look at the mean score, the USA performs better. What if we look at regions instead of countries? Then we can see that the America region (North, Central, and South) produces a lot of goooood coffee! So if we can choose, we choose Ethiopian beans; if we can’t, we will surely choose beans from the America region!
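The three views compared above (how many coffees a country sends, its best score, and its mean score) can be sketched in a few lines. This is a minimal, dependency-free sketch: the rows are made-up placeholders chosen only to mirror the pattern described above, not real CQI values, and the function name `summarize_by_country` is hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical rows: (country, total cup score). Not real CQI values.
reviews = [
    ("Mexico", 81.0),
    ("Mexico", 83.0),
    ("Mexico", 82.0),
    ("Ethiopia", 90.0),
    ("Ethiopia", 84.0),
    ("USA", 88.0),
    ("USA", 87.0),
]

def summarize_by_country(rows):
    """Return {country: {"count", "max", "mean"}} for the cup scores."""
    scores = defaultdict(list)
    for country, score in rows:
        scores[country].append(score)
    return {
        country: {"count": len(vals), "max": max(vals), "mean": mean(vals)}
        for country, vals in scores.items()
    }

summary = summarize_by_country(reviews)

# Print countries from highest to lowest mean score.
for country in sorted(summary, key=lambda c: summary[c]["mean"], reverse=True):
    print(country, summary[country])
```

Sorting by `"count"` or `"max"` instead of `"mean"` gives the other two rankings, which is why the "best" country depends on which statistic we pick.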

What about sour taste?

Acidity score by country
Heat map of the score criteria
What about other stats, like flavor?

As we can see, high acidity tends to go with a higher cup score, and even a high-scoring country like Ethiopia also has high acidity.
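One way to put a number on the link between acidity and cup score is a Pearson correlation. The sketch below is only an illustration: the two lists are made-up values, and the function name `pearson` is hypothetical.

```python
from math import sqrt

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

# Hypothetical acidity and total cup scores. Not real CQI values.
acidity = [7.0, 7.5, 8.0, 8.5]
cup_score = [80.0, 82.0, 85.0, 88.0]

r = pearson(acidity, cup_score)
print(f"correlation between acidity and cup score: {r:.3f}")
```

A value near 1 only says the two scores move together. If acidity is one of the criteria that feed the total cup score, some positive correlation is expected by construction.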

Can we use these data to predict cup score?

We use region, processing type, harvest year, altitude, and acidity tier to predict the cup score. Most of these details can be easily identified by a customer.

For region, we group the countries into three regions: America, Africa, and Asia.
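The region grouping can be sketched as a simple lookup table. The mapping below is a partial, assumed example (the full list of countries in the data is not shown here), and `COUNTRY_TO_REGION` and `to_region` are hypothetical names.

```python
# Assumed, partial mapping from country of origin to the three regions.
COUNTRY_TO_REGION = {
    "Mexico": "America",
    "USA": "America",
    "Colombia": "America",
    "Brazil": "America",
    "Ethiopia": "Africa",
    "Kenya": "Africa",
    "Indonesia": "Asia",
    "Thailand": "Asia",
}

def to_region(country):
    """Return the region for a country, or None if it is not in the mapping."""
    return COUNTRY_TO_REGION.get(country.strip())

print(to_region("Ethiopia"))
print(to_region("Mexico"))
```

Returning `None` for unknown countries makes it easy to spot rows that still need a region before training.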

For acidity, we split the scores into four levels.
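Splitting a numeric acidity score into four tiers could look like the sketch below. The cut points and tier names are assumptions made for illustration; the actual thresholds are not stated here.

```python
from bisect import bisect_right

# Assumed cut points on the acidity score (not taken from the article).
ACIDITY_CUTS = [7.0, 7.5, 8.0]
ACIDITY_TIERS = ["low", "medium", "high", "very high"]

def acidity_tier(score):
    """Map an acidity score to one of four tiers using the cut points."""
    # bisect_right counts how many cut points are <= score,
    # which is exactly the index of the tier.
    return ACIDITY_TIERS[bisect_right(ACIDITY_CUTS, score)]

for s in (6.5, 7.2, 7.8, 8.4):
    print(s, acidity_tier(s))
```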

For processing type, we create dummy variables for natural, washed, and honey, the basic techniques used by farmers in our data.
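Dummy variables turn one text column into several 0/1 columns, one per processing type. The sketch below assumes the labels are spelled `natural`, `washed`, and `honey`; the real labels in the data may differ, and `process_dummies` is a hypothetical helper.

```python
# Assumed spellings of the three processing types.
PROCESS_TYPES = ["natural", "washed", "honey"]

def process_dummies(method):
    """Return one 0/1 flag per processing type for a single coffee."""
    cleaned = method.strip().lower()
    return {f"process_{p}": int(cleaned == p) for p in PROCESS_TYPES}

print(process_dummies("Washed"))
print(process_dummies("honey"))
```

A method outside the three types gets all zeros, so it acts as an implicit "other" category.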

The finished data looks like this:

Then we put this data through our pipeline and train the model.
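To make the training step concrete, here is a minimal linear regression fitted with the normal equations in plain Python. It is a stand-in for whatever library the actual pipeline used, not a copy of it. The two features (an Africa flag and a numeric acidity tier) and all values are made up, and `fit_linear` and `predict` are hypothetical names.

```python
def fit_linear(X, y):
    """Ordinary least squares via the normal equations (X'X) b = X'y.

    X is a list of feature rows; an intercept column is added here.
    Returns [intercept, coef_1, ..., coef_k].
    """
    rows = [[1.0] + list(r) for r in X]
    k = len(rows[0])
    # Build the augmented matrix [X'X | X'y].
    A = [
        [sum(r[i] * r[j] for r in rows) for j in range(k)]
        + [sum(r[i] * t for r, t in zip(rows, y))]
        for i in range(k)
    ]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(k):
        pivot = max(range(col, k), key=lambda i: abs(A[i][col]))
        if abs(A[pivot][col]) < 1e-12:
            raise ValueError("features are collinear; drop one column")
        A[col], A[pivot] = A[pivot], A[col]
        p = A[col][col]
        A[col] = [v / p for v in A[col]]
        for i in range(k):
            if i != col:
                factor = A[i][col]
                A[i] = [a - factor * b for a, b in zip(A[i], A[col])]
    return [A[i][k] for i in range(k)]

def predict(coefs, row):
    """Predicted score for one feature row."""
    return coefs[0] + sum(c * v for c, v in zip(coefs[1:], row))

# Hypothetical features: [is_africa (0/1), acidity tier (0..3)].
X = [[0, 0], [1, 1], [0, 2], [1, 3], [0, 1]]
# Hypothetical cup scores. Not real CQI values.
y = [80.0, 83.5, 83.0, 86.5, 81.5]

coefs = fit_linear(X, y)
print(coefs)
print(predict(coefs, [1, 2]))
```

With dummy variables, one category per group has to be left out (or the intercept dropped); otherwise the columns are collinear and the fit raises the error above.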

The resulting RMSE is 2.778, which is not very accurate. This suggests that the selected features are not a good choice for predicting the cup score with a linear regression model, so we would do better with another model or more specific features.
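RMSE (root mean squared error) is the square root of the average squared gap between the real and predicted scores, so it is in the same units as the cup score. The sketch below uses made-up numbers and a hypothetical `rmse` helper, and compares the model with a baseline that always predicts the mean score.

```python
from math import sqrt

def rmse(actual, predicted):
    """Root mean squared error between two equal-length lists."""
    if not actual or len(actual) != len(predicted):
        raise ValueError("inputs must be non-empty and the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))

# Hypothetical scores and predictions. Not real CQI values.
actual = [82.0, 84.0, 86.0, 80.0]
model_pred = [83.0, 83.0, 84.0, 82.0]

# Baseline: always predict the mean of the actual scores.
mean_score = sum(actual) / len(actual)
baseline_pred = [mean_score] * len(actual)

model_rmse = rmse(actual, model_pred)
baseline_rmse = rmse(actual, baseline_pred)
print(f"model RMSE: {model_rmse:.3f}, baseline RMSE: {baseline_rmse:.3f}")
```

Comparing against a baseline like this is one way to judge whether an RMSE such as 2.778 is good: if the scores themselves only vary by a few points, an error of that size leaves little room over simply guessing the mean.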

Let’s try our model.

The prediction is close, but still not good enough to use in practice.

It is so much fun to play with data like this. You can try it out at




Buddy Siriskaowkul
