Coffee choice is growing fast in many countries these days, and people are drinking far more of it! With so many coffees to choose from and bean quality so hard to judge, the Coffee Quality Institute (CQI) has become part of how bean quality is measured around the world. That is why we use data from the Coffee Quality Institute’s review pages, scraped in January 2018 and available from this GitHub repository.

So I chose this data set to explore the details of third wave coffee.


truly a way of appreciating a quality product, Matt Milletto of Water Avenue Coffee

The term ‘third wave coffee’ is a label given to coffee businesses opened between 2000 and now and which share a similar mission statement or goal; to deliver high quality coffee, Wikipedia

So it is a new way to enjoy coffee!

We come to this data with 3 questions:

- Which country or region is the best source of beans for roasting?
- Does high acidity lead to a high score?
- Can we predict the cup score the way CQI does, using some of this data?

Let’s start with some statistics on coffee by country.

Mexico sends a lot of coffee for review.

Ethiopian coffee scores very well in the stats,

but on mean score we can see that the USA does better.

Ethiopia has the best single cup score, but on mean score the USA performs better. If we look at regions rather than countries, the Americas (North, Central and South) produce a lot of good coffee! So if we can choose, we choose Ethiopian beans, and if we can’t, we will surely choose beans from the Americas!
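To make the comparison concrete, here is a minimal sketch of how the three per-country numbers (how many coffees were reviewed, the best score, and the mean score) could be computed. It uses only the Python standard library. The function name `summarize_by_country` and the sample scores are made up for illustration and are not values from the CQI data.

```python
from collections import defaultdict
from statistics import mean


def summarize_by_country(rows):
    """Return {country: {"count", "max", "mean"}} from (country, score) pairs."""
    scores = defaultdict(list)
    for country, score in rows:
        scores[country].append(score)
    return {
        country: {"count": len(vals), "max": max(vals), "mean": mean(vals)}
        for country, vals in scores.items()
    }


# Made-up scores for illustration only, NOT values from the CQI data.
sample = [
    ("Mexico", 82.0), ("Mexico", 80.0), ("Mexico", 81.0),
    ("Ethiopia", 90.0), ("Ethiopia", 84.0),
    ("United States", 88.0),
]

summary = summarize_by_country(sample)

# Rank countries by mean score, highest first.
ranking = sorted(summary, key=lambda c: summary[c]["mean"], reverse=True)
```

Looking at count, max and mean side by side shows why the "best" country depends on the question: a country with one excellent review can top the mean while another has the single best cup.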

What about sour taste?

Acidity score by country
Heat map of the scoring criteria
What about other stats, like flavor?

As we can see, higher acidity tends to go with a higher cup score, and a high-scoring country like Ethiopia also scores high on acidity.
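One way to put a number on that relationship is the Pearson correlation between the acidity score and the cup score, which is also what a heat map of the criteria typically shows. The sketch below is a plain-Python version under that assumption. The `pearson` helper and the sample values are illustrative, not taken from the data set.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of the same length (at least 2)")
    mean_x = sum(xs) / len(xs)
    mean_y = sum(ys) / len(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Made-up values for illustration only, NOT values from the CQI data.
acidity = [7.0, 7.5, 8.0, 8.5]
cup_score = [80.0, 83.0, 82.0, 86.0]

r = pearson(acidity, cup_score)  # close to +1 means the two rise together
```

A value near +1 says the two scores move together. It does not by itself say that acidity causes the higher score.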

Can we use these data to predict cup score?

We use region, processing type, harvest year, altitude and acidity tier as features. Most of them are easy for a customer to identify, so the idea is to predict the cup score from these details.

For region, we split the data into 3 regions: America, Africa and Asia.

For acidity, we split the scores into 4 levels.

For processing type, we create dummy variables such as natural, washed and honey, the basic techniques farmers use according to our data.
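The three encoding steps above can be sketched in a few lines of plain Python. Everything specific here is an assumption for illustration: the acidity cut points in `ACIDITY_CUTS`, the partial country lookup in `REGION_OF`, and the helper names are hypothetical, since the post does not give the real thresholds or mapping.

```python
from bisect import bisect_right

REGIONS = ["America", "Africa", "Asia"]
PROCESSES = ["natural", "washed", "honey"]

# Hypothetical cut points on the acidity score (gives tiers 0, 1, 2, 3).
ACIDITY_CUTS = [7.0, 7.5, 8.0]

# Partial, illustrative country-to-region lookup.
REGION_OF = {
    "Mexico": "America",
    "United States": "America",
    "Ethiopia": "Africa",
    "Indonesia": "Asia",
}


def acidity_tier(score):
    """Map an acidity score to one of 4 tiers using the assumed cut points."""
    return bisect_right(ACIDITY_CUTS, score)


def one_hot(value, categories):
    """Dummy-encode value as a 0/1 list with one slot per category."""
    return [1 if value == c else 0 for c in categories]


def encode(country, process, acidity):
    """Build a feature row: region dummies + process dummies + acidity tier."""
    region = REGION_OF[country]
    return one_hot(region, REGIONS) + one_hot(process, PROCESSES) + [acidity_tier(acidity)]


row = encode("Ethiopia", "washed", 8.2)
```

Harvest year and altitude are already numeric, so they could be appended to the row as they are.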

The finished data looks like this:

So we put this data together in our pipeline and train the model.

The resulting RMSE is 2.778, so the model isn’t very accurate. This suggests the selected features aren’t a good choice for prediction with a linear regression model, so we would do better with another model or more specific features.
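For readers new to the metric: RMSE is the square root of the mean squared difference between the real and the predicted cup scores, so it is measured in cup-score points. A rough way to judge a value like this is to compare it with a naive baseline that always predicts the mean score. The sketch below shows both under that assumption. The numbers are made up for illustration and are not the model's actual predictions.

```python
from math import sqrt


def rmse(actual, predicted):
    """Root mean squared error between two equal-length sequences."""
    if len(actual) != len(predicted) or not actual:
        raise ValueError("need two non-empty sequences of the same length")
    squared = [(a - p) ** 2 for a, p in zip(actual, predicted)]
    return sqrt(sum(squared) / len(squared))


# Made-up scores for illustration only, NOT the model's real output.
actual = [82.0, 84.0, 80.0, 86.0]
predicted = [83.0, 82.0, 81.0, 84.0]

model_error = rmse(actual, predicted)

# Naive baseline: always predict the mean of the actual scores.
mean_score = sum(actual) / len(actual)
baseline_error = rmse(actual, [mean_score] * len(actual))
```

If the model's RMSE is not clearly below the baseline's, the features are adding little beyond the average score.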

Let’s try our model.

Close, but still not good enough for practical use.

It’s so much fun to play with data like this. You can try it out at




