This past week I was trying to solve a tricky MySQL database problem involving ORDER, GROUP, subqueries, and aggregated data. The problem I needed to solve in essence contained a single table with rows of data that had columns containing a foreign key, and a DATE. The other columns are irrelevant to this discussion.
So to put the problem in perspective, I’ll explain the overall schema I was dealing with. There are two entities involved, each represented by their own data table. I have a table of tanks, and the tanks are related to a table of many inventories that are taken at random times from these tanks. For sake of brevity, these tanks contain liquids that we measure at different time intervals. With each tank reading, we record the level of liquid found and the date we took the measurement. Monthly reports get generated from this data where the idea is to find inventories closest to the end of the month. We want to find inventory dates closest to the last day of the month, but no further past the 8th day of the next month.
So for example, I want to create a report for November 2023. The data in this case is reporting for 2023-11-30, the last day of November. Tank 1 has readings on 2023-11-27 and 2023-12-02. Tank 2 has readings from 2023-11-09 and 2023-12-10. The data I want to get when I query for each tank then needs to follow the rules above.
In this case, Tank 1 would return an Inventory from 2023-12-02. December 2nd is closer to November 30th than November 27th by a day, so we get that as a result.
Tank 2 returns Inventory from 2023-11-09. November 9th is further from November 30th than December 10th, but December 10th goes beyond our 8 day threshold so we get back the earlier date in this case.
For simplicity, the data tables can be reduced to look like this:
id | tank_id | level | date |
1 | 2 | 329 | 2023-11-09 |
2 | 1 | 410 | 2023-11-27 |
3 | 1 | 399 | 2023-12-02 |
4 | 2 | 186 | 2023-12-10 |
id | name |
1 | Tank 1 |
2 | Tank 2 |
So the first step I took here involved finding the recent inventories for each tank. Using the constraints in the problem, I put together a query that looks like this:
select tank_id, MIN(ABS(DATEDIFF(date, "2023-11-30 23:59:59"))) as min_days_away from `inventories` where `tank_id` in (?, ?) and month(`date`) >= ? and date(`date`) <= ? group by `tank_id`
Running this query then give me the two results I’m looking for.
tank_id | min_days_away |
1 | 2 |
2 | 21 |
My initial approach to selecting data in this subquery was not taking me in the right direction. At first, I was also trying to capture the date in the inventories table so that I could select the inventory I wanted by date in my main query. The problem was that when I tried adding date to the SELECT, I had to then add data to the GROUP BY. Grouping by both the tank_id and date would then give me unique groups for every tank and date combination within the constraints. It will not give just the date closest to the end of the month, because each unique tank_id/date combination is the closest to the end of the month. I need to just group this by the tank.
The problem I found was that having the tank_id and min_days_away, I didn’t initially see how I could map the min_days_away in my outer query so I could join the inventories to all the other data tables I needed. The answer is much simpler than I thought, and ChatGPT knew it right away.
I went astray because every answer I found with Google wasn’t using multiple functions to aggregate data. The example I use here has to find a date within constraints, so I nested MIN(ABS(DATEDIFF())) function calls to find the minimum days away. All of the examples found with Google used a simpler constraint, like find a student with the lowest grade. In this instance, the SELECT statement would look more like:
select student_id, min(grade) as min_grade from `grades` group by student_id
If I’m writing a report that needs to find which test shows the students’ lowest grade, I now have student_id and a min_grade. With this data, I can search my grades table to get the test the student scored lowest on.
The data from a situation like this returns min_grade and that can immediately be referenced in the outer query like this:
select `lowest`.`student_id`, `lowest`.`min_grade`, `grades`.`test_id` from grades inner join (select student_id, min(grade) as min_grade from `grades` group by student_id) as `lowest` on `lowest`.`min_grade` = `grades`.`grade`
student_id | min_grade | test_id |
1 | 80 | 1 |
2 | 90 | 2 |
3 | 40 | 1 |
With an example like the students and grades mentioned above, it’s clear to see that I can now query the students table and join the grades table where grades.grade = min_grade to find out the test and other information that I might need. There is a direct one-to-one mapping. I didn’t initially see the relationship I was looking for in my tanks/inventories problem because I was being given back min_days_away, and I had no where to map that on my inventories table. So I took this problem further off track by trying to group the tank_id and date so I could map a min_date with date.
It wasn’t long until I took this frustration to ChatGPT and explained what I was trying to do.
The solution was right in front of me. I have the min_days here:
MIN(ABS(DATEDIFF(date, "2023-11-30 23:59:59")))
It’s the same thing as in the students example above, except I’m not mapping it directly to the column. I need to map to a function calling another function on a column.
select `recent`.`tank_id`, `inventories`.`date`, `recent`.`min_days_away` from `inventories` inner join (select tank_id, MIN(ABS(DATEDIFF(date, "2023-11-30 23:59:59"))) as min_days_away from `inventories` where `tank_id` in (?, ?) and month(`date`) >= ? and date(`date`) <= ? group by `tank_id`) as `recent` on `recent`.`min_days_away` = ABS(DATEDIFF(date, "2023-11-30 23:59:59")) and `inventories`.`tank_id` = `recent`.`tank_id`
I still have the one to one mapping I need to join my outer query. The answer is basically the same as the students/grades example. I think sometimes with the joins I overthink it and it usually makes sense to stop and take a step back. I used to try goofy things like overriding rules in SQL (ERROR 1055 (42000): Expression #2 of SELECT list is not in GROUP BY clause and contains nonaggregated column …) to get around these things.
Now I learn to do it properly, and write about it to fully understand the experience. If you read this, I hope it helped.