===== Large Image Archives =====
Since around 2005 my mum has taken an average of 3,200 pictures a year. That's after I removed all duplicate files. I'm still deciding on what software to suggest she uses to manage her archive of around 43k photos. In the meantime I wanted to generate some stats with the goal of making a simple interface for her to browse the backup I have of her archive (by date and maybe location).
I've already removed duplicates, but lets save an md5 of the image to help stop duplicates in future imports and also keep a record of any duplicates we deleted (there might be valuable information in the directory path of those deleted duplicates that'll help us search/organise in the future):
// I've trimmed some database connection stuff and some code that fills the $dupe_arr variable
$data = file_get_contents("list-of-images.txt");
$lines = explode("\n", $data);
foreach($lines as $line) {
$line = trim($line);
$filename = basename($line);
$sql = "insert into images set relative_path=?, original_filename=?, md5sum=?, created_on=now(), extra_info=?";
$md5sum = md5_file($line);
$extra_info = "";
if(!empty($dupe_arr[$line]))
$extra_info = implode("\n", $dupe_arr[$line]);
$stmt = $db->prepare($sql);
$stmt->execute([$line, $filename, $md5sum, $extra_info]);
echo ".";
}
Now we have a list of 50k images in a database that has this simple schema:
mysql> describe images;
+-------------------+---------------+------+-----+---------+----------------+
| Field | Type | Null | Key | Default | Extra |
+-------------------+---------------+------+-----+---------+----------------+
| id | int(11) | NO | PRI | NULL | auto_increment |
| relative_path | varchar(1000) | YES | | NULL | |
| original_filename | varchar(1000) | YES | | NULL | |
| md5sum | varchar(100) | YES | | NULL | |
| created_on | datetime | YES | | NULL | |
| extra_info | varchar(2000) | YES | | NULL | |
+-------------------+---------------+------+-----+---------+----------------+
Now some more code that'll extract all the exif data it can find from these images. This just runs the exif command (''apt install exif'') with the ''-m'' parameter which, by default, outputs all tags in a machine readable (tabbed) format. Then saves it to a database table:
query($sql);
while($row = $stmt->fetch()) {
$image_id = $row['id'];
$filename = $row['relative_path'];
$data = `exif -m "$filename"`;
$lines = explode("\n", $data);
foreach($lines as $line) {
if($dline = explode("\t", $line)) {
if(!empty($dline[0])) {
$sql = "insert into exif_meta set image_id=?, field_name=?, field_value=?";
$stmt_ins = $db->prepare($sql);
$name = $dline[0];
$value = $dline[1];
$stmt_ins->execute([$image_id, $name, $value]);
}
}
}
}
?>
This populates a very simple table with this schema (I was just curious to look at the data - I'll extract what I need from here, or maybe add some indexes):
mysql> describe exif_meta;
+-------------+---------------+------+-----+---------+-------+
| Field | Type | Null | Key | Default | Extra |
+-------------+---------------+------+-----+---------+-------+
| image_id | int(11) | YES | | NULL | |
| field_name | varchar(1000) | YES | | NULL | |
| field_value | varchar(2000) | YES | | NULL | |
+-------------+---------------+------+-----+---------+-------+
==== Stats ====
Now, lets see what we got. 1.8 million rows for those ~50k images. Let's get actual.
| **Number of images** | 43,545 | ''select count(id) from images'' |
| **Images with EXIF data** | 33,945 | '' select count(distinct image_id) from exif_meta'' |
So, 9,600 images didn't have readable EXIF data. These are either corrupt files or images that have been resized and stripped of EXIF data or downloaded from somewhere that stripped it - or the camera just didn't support any type of EXIF at all, but that seems unlikely for the date period.
How many tags per image? **About 46**
| **Average** | 46.5 | ''select avg(count) from (select count(field_name) as count from exif_meta group by image_id) as counts'' |
| **Mean** | 48 | ''select count,count(*) from (select count(field_name) as count from exif_meta group by image_id) as counts group by count order by count(*)''|
| **Min** | 8 | ''as above'' |
| **Max** | 59 | '' as above |
==== EXIF ====
What did we get and how common are particular EXIF tags? **83 unique tags**. The most popular being:
^ field_name ^ number of tags ^
| X-Resolution | 67197 |
| Y-Resolution | 67197 |
| Resolution Unit | 67197 |
| Orientation | 39986 |
| YCbCr Positioning | 35871 |
| Software | 34891 |
| Date and Time | 34414 |
| Manufacturer | 34299 |
| Model | 34299 |
| Exif Version | 33945 |
| FlashPixVersion | 33945 |
| Colour Space | 33945 |
| Pixel X Dimension | 33664 |
| Pixel Y Dimension | 33664 |
| Components Configuration | 33662 |
| Scene Capture Type | 33266 |
| Compression | 33252 |
| ThumbnailSize | 33252 |
| Flash | 32324 |
| F-Number | 32322 |
| Metering Mode | 32322 |
| Date and Time (Digitised) | 32314 |
| Date and Time (Original) | 32289 |
| Exposure Program | 32249 |
| White Balance | 32244 |
| Exposure Mode | 32171 |
| Exposure Time | 31935 |
| Focal Length | 31933 |
| Maker Note | 30171 |
| Focal Length in 35mm Film | 29869 |
| ISO Speed Ratings | 27752 |
| Exposure Bias | 27238 |
| Scene Type | 20581 |
| Aperture | 20432 |
| Sensing Method | 20358 |
| Shutter Speed | 20043 |
| Digital Zoom Ratio | 18824 |
| Brightness | 17795 |
| Interoperability Index | 16090 |
| Interoperability Version | 16090 |
| Light Source | 16051 |
| Maximum Aperture Value | 16049 |
| File Source | 16049 |
| Sharpness | 16044 |
| Compressed Bits per Pixel | 15987 |
| Subject Area | 14918 |
| Sub-second Time (Original) | 14448 |
| Sub-second Time (Digitised) | 14448 |
| Custom Rendered | 13372 |
| Saturation | 12196 |
| Contrast | 11890 |
| PRINT Image Matching | 11717 |
| Gain Control | 9905 |
| Image Description | 8244 |
| North or South Latitude | 7078 |
| Latitude | 7078 |
| East or West Longitude | 7078 |
| Longitude | 7078 |
| GPS Time (Atomic Clock) | 7078 |
| Altitude Reference | 6904 |
| Altitude | 6904 |
| GPS Image Direction Reference | 6443 |
| GPS Image Direction | 6443 |
| Copyright | 4159 |
| GPS Date | 2284 |
| Speed Unit | 2281 |
| Speed of GPS Receiver | 2281 |
| User Comment | 2174 |
| Reference for Bearing of Destination | 2117 |
| Bearing of Destination | 2117 |
| Document Name | 1127 |
| Related Sound File | 311 |
| Exposure Index | 311 |
| Image Width | 99 |
| Image Length | 99 |
| Sub-second Time | 74 |
| Focal Plane X-Resolution | 74 |
| Focal Plane Y-Resolution | 74 |
| Focal Plane Resolution Unit | 74 |
| Padding | 56 |
| Measurement Precision | 53 |
| Subject Distance Range | 53 |
| Subject Distance | 2 |
I'm not sure why we got more of some tags (X-Resolution, Y-Resolution and Resolution Unit) - they must have been repeated in the output of ''exif -m''. The rest look accurate though. Let's dig into that now.
==== Data ====
What can we extract at a glance? When were the photos taken?
=== Date ===
select year(field_value) as year,count(image_id) as number_of_pics from exif_meta where field_name='date and time' group by year(field_value) order by year(field_value);
^ year ^ number_of_pics ^
| NULL | 2 |
| 2004 | 2 |
| 2005 | 2404 |
| 2006 | 1551 |
| 2007 | 495 |
| 2008 | 431 |
| 2009 | 565 |
| 2010 | 1044 |
| 2011 | 962 |
| 2012 | 1862 |
| 2013 | 2559 |
| 2014 | 3278 |
| 2015 | 5273 |
| 2016 | 6763 |
| 2017 | 1463 |
| 2018 | 4265 |
| 2019 | 1184 |
| 2026 | 311 |
That looks about right. At one point the camera must have had it's time and date set to the future.
=== Manufacturer ===
What type of camera did she use?
select count(image_id), field_value from exif_meta where field_name='Manufacturer' group by field_value order by count(image_id);
^ count(image_id) ^ field_value ^
| 2 | |
| 53 | NIKON |
| 62 | OLYMPUS IMAGING CORP. |
| 74 | Canon |
| 120 | Eastman Kodak Company |
| 311 | Samsung Techwin |
| 1315 | CASIO COMPUTER CO.,LTD |
| 3848 | SAMSUNG |
| 3970 | SONY |
| 8355 | CASIO COMPUTER CO.,LTD. |
| 16189 | Apple |
That also looks about right, most of the photos were taken on a phone.
==== Other EXIF ====
Let's take an overview of what some EXIF tags contain:
=== EXIF: X-Resolution ===
select count(image_id), field_value from exif_meta where field_name='X-Resolution' group by field_value order by count(image_id);
^ count(image_id) ^ field_value ^
| 2 | 150 |
| 2 | 75 |
| 2 | 96 |
| 62 | 314 |
| 106 | 300 |
| 311 | 1 |
| 311 | 96.0 |
| 1985 | 350 |
| 64416 | 72 |
=== EXIF: Orientation ===
select count(image_id), field_value from exif_meta where field_name='Orientation' group by field_value order by count(image_id);
^ count(image_id) ^ field_value ^
| 178 | Left-bottom |
| 1139 | Bottom-right |
| 3841 | Right-top |
| 34828 | Top-left |