Geek Culture / improving search results of weird filenames?

Phaelax
DBPro Master
19
Years of Service
User Offline
Joined: 16th Apr 2003
Location: Metropia
Posted: 5th May 2017 02:43 Edited at: 5th May 2017 02:45
I'm working on a little home project, meaning the PHP library I've written won't be released (it's a scraper for a popular movie info site). I'm trying to get simple info about each of the movies stored locally on my computer; however, the filenames aren't all that pretty. If it's just the name, like "Hackers", then search results are pretty effective. Even "Hackers (1995)" is OK. But some files have other garbage in the filename, such as extra date information like a timestamp, or the type of encoding. Here's an example: "Run.Ronnie.Run[2002].Eng.divx.avi"

Some use periods as delimiters, others use spaces, some use dashes. I might have a few dozen movies following one naming format, then a few dozen more following something else. I'd rather not have to go through and rename all 200+ movie files, although I'll be more careful about how I rip in the future and stick to one scheme. I've already done that with TV episodes, and it took me hours of manual input to follow the SxxExx scheme.


Basically, I'm looking for some ideas on how I might strip out some of the garbage in hopes of improving the search. Video Station on my QNAP manages to find search results for about 90% of my files as is, and I'm technically using the same source they do for looking up movie titles. I just don't know how they managed to get that many titles correct given the filenames.
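
For what it's worth, one possible way to strip the junk is sketched below. It's Python rather than PHP, and it is only a guess at a workable heuristic, not how Video Station does it: treat dots, dashes, underscores and brackets as spaces, pull out a four-digit year, and throw away everything after the year. The `clean_title` name and the `JUNK_TOKENS` list are made up for illustration.

```python
import re

# Hypothetical junk words; extend the set to suit your own collection.
JUNK_TOKENS = {"eng", "divx", "xvid", "dvdrip", "x264", "720p", "1080p"}
VIDEO_EXT = re.compile(r"\.(avi|mkv|mp4|mpg|wmv)$", re.IGNORECASE)
YEAR = re.compile(r"\b(?:19|20)\d{2}\b")


def clean_title(filename):
    """Return (title, year) guessed from a messy movie filename."""
    # Drop a known video extension, if there is one.
    name = VIDEO_EXT.sub("", filename)
    # Treat dots, underscores, dashes and brackets as word separators.
    name = re.sub(r"[._\-\[\](){}]+", " ", name).strip()

    # Use the last year-like number that is not at the very start,
    # so a title that begins with a number is not mistaken for a year.
    candidates = [m for m in YEAR.finditer(name) if m.start() > 0]
    year = None
    if candidates:
        match = candidates[-1]
        year = match.group(0)
        # Whatever follows the year is assumed to be encoding info.
        name = name[:match.start()]

    # Remove leftover junk words and collapse the spacing.
    words = [w for w in name.split() if w.lower() not in JUNK_TOKENS]
    return " ".join(words), year


print(clean_title("Run.Ronnie.Run[2002].Eng.divx.avi"))
print(clean_title("Hackers (1995)"))
```

The title and the year can then go to the search separately, which should narrow the results more than the raw filename does.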


P.S. If this ends up locked due to questionable legality because I mention ripping and scraping, I understand, though I own these movies and my scraper isn't being published or commercialized.

"I like offending people, because I think people who get offended should be offended." - Linus Torvalds
Phaelax
DBPro Master
19
Years of Service
User Offline
Joined: 16th Apr 2003
Location: Metropia
Posted: 5th May 2017 18:05
Never mind, I guess. I went ahead and just renamed all the movies. It took a little time, but not nearly as long as I thought. Definitely not as long as renaming 2k TV episodes.

Although, because of this, I did come across a faster method for renaming files in Windows. I had no idea these shortcuts exist.
http://www.ubergizmo.com/how-to/batch-rename-files-windows/

"I like offending people, because I think people who get offended should be offended." - Linus Torvalds
Green Gandalf
VIP Member
17
Years of Service
User Offline
Joined: 3rd Jan 2005
Playing: Malevolence:Sword of Ahkranox, Skyrim, Civ6.
Posted: 6th May 2017 18:43
Quote: "I had no idea these shortcuts exist."


Neither did I.

Thanks!
Jeku
Moderator
19
Years of Service
User Offline
Joined: 4th Jul 2003
Location: Vancouver, British Columbia, Canada
Posted: 27th May 2017 03:00 Edited at: 27th May 2017 03:06
I used to work at the largest lyrics website on the Internet so I have all kinds of nifty tricks in my PHP wheelhouse for matching artists, song titles, and even lyrics. A book could be written on the different ways to do this!

Lately I've built up a famous quotes website, and to make sure I don't put duplicate quotes in, I strip out every non-alphanumeric character (other than spaces) and lowercase everything. Then I replace each run of spaces with a single dash. This works fairly well.

So in my DB I'll have a full quote like:

"I'm involved in some action scenes, so they'll train me for that. I'll be working with my acting coach to prepare for my character."

And it will be slugified to:

"im-involved-in-some-action-scenes-so-theyll-train-me-for-that-ill-be-working-with-my-acting-coach-to-prepare-for-my-character"

The second one is used for matching, while the first one is used for display. Since it's a one-way conversion, unfortunately I have to store both versions in the DB, but it saves a lot of grief and has already detected many duplicates.

Since dealing with movie titles is a bit different, you could also research some of the Bayesian matching functions that are available in PHP. They use fuzzy matching, and you can set the threshold, if I remember correctly. That way you could match a string to a movie even if the string has some extra junk in there.
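
To give a rough idea of threshold-based matching, here is a minimal sketch in Python. Note that it uses the similarity ratio from the standard library's `difflib`, not a Bayesian method and not any particular PHP function; the `best_match` helper, the 0.6 threshold and the title list are assumptions made up for the example.

```python
from difflib import SequenceMatcher


def best_match(query, titles, threshold=0.6):
    """Return (title, score) for the closest title, or None if nothing
    scores at least `threshold` (0.0 = no similarity, 1.0 = identical)."""

    def score(title):
        # Compare case-insensitively so "RUN" and "Run" count as equal.
        return SequenceMatcher(None, query.lower(), title.lower()).ratio()

    best = max(titles, key=score, default=None)
    if best is None or score(best) < threshold:
        return None
    return best, score(best)


# Hypothetical list of known titles, e.g. results returned by a search.
known_titles = ["Hackers", "Run Ronnie Run", "The Matrix"]

print(best_match("run ronnie run eng divx", known_titles))
print(best_match("zzzz", known_titles))
```

Raising the threshold cuts down on false matches but lets more junk-laden names fall through unmatched, so it takes some tuning against real filenames.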

EDIT:

Here's some simple code I used for doing this conversion. Order is important of course, because the second command will clear out the spaces if done first:
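
The code itself is missing from this copy of the thread. As a stand-in, here is a minimal Python sketch of the steps described above (the original was PHP, and the `slugify` name is made up): strip everything that isn't a letter, digit or space, lowercase the rest, then turn each run of spaces into a single dash.

```python
import re


def slugify(text):
    """Reduce text to lowercase alphanumeric words joined by dashes."""
    # Step 1: remove everything except ASCII letters, digits and whitespace.
    # Whitespace has to survive this step, or no word boundaries remain.
    kept = re.sub(r"[^A-Za-z0-9\s]", "", text)
    # Step 2: lowercase so capitalization differences do not matter.
    lowered = kept.lower().strip()
    # Step 3: replace each run of whitespace with a single dash.
    return re.sub(r"\s+", "-", lowered)


quote = ("I'm involved in some action scenes, so they'll train me for that. "
         "I'll be working with my acting coach to prepare for my character.")
print(slugify(quote))
```

Two quotes that differ only in punctuation, capitalization or spacing end up with the same slug, which is what makes the duplicate check work.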
Senior Software Engineer - RotoGrinders
