Resume Renamer 260120

From Game in the Brain Wiki
Revision as of 09:36, 26 January 2026

1. The Problem

Students and applicants rarely follow file naming conventions. You likely have a folder that looks like this:

Resume.pdf

CV_Final_v2.docx

MyResume(1).pdf

john_doe.pdf

This makes sorting by date or qualification impossible without opening every single file.

The Goal: Automatically rename these files based on their content to a standard format:

YYMMDD Name Degree/Background.pdf
Example: 250101 Juan Dela Cruz BS Information Technology.pdf

2. Requirements Checklist

Please ensure you have the following ready before starting.

[ ] Ubuntu 24.04 System.

[ ] Python 3.12+ (Pre-installed on Ubuntu 24.04).

[ ] Ollama installed locally (The AI engine).

[ ] A Small Language Model pulled (e.g., granite3.3:2b or llama3.2).

  • Note: Small models are fast but can make mistakes. The script has logic to catch these, but a human review is always recommended.

[ ] Python Libraries: pdfplumber (for PDFs), python-docx (for Word), requests (to talk to Ollama).

[ ] No Images: The files must have embedded text. This script excludes OCR (Optical Character Recognition) to keep it fast and lightweight. Pure image scans will be skipped.

3. How the Script Works (The Logic)

This script acts as a "Project Manager" that hires two distinct specialists to process each file. It does not blindly ask the AI for everything, as small AIs make mistakes with math and dates.

File Discovery:

    • The script looks for .pdf and .docx files in the folder where the script is located.

Text Extraction:

    • It pulls raw text. If the text is fewer than 50 characters long (likely an image scan), it skips the file.

The Date Specialist (Python Regex):

    • Logic: It scans the text for explicit years (e.g., "2023", "2024").
    • Rule: It ignores the word "Present". Why? If a resume from 2022 says "2022 - Present", treating "Present" as "Today" (2026) would incorrectly date the old resume. We stick to the highest printed number.
    • Output: Sets the date to Jan 1st of the highest year found (e.g., 240101). If no year is found, the script falls back to the file's OS timestamp.
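
The date rule can be tried in isolation. The sketch below is a trimmed version of the year scan in the appendix, not a replacement for it. The function name latest_year_stamp and the explicit current_year argument are assumptions made for illustration, so the result does not depend on the day you run it.

```python
import re


def latest_year_stamp(text, current_year):
    """Return YYMMDD for Jan 1st of the highest explicit year in text, or None."""
    # Match 2000-2059 only when the four digits are not part of a longer number.
    years = [int(y) for y in re.findall(r'(?<!\d)(20[0-5][0-9])(?!\d)', text)]
    # Discard implausible years more than 5 years in the future.
    years = [y for y in years if y <= current_year + 5]
    if not years:
        return None
    return f"{max(years) % 100:02d}0101"


# "Present" is never interpreted, so only the printed years count.
print(latest_year_stamp("Intern, 2021 - 2022. Developer, 2022 - Present", 2026))
```

A resume that says "2022 - Present" is therefore stamped 220101 no matter when the script runs.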

The Content Specialist (Ollama AI):

    • Logic: It sends the text to the local AI with strict instructions.
    • Rule 1 (Priority): It looks for a Degree (e.g., "BS IT") first. It is forbidden from using "Intern" or "Student" if a degree is found.
    • Rule 2 (Fallback): If the AI fails to find a name, the script takes the first plausible line of the document (short, with no digits or symbols, and not a heading such as "Resume") as a fallback.
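
As a rough illustration of how the reply is checked before the fallback runs, here is a minimal sketch. The function name parse_llm_reply and the FORBIDDEN_NAMES constant are hypothetical; the logic mirrors the parsing inside process_folder in the appendix but is not a drop-in replacement for it.

```python
FORBIDDEN_NAMES = {"name", "unknown", "resume", "applicant", "candidate", "full name"}


def parse_llm_reply(reply):
    """Split a 'Name | Background' reply into (name, background).

    name is None when the reply is unusable, which is the signal
    to run the first-line fallback instead.
    """
    name, background = None, "General"
    if reply and "|" in reply:
        left, right = reply.split("|", 1)
        name, background = left.strip(), right.strip()
    elif reply:
        # Some small models answer on two lines instead of using the pipe.
        lines = [ln.strip() for ln in reply.splitlines() if ln.strip()]
        if len(lines) >= 2:
            name, background = lines[0], lines[1]
    # A placeholder such as "Name" means the model echoed the template.
    if not name or name.lower() in FORBIDDEN_NAMES:
        name = None
    return name, background


print(parse_llm_reply("Juan Dela Cruz | BS Information Technology"))
```

Returning None for the name keeps the decision to fall back in one place, rather than scattering placeholder checks through the renaming code.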

Sanitization & Renaming:

    • It fixes "Spaced Names" (e.g., J O H N -> John).
    • It ensures the filename isn't too long.
    • It renames the file only if the name doesn't already exist.
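
The sanitization steps can be sketched as a single helper. The name build_filename and the max_len parameter are assumptions for illustration; the regular expressions are the ones used in the appendix script.

```python
import re


def collapse_spaced_letters(text):
    """Join runs of single letters separated by whitespace: 'J O H N' -> 'JOHN'."""
    return re.sub(r'(?<=\b[A-Za-z])\s+(?=[A-Za-z]\b)', '', text)


def build_filename(date_str, name, background, ext, max_len=60):
    """Assemble 'YYMMDD Name Background.ext' from cleaned fields."""
    cleaned = []
    for field in (name, background):
        field = collapse_spaced_letters(field)[:max_len].strip().title()
        # Keep only word characters, whitespace and hyphens.
        cleaned.append(re.sub(r'[^\w\s-]', '', field))
    return f"{date_str} {cleaned[0]} {cleaned[1]}{ext}"


print(build_filename("250101", "J O H N Doe", "Information Technology!", ".pdf"))
```

Note that str.title() also lowercases the tail of acronyms ("BS" becomes "Bs"), so degree abbreviations in the final filename are worth checking during the human review.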

4. Installation Guide (Ubuntu 24.04)

Open your terminal (Ctrl+Alt+T) and follow these steps exactly.

Step A: System Update

Ensure your system tools are fresh to avoid installation conflicts.

sudo apt update && sudo apt upgrade -y

Step B: Install Ollama & The Model

Install the Ollama Engine:

    curl -fsSL https://ollama.com/install.sh | sh
    

Download the Brain (The Model):

We use granite3.3:2b because it is very fast.

    ollama pull granite3.3:2b
    

Step C: Setup Python Environment

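Ubuntu 24.04 marks the system Python as externally managed (PEP 668), so pip refuses to install packages system-wide. Install the libraries inside a virtual environment (venv) instead. If creating the environment fails with a message that ensurepip is not available, run sudo apt install python3-venv first.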

Create a Project Folder:

    mkdir ~/resume_renamer
    cd ~/resume_renamer
    

Create the Virtual Environment:

    python3 -m venv venv
    

Activate the Environment:

    source venv/bin/activate
    
    (You should see (venv) at the start of your command line now).

Install Required Libraries:

    pip install requests pdfplumber python-docx
    

Step D: Create the Script

Create the python file:

    nano rename_resumes.py
    

Paste the Python code provided in the appendix below.

Save and exit: Press Ctrl+O, Enter, then Ctrl+X.

5. Running the Renamer

This script is portable. It works on the files sitting next to it.

Copy the Script: Copy the rename_resumes.py file into the folder that holds the resumes (e.g., ~/Documents/Student_CVs).

Open Terminal in that folder:

    cd ~/Documents/Student_CVs
    

Activate your Python Environment (Point to where you created it):

    source ~/resume_renamer/venv/bin/activate
    

Run the script:

    python3 rename_resumes.py
    

6. Common Errors & Troubleshooting

Error / Behavior | Why it happens | The Fix (Included in Script)
"Intern" instead of "Degree" | The resume had "INTERN" in big bold letters. | The script's prompt explicitly forbids "Intern" if a degree is found.
Wrong Date (e.g., 260101) | The resume said "2021-Present" and the script assumed "Present" = 2026. | We disabled "Present" logic. It now only trusts explicit numbers (e.g., 2021).
Spaced Names (J O H N) | PDF formatting added spaces between letters. | A regex function detects single letters separated by spaces and collapses them.
Script Freezes | Ollama is overwhelmed. | We added a 60-second timeout and a 0.5s pause between files.
Skipped Files | The PDF is a scanned image (no text). | This is intended. You need an OCR tool for these (not included here).
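
If every file fails with "Ollama call failed", it can help to confirm that the server is reachable and the model is pulled before a long run. The sketch below is an optional pre-flight check and is not part of the script in the appendix. The function name ollama_has_model is hypothetical, and it assumes Ollama's model-listing endpoint (/api/tags) on the default port.

```python
import json
import urllib.error
import urllib.request


def ollama_has_model(model, base_url="http://localhost:11434", timeout=3):
    """Return True if the Ollama server answers and lists the given model."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            payload = json.load(resp)
    except (urllib.error.URLError, OSError, ValueError):
        # Server not running, connection refused, timeout, or a non-JSON reply.
        return False
    if not isinstance(payload, dict):
        return False
    names = {m.get("name", "") for m in payload.get("models", [])}
    # Models pulled without a tag are listed with ":latest" appended.
    return model in names or f"{model}:latest" in names


if __name__ == "__main__":
    print(ollama_has_model("granite3.3:2b"))
```

If this prints False, check that Ollama is running and re-run the ollama pull command from Step B.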

Appendix: The Python Script

Copy the code below into rename_resumes.py.

import os
import requests
import json
import pdfplumber
import re
from datetime import datetime
import time

# --- OPTIONAL DEPENDENCY: python-docx ---

DOCX_AVAILABLE = False
try:
    from docx import Document
    DOCX_AVAILABLE = True
except ImportError:
    print("Warning: 'python-docx' not found. .docx files will be skipped.")
    print("To support Word docs, run: pip install python-docx")

# --- CONFIGURATION ---

FOLDER_PATH = os.path.dirname(os.path.abspath(__file__))

# You can change this to "llama3" or "mistral" if installed
OLLAMA_MODEL = "granite3.3:2b"

# ---------------------

def get_os_creation_date(filepath):
    """Last resort: returns the OS file timestamp in YYMMDD format.

    Note: on Linux, os.path.getctime() reports the last metadata change,
    not the true creation time.
    """
    try:
        timestamp = os.path.getctime(filepath)
        return datetime.fromtimestamp(timestamp).strftime('%y%m%d')
    except OSError:
        return datetime.now().strftime('%y%m%d')

def extract_latest_year_heuristic(text):
    """
    Scans for years (2000-2059), including spaced years (2 0 2 4).
    Returns the HIGHEST year found as YYMMDD (Jan 1st), or None.
    """
    current_year = datetime.now().year
    found_years = []

    # 1. Standard Years (e.g., "2024", "2023-2024")
    matches_standard = re.findall(r'(?<!\d)(20[0-5][0-9])(?!\d)', text)
    found_years.extend(int(y) for y in matches_standard)

    # 2. Spaced Years (e.g., "2 0 2 4")
    matches_spaced = re.findall(r'(?<!\d)2\s+0\s+[0-5]\s+[0-9](?!\d)', text)
    for m in matches_spaced:
        # \s+ can also match tabs or newlines, so strip all whitespace
        found_years.append(int(re.sub(r'\s+', '', m)))

    # Ignore implausible years more than 5 years in the future
    valid_years = [y for y in found_years if y <= current_year + 5]
    if valid_years:
        latest_year = max(valid_years)
        short_year = str(latest_year)[2:]
        return f"{short_year}0101"

    return None


def extract_text_from_docx(filepath):
    """Reads text from .docx files, including tables."""
    if not DOCX_AVAILABLE:
        return ""
    try:
        doc = Document(filepath)
        full_text = []
        for para in doc.paragraphs:
            full_text.append(para.text)
        for table in doc.tables:
            for row in table.rows:
                for cell in row.cells:
                    full_text.append(cell.text)
        return "\n".join(full_text)
    except Exception as e:
        print(f"[ERROR] Reading DOCX: {e}")
        return ""

def clean_text_for_llm(text):
    # Collapse all runs of whitespace into single spaces
    clean = " ".join(text.split())
    # Limit to 4000 chars to prevent choking small models
    return clean[:4000]

def ask_ollama(text):
    system_instruction = (
        "You are a data extraction assistant. "
        "Extract the applicant's Full Name and Background."
        "\n\nBackground Extraction Rules (STRICT):\n"
        "1. MANDATORY: You MUST prefer the Educational Degree over any job title.\n"
        "   - Example: If text says 'IT Intern' AND 'Diploma in Information Technology', output 'Diploma in Information Technology'.\n"
        "   - Example: If text says 'Mechanical Engineering Student', output 'Diploma in Mechanical Engineering' (if listed) or 'Mechanical Engineering'.\n"
        "2. FORBIDDEN: Do NOT use 'Intern', 'Student', 'Assistant', or 'Worker' as the background unless NO degree is mentioned.\n"
        "\nOutput strictly in this format: Name | Background."
        "\nDo NOT include notes, explanations, or numbered lists."
    )

    prompt = f"Resume Text:\n{text}\n\n{system_instruction}"

    url = "http://localhost:11434/api/generate"
    data = {
        "model": OLLAMA_MODEL,
        "prompt": prompt,
        "stream": False,
        "options": {
            "temperature": 0.1,
            "num_ctx": 4096
        }
    }

    try:
        # Added timeout to prevent hanging on one file
        response = requests.post(url, json=data, timeout=60)
        response.raise_for_status()
        result = response.json()['response'].strip()
        return result
    except Exception as e:
        print(f"    [Warning] Ollama call failed: {e}")
        return None


def fix_spaced_names(text):
    # Fixes "J O H N" -> "JOHN"
    return re.sub(r'(?<=\b[A-Za-z])\s+(?=[A-Za-z]\b)', '', text)

def clean_extracted_string(s):
    # Remove lists (1.), labels (Name:), and fix spacing
    s = re.sub(r'^(1\.|2\.|Name:|Background:|\d\W)', '', s, flags=re.IGNORECASE)
    s = fix_spaced_names(s)
    s = s.split('\n')[0]
    s = re.split(r'(?i)note\s*:', s)[0]

    # Truncate to safe filename length
    if len(s) > 60:
        s = s[:60].strip()

    # Note: .title() also lowercases acronyms (e.g., "BS" becomes "Bs")
    return s.strip().title()


def get_name_fallback(text):
    """
    If AI returns 'Name' or 'Unknown', this function grabs the
    first non-empty line of the resume, which is usually the name.
    """
    lines = [line.strip() for line in text.split('\n') if line.strip()]

    ignore_list = ['resume', 'curriculum vitae', 'cv', 'profile', 'bio', 'page', 'summary', 'objective', 'name', 'contact']

    for line in lines:
        lower_line = line.lower()
        # Skip very short lines and section headings
        if len(line) < 3 or any(w in lower_line for w in ignore_list):
            continue

        word_count = len(line.split())
        if word_count > 5:
            continue  # Names rarely have >5 words
        if "looking for" in lower_line or "seeking" in lower_line:
            continue

        # Accept a short line with no digits or symbols
        if len(line) < 50 and not re.search(r'[0-9!@#$%^&*()_+={};"<>?]', line):
            print(f"    [Fallback] AI failed. Guessed name from first line: {line}")
            return line

    return "Unknown Applicant"


def process_folder():
    print("--- Resume Renamer (Strict Degree Priority + Resilient) ---")
    print(f"Working in: {FOLDER_PATH}\n")

    count_success = 0
    count_fail = 0
    script_name = os.path.basename(__file__)

    for filename in os.listdir(FOLDER_PATH):
        # 1. Check Extension
        file_ext = os.path.splitext(filename)[1].lower()
        if filename == script_name:
            continue

        if file_ext == '.docx' and not DOCX_AVAILABLE:
            continue

        if file_ext not in ['.pdf', '.docx']:
            continue

        filepath = os.path.join(FOLDER_PATH, filename)
        text = ""

        # 2. Extract Text (first two pages are enough for a resume header)
        print(f"Processing: {filename}...")
        try:
            if file_ext == '.pdf':
                with pdfplumber.open(filepath) as pdf:
                    for i in range(min(2, len(pdf.pages))):
                        text += pdf.pages[i].extract_text() or ""
            elif file_ext == '.docx':
                text = extract_text_from_docx(filepath)

            if len(text) < 50:
                print("    [SKIP] Text too short.")
                count_fail += 1
                continue

        except Exception as e:
            print(f"    [ERROR] Reading file: {e}")
            count_fail += 1
            continue

        # 3. GET DATE
        date_str = extract_latest_year_heuristic(text)
        if not date_str:
            date_str = get_os_creation_date(filepath)
            print(f"    [Fallback] Using OS Date: {date_str}")

        # 4. GET NAME/BG
        # Add a tiny delay to give Ollama a breather between files
        time.sleep(0.5)
        llm_output = ask_ollama(clean_text_for_llm(text))

        name = None
        bg = "General"

        if llm_output:
            if "|" in llm_output:
                parts = llm_output.split('|', 1)
                name = parts[0].strip()
                bg = parts[1].strip()
            elif "\n" in llm_output:
                lines = [line.strip() for line in llm_output.split('\n') if line.strip()]
                if len(lines) >= 2:
                    name = lines[0]
                    bg = lines[1]

            # --- IMPROVED FALLBACK CHECK ---
            forbidden_names = ["name", "unknown", "resume", "applicant", "candidate", "full name"]
            if not name or name.strip().lower() in forbidden_names:
                name = get_name_fallback(text)
            # -------------------------------

            if name:
                name = clean_extracted_string(name)
                bg = clean_extracted_string(bg)

                # Keep only word characters, whitespace and hyphens
                safe_name = re.sub(r'[^\w\s-]', '', name)
                safe_bg = re.sub(r'[^\w\s-]', '', bg)

                new_filename = f"{date_str} {safe_name} {safe_bg}{file_ext}"
                new_filepath = os.path.join(FOLDER_PATH, new_filename)

                if filepath != new_filepath:
                    # Never overwrite an existing file
                    if not os.path.exists(new_filepath):
                        os.rename(filepath, new_filepath)
                        print(f"    -> Renamed: [{new_filename}]")
                        count_success += 1
                    else:
                        print(f"    -> Duplicate: [{new_filename}]")
                else:
                    print("    -> No change.")
            else:
                print(f"    -> AI Format Fail: {llm_output}")
                count_fail += 1
        else:
            print("    -> AI returned nothing.")
            count_fail += 1

    print(f"\nDone! Renamed: {count_success} | Failed: {count_fail}")


if __name__ == "__main__":
    process_folder()