
Resume Renamer 260120

From Game in the Brain Wiki
=== Step C: Setup Python Environment ===
On Ubuntu 24.04, pip refuses to install packages into the system Python (it reports an externally-managed-environment error), so the libraries go into a Virtual Environment (venv). This also keeps them from breaking system tools.

'''Create a Project Folder:'''

#:<syntaxhighlight lang="bash">
mkdir ~/resume_renamer
cd ~/resume_renamer
</syntaxhighlight>

'''Create the Virtual Environment:'''

#:<syntaxhighlight lang="bash">python3 -m venv venv</syntaxhighlight>
#:(If this command reports that ensurepip is not available, install the venv module first with sudo apt install python3-venv).

'''Activate the Environment:'''

#:<syntaxhighlight lang="bash">source venv/bin/activate</syntaxhighlight>
#:(You should see (venv) at the start of your command line now).

'''Install Required Libraries:'''

#:<syntaxhighlight lang="bash">pip install requests pdfplumber python-docx</syntaxhighlight>
 
=== Step D: Create the Script ===

'''Create the Python file:'''

#:<syntaxhighlight lang="bash">nano rename_resumes.py</syntaxhighlight>

'''Paste the Python code''' provided in the appendix below.

'''Save and exit:''' Press Ctrl+O, Enter, then Ctrl+X.

== 5. Running the Renamer ==
This script is '''portable'''. It works on the files sitting next to it.

'''Copy the Script:''' Copy the rename_resumes.py file into your folder of resumes (e.g., ~/Documents/Student_CVs).

'''Open Terminal in that folder:'''

#:<syntaxhighlight lang="bash">cd ~/Documents/Student_CVs</syntaxhighlight>

'''Activate your Python Environment (point to where you created it in Step C):'''

#:<syntaxhighlight lang="bash">source ~/resume_renamer/venv/bin/activate</syntaxhighlight>

'''Run the script:'''

#:<syntaxhighlight lang="bash">python3 rename_resumes.py</syntaxhighlight>

== 6. Common Errors & Troubleshooting ==
{| class="wikitable"
! Error / Behavior !! Why it happens !! The Fix (Included in Script)
|-
| '''"Intern" instead of "Degree"''' || The Resume had "INTERN" in big bold letters. || The script's prompt explicitly forbids "Intern" if a Degree is found.
|-
| '''Wrong Date (e.g., 260101)''' || The resume said "2021-Present" and the script assumed "Present" = 2026. || We disabled "Present" logic. It now only trusts explicit numbers (e.g., 2021).
|-
| '''Spaced Names (J O H N)''' || PDF formatting added spaces between letters. || A Regex function detects single letters + spaces and collapses them.
|-
| '''Script Freezes''' || Ollama is overwhelmed. || We added a 60-second timeout and a 0.5s pause between files.
|-
| '''Skipped Files''' || The PDF is a scanned image (no text). || This is intended. You need an OCR tool for these (not included here).
|}


== Appendix: The Python Script ==
======== START CODE
<pre>import osimport requestsimport jsonimport pdfplumberimport refrom datetime import datetimeimport time--- OPTIONAL DEPENDENCY: python-docx ---DOCX_AVAILABLE = Falsetry:from docx import DocumentDOCX_AVAILABLE = Trueexcept ImportError:print("Warning: 'python-docx' not found. .docx files will be skipped.")print("To support Word docs, run: pip install python-docx")--- CONFIGURATION ---FOLDER_PATH = os.path.dirname(os.path.abspath(file))You can change this to "llama3" or "mistral" if installedOLLAMA_MODEL = "granite3.3:2b"---------------------def get_os_creation_date(filepath):"""Last resort: Gets OS file creation date in YYMMDD format."""try:timestamp = os.path.getctime(filepath)return datetime.fromtimestamp(timestamp).strftime('%y%m%d')except:return datetime.now().strftime('%y%m%d')def extract_latest_year_heuristic(text):"""Scans for years (2000-2059), including spaced years (2 0 2 4).Returns the HIGHEST year found."""current_year = datetime.now().yearfound_years = []# 1. Standard Years (e.g., &quot;2024&quot;, &quot;2023-2024&quot;)
matches_standard = re.findall(r&#39;(?&lt;!\d)(20[0-5][0-9])(?!\d)&#39;, text)
if matches_standard:
    found_years.extend([int(y) for y in matches_standard])


# 2. Spaced Years (e.g., &quot;2 0 2 4&quot;)
=== Rename Resumes Script ===
matches_spaced = re.findall(r&#39;(?&lt;!\d)2\s+0\s+[0-5]\s+[0-9](?!\d)&#39;, text)
Copy the code below into rename_resumes.py.
if matches_spaced:
    for m in matches_spaced:
        clean_year = int(m.replace(&quot; &quot;, &quot;&quot;))
        found_years.append(clean_year)


if found_years:
<syntaxhighlight lang="python">
     valid_years = [y for y in found_years if y &lt;= current_year + 5]
# --- IMPROVED FUNCTION: SMART PDF READER (Skips Forms & Signature Pages) ---
def get_smart_pdf_text(filepath):
     """
    Reads PDF pages but SKIPS pages that look like 'Application Forms'.
    Returns the text of the first 2 'valid' resume pages found.
    """
    valid_text = ""
    pages_read = 0
      
      
     if valid_years:
     # Phrases that indicate a page is a FORM, not a Resume
         latest_year = max(valid_years)
    skip_phrases = [
         short_year = str(latest_year)[2:]
        "APPLICATION FOR EMPLOYMENT",
         return f&quot;{short_year}0101&quot;
        "OFFICIAL USE ONLY",
        "DO NOT WRITE BELOW THIS LINE",
         "PERSONAL DATA SHEET",
         "APPLICANT'S SIGNATURE",  # Found on Page 2 of your file
         "FAMILY BACKGROUND"        # Found on Page 2 of your file
    ]


return None
    try:
def extract_text_from_docx(filepath):"""Reads text from .docx files, including tables."""if not DOCX_AVAILABLE:return ""try:doc = Document(filepath)full_text = []for para in doc.paragraphs:full_text.append(para.text)for table in doc.tables:for row in table.rows:for cell in row.cells:full_text.append(cell.text)return "\n".join(full_text)except Exception as e:print(f"[ERROR] Reading DOCX: {e}")return ""def clean_text_for_llm(text):clean = " ".join(text.split())# Limit to 4000 chars to prevent choking small modelsreturn clean[:4000]def ask_ollama(text):system_instruction = ("You are a data extraction assistant. ""Extract the applicant's Full Name and Background.""\n\nBackground Extraction Rules (STRICT):\n""1. MANDATORY: You MUST prefer the Educational Degree over any job title.\n""  - Example: If text says 'IT Intern' AND 'Diploma in Information Technology', output 'Diploma in Information Technology'.\n""  - Example: If text says 'Mechanical Engineering Student', output 'Diploma in Mechanical Engineering' (if listed) or 'Mechanical Engineering'.\n""2. FORBIDDEN: Do NOT use 'Intern', 'Student', 'Assistant', or 'Worker' as the background unless NO degree is mentioned.\n""\nOutput strictly in this format: Name | Background.""\nDo NOT include notes, explanations, or numbered lists.")prompt = f&quot;Resume Text:\n{text}\n\n{system_instruction}&quot;
        with pdfplumber.open(filepath) as pdf:
            for page in pdf.pages:
                text = page.extract_text() or ""
               
                # CHECK: Is this page just a form?
                # We check if ANY of the skip phrases appear in the text
                is_form = any(phrase in text.upper() for phrase in skip_phrases)
               
                if is_form:
                    print(f"    [INFO] Skipped a 'Form' page (found key phrase)...")
                    continue  # Skip this page, check the next one
               
                # If not a form, it's likely the resume. Keep it.
                valid_text += text + "\n"
                pages_read += 1
               
                # Stop after finding 2 valid pages of resume content
                if pages_read >= 2:
                    break
                   
    except Exception as e:
        print(f"   [ERROR] PDF Read Error: {e}")
        return ""
       
    return valid_text
# --------------------------------------
</syntaxhighlight>


url = &quot;http://localhost:11434/api/generate&quot;
=== Ocr Converter Script ===
data = {
Copy the code below into ocr_converter.py. The Renamer does not work with image-only PDFs, so convert them first with this script. The output is only as good as the OCR engine that ocrmypdf uses (Tesseract). <syntaxhighlight lang="bash">python3 ocr_converter.py</syntaxhighlight><syntaxhighlight lang="python">
    &quot;model&quot;: OLLAMA_MODEL,
import os
    &quot;prompt&quot;: prompt,
import subprocess
     &quot;stream&quot;: False,
import pdfplumber
     &quot;options&quot;: {
 
         &quot;temperature&quot;: 0.1,  
# Configuration
        &quot;num_ctx&quot;: 4096
FOLDER_PATH = "."  # Current folder
     }
MIN_TEXT_LENGTH = 50  # If text is less than this, we assume it's an image
}
 
def has_embedded_text(file_path):
     """Checks if a PDF already has text."""
     try:
         with pdfplumber.open(file_path) as pdf:
            full_text = ""
            for page in pdf.pages:
                text = page.extract_text()
                if text:
                    full_text += text
           
            # If we found enough text, return True
            if len(full_text.strip()) > MIN_TEXT_LENGTH:
                return True
     except Exception as e:
        print(f"Error reading {file_path}: {e}")
        return False
    return False


try:
def ocr_file(file_path):
     # Added timeout to prevent hanging on one file
     """Runs OCRmyPDF on the file."""
    response = requests.post(url, json=data, timeout=60)
     output_path = file_path.replace(".pdf", "_OCR.pdf")
    response.raise_for_status()
     result = response.json()[&#39;response&#39;].strip()
    return result
except Exception as e:
    print(f&quot;    [Warning] Ollama call failed: {e}&quot;)
    return None
def fix_spaced_names(text):# Fixes "J O H N" -> "JOHN"return re.sub(r'(?<=\b[A-Za-z])\s+(?=[A-Za-z]\b)', '', text)def clean_extracted_string(s):# Remove lists (1.), labels (Name:), and fix spacings = re.sub(r'^(1.|2.|Name:|Background:|\d\W)', '', s, flags=re.IGNORECASE)s = fix_spaced_names(s)s = s.split('\n')[0]s = re.split(r'(?i)note\s*:', s)[0]# Truncate to safe filename length
if len(s) &gt; 60:
    s = s[:60].strip()
      
      
return s.strip().title()
    # Don't re-OCR if the output already exists
def get_name_fallback(text):"""If AI returns 'Name' or 'Unknown', this function grabs thefirst non-empty line of the resume, which is usually the name."""lines = [line.strip() for line in text.split('\n') if line.strip()]ignore_list = [&#39;resume&#39;, &#39;curriculum vitae&#39;, &#39;cv&#39;, &#39;profile&#39;, &#39;bio&#39;, &#39;page&#39;, &#39;summary&#39;, &#39;objective&#39;, &#39;name&#39;, &#39;contact&#39;]
    if os.path.exists(output_path):
        print(f"Skipping {file_path} (OCR version already exists)")
        return


for line in lines:
    print(f"🖼️  Image Detected: Converting {file_path}...")
    lower_line = line.lower()
    if len(line) &lt; 3 or any(w in lower_line for w in ignore_list):
        continue
      
      
     word_count = len(line.split())
     try:
     if word_count &gt; 5: continue # Names rarely have &gt;5 words
        # Run the OCR command
    if &quot;looking for&quot; in lower_line or &quot;seeking&quot; in lower_line: continue
        # --force-ocr: Process even if it thinks there is some text (often garbage in scans)
        # --deskew: Straighten crooked scans
        command = [
            "ocrmypdf",
            "--force-ocr",
            "--deskew",
            file_path,
            output_path
        ]
       
        result = subprocess.run(command, capture_output=True, text=True)
       
        if result.returncode == 0:
            print(f"✅ Success: Created {output_path}")
        else:
            print(f"❌ Failed to OCR {file_path}")
            print(result.stderr)
           
     except FileNotFoundError:
        print("❌ Error: 'ocrmypdf' is not installed. Run 'sudo apt install ocrmypdf' first.")


     if len(line) &lt; 50 and not re.search(r&#39;[0-9!@#$%^&amp;*()_+={};&quot;&lt;&gt;?]&#39;, line):
def main():
         print(f&quot;    [Fallback] AI failed. Guessed name from first line: {line}&quot;)
    print("🔍 Scanning for image-based PDFs...")
        return line
     files = [f for f in os.listdir(FOLDER_PATH) if f.lower().endswith(".pdf") and "_OCR" not in f]
   
    count = 0
    for filename in files:
         file_path = os.path.join(FOLDER_PATH, filename)
          
          
return &quot;Unknown Applicant&quot;
        if not has_embedded_text(file_path):
def process_folder():print(f"--- Resume Renamer (Strict Degree Priority + Resilient) ---")print(f"Working in: {FOLDER_PATH}\n")count_success = 0
            ocr_file(file_path)
count_fail = 0
            count += 1
script_name = os.path.basename(__file__)
           
    if count == 0:
        print("🎉 No image-only PDFs found. All files already have text!")
    else:
        print(f"\n✨ Processed {count} files.")
 
if __name__ == "__main__":
    main()
</syntaxhighlight>
 
=== PDF 2 VCF Script ===
Copy the code below into pdf2vcf.py. This creates a bulk VCF file so you can load this into your contacts. <syntaxhighlight lang="bash">python3 pdf2vcf.py</syntaxhighlight><syntaxhighlight lang="python">
import os
import requests
import json
import pdfplumber
import re
from datetime import datetime
import time


for filename in os.listdir(FOLDER_PATH):
# --- CONFIGURATION ---
     # 1. Check Extension
FOLDER_PATH = os.path.dirname(os.path.abspath(__file__))
     file_ext = os.path.splitext(filename)[1].lower()
OLLAMA_MODEL = "granite3.3:2b"
     if filename == script_name:
# ---------------------
        continue
 
def get_timestamp():
    """Returns current YYMMDD-HHMMSS"""
    return datetime.now().strftime('%y%m%d-%H%M%S')
 
def get_short_date():
     """Returns current YYMMDD"""
    return datetime.now().strftime('%y%m%d')
 
# --- SMART PDF READER ---
def get_smart_pdf_text(filepath):
    """
    Reads PDF pages but SKIPS pages that look like 'Application Forms'.
    Returns the text of the first 2 'valid' resume pages found.
     """
    valid_text = ""
    pages_read = 0
    skip_phrases = [
        "APPLICATION FOR EMPLOYMENT", "OFFICIAL USE ONLY",
        "DO NOT WRITE BELOW THIS LINE", "PERSONAL DATA SHEET",
        "APPLICANT'S SIGNATURE", "FAMILY BACKGROUND"
    ]
 
    try:
        with pdfplumber.open(filepath) as pdf:
            for page in pdf.pages:
                text = page.extract_text() or ""
                # CHECK: Is this page just a form?
                if any(phrase in text.upper() for phrase in skip_phrases):
                    continue
               
                valid_text += text + "\n"
                pages_read += 1
                if pages_read >= 2: break   
    except Exception as e:
        print(f"    [ERROR] PDF Read Error: {e}")
        return ""
    return valid_text
 
def clean_text_for_llm(text):
    clean = " ".join(text.split())
     return clean[:6000]
 
def parse_name_from_filename(filename):
    """
    Fallback: Tries to guess the name from a filename like '260101 Kim Ong Diploma.pdf'
    """
    # Remove extension
    base = os.path.splitext(filename)[0]
      
      
     if file_ext == &#39;.docx&#39; and not DOCX_AVAILABLE:
     # Regex: Look for 6 digits at start, then text
         continue
    match = re.search(r'^\d{6}\s+(.*?)\s+(?:Bachelor|Diploma|Certificate|General|Master|PhD|Associate|Engineer|Architect)', base, re.IGNORECASE)
    if match:
         return match.group(1).strip()
      
      
     if file_ext not in [&#39;.pdf&#39;, &#39;.docx&#39;]:
    # Weaker Regex: Just take the first 3 words after the date
        continue
    match_weak = re.search(r'^\d{6}\s+([A-Za-z-]+\s+[A-Za-z-]+\s?[A-Za-z-]*)', base)
     if match_weak:
        return match_weak.group(1).strip()
 
    return None
 
def ask_ollama_extraction(text, filename):
    """
    Asks LLM to extract specific fields, using the FILENAME as a hint.
    """
    system_instruction = (
        "You are a Data Extraction Expert. Extract details from the resume.\n"
        f"CONTEXT: The file is named '{filename}'. This filename likely contains the correct spelling of the Name and Degree.\n"
        "\nRULES:\n"
        "1. **Double Check the Name:** If the resume text has OCR errors (e.g., 'K1m 0ng'), use the spelling from the Filename ('Kim Ong').\n"
        "2. **Extract:** Full Name, Educational Degree (Short), Email, Phone, and Summary.\n"
        "3. **Summary:** Write a concise 3-sentence summary of their key skills.\n"
        "\nRETURN JSON ONLY:\n"
        "{\n"
        '  "name": "John Doe",\n'
        '  "degree": "BS IT",\n'
        '  "email": "john@email.com",\n'
        '  "phone": "09123456789",\n'
        '  "summary": "Experienced in..."\n'
        "}"
    )
 
    prompt = f"Resume Text:\n{text}\n\n{system_instruction}"
 
    url = "http://localhost:11434/api/generate"
    data = {
        "model": OLLAMA_MODEL,
        "prompt": prompt,
        "stream": False,
        "format": "json",  
        "options": {"temperature": 0.1, "num_ctx": 4096}
    }


    filepath = os.path.join(FOLDER_PATH, filename)
    text = &quot;&quot;
   
    # 2. Extract Text
    print(f&quot;Processing: {filename}...&quot;)
     try:
     try:
         if file_ext == &#39;.pdf&#39;:
         response = requests.post(url, json=data, timeout=60)
            with pdfplumber.open(filepath) as pdf:
        response.raise_for_status()
                for i in range(min(2, len(pdf.pages))):
         result_json = response.json()['response']
                    text += pdf.pages[i].extract_text() or &quot;&quot;
         return json.loads(result_json)
         elif file_ext == &#39;.docx&#39;:
            text = extract_text_from_docx(filepath)
           
         if len(text) &lt; 50:
            print(f&quot;    [SKIP] Text too short.&quot;)
            count_fail += 1
            continue
           
     except Exception as e:
     except Exception as e:
         print(f&quot;   [ERROR] Reading file: {e}&quot;)
         print(f"   [Warning] AI Extraction failed: {e}")
         count_fail += 1
         return None
        continue


     # 3. GET DATE
def create_vcard_string(data, creation_date):
     date_str = extract_latest_year_heuristic(text)
    """
     if not date_str:
     Formats the data into VCF 3.0 format.
        date_str = get_os_creation_date(filepath)
    Format: Name Degree YYMMDD (All in First Name field for easy searching)
        print(f&quot;    [Fallback] Using OS Date: {date_str}&quot;)
    """
    name = data.get("name", "Unknown")
     degree = data.get("degree", "")
     email = data.get("email", "")
    phone = data.get("phone", "")
    summary = data.get("summary", "")


     # 4. GET NAME/BG
     # Sanitize inputs
     # Add a tiny delay to give Ollama a breather between files
     if not name or name == "Unknown":
     time.sleep(0.5)
        name = "Unknown Candidate"
     llm_output = ask_ollama(clean_text_for_llm(text))
      
     complex_name = f"{name} {degree} {creation_date}".strip()
      
      
     name = None
     vcf = [
     bg = &quot;General&quot;
        "BEGIN:VCARD",
        "VERSION:3.0",
        f"N:;{complex_name};;;",
        f"FN:{complex_name}",
        f"TEL;TYPE=CELL:{phone}",
        f"EMAIL;TYPE=WORK:{email}",
        f"NOTE:{summary} (Extracted via AI)",
        f"REV:{datetime.now().isoformat()}",
        "END:VCARD"
    ]
    return "\n".join(vcf) + "\n"
 
def process_to_vcf():
     output_filename = f"{get_timestamp()}_Bulk_Import.vcf"
    output_path = os.path.join(FOLDER_PATH, output_filename)
    creation_date = get_short_date()


     if llm_output:
     print(f"--- Smart Resume to VCF Exporter ---")
        if &quot;|&quot; in llm_output:
    print(f"Target Output: {output_filename}")
            parts = llm_output.split(&#39;|&#39;, 1)
   
            name = parts[0].strip()
    count = 0
            bg = parts[1].strip()
   
        elif &quot;\n&quot; in llm_output:
    with open(output_path, "w", encoding="utf-8") as vcf_file:
            lines = [line.strip() for line in llm_output.split(&#39;\n&#39;) if line.strip()]
            if len(lines) &gt;= 2:
                name = lines[0]
                bg = lines[1]
          
          
         # --- IMPROVED FALLBACK CHECK ---
         for filename in os.listdir(FOLDER_PATH):
        forbidden_names = [&quot;name&quot;, &quot;unknown&quot;, &quot;resume&quot;, &quot;applicant&quot;, &quot;candidate&quot;, &quot;full name&quot;]
            if not filename.lower().endswith(".pdf"):
        if not name or name.strip().lower() in forbidden_names:
                continue
            name = get_name_fallback(text)
        # -------------------------------


        if name:
             filepath = os.path.join(FOLDER_PATH, filename)
             name = clean_extracted_string(name)
             print(f"Processing: {filename}...")
            bg = clean_extracted_string(bg)
 
           
             # 1. Get Text
            safe_name = re.sub(r&#39;[^\w\s-]&#39;, &#39;&#39;, name)
             text = get_smart_pdf_text(filepath)
             safe_bg = re.sub(r&#39;[^\w\s-]&#39;, &#39;&#39;, bg)
            if len(text) < 50:
           
                print("   [SKIP] Text too short/unreadable.")
            new_filename = f&quot;{date_str} {safe_name} {safe_bg}{file_ext}&quot;
                 continue
            new_filepath = os.path.join(FOLDER_PATH, new_filename)
 
              
             # 2. Extract Data (Passing filename for context)
             if filepath != new_filepath:
             time.sleep(0.5)  
                if not os.path.exists(new_filepath):
             data = ask_ollama_extraction(clean_text_for_llm(text), filename)
                    os.rename(filepath, new_filepath)
                    print(f&quot;   -&gt; Renamed: [{new_filename}]&quot;)
                    count_success += 1
                 else:
                    print(f&quot;    -&gt; Duplicate: [{new_filename}]&quot;)
             else:
                print(&quot;    -&gt; No change.&quot;)
        else:
             print(f&quot;    -&gt; AI Format Fail: {llm_output}&quot;)
             count_fail += 1
    else:
        print(&quot;    -&gt; AI returned nothing.&quot;)
        count_fail += 1


print(f&quot;\nDone! Renamed: {count_success} | Failed: {count_fail}&quot;)
            if data:
if name == "main":process_folder()</pre>
                # 3. Double Check Name (Python Logic Fallback)
                # If AI gave a bad name, or "Unknown", try to grab it from the filename manually
                ai_name = data.get("name", "")
                if not ai_name or "unknown" in ai_name.lower() or any(char.isdigit() for char in ai_name):
                    fallback_name = parse_name_from_filename(filename)
                    if fallback_name:
                        print(f"    [Correction] Replaced '{ai_name}' with filename name: '{fallback_name}'")
                        data['name'] = fallback_name


                # 4. Create VCard Block
                vcard_block = create_vcard_string(data, creation_date)
                vcf_file.write(vcard_block)
                print(f"    -> Added: {data.get('name')} ({data.get('degree')})")
                count += 1
            else:
                print("    -> Failed to extract data.")


    print(f"\nDone! Created {output_filename} with {count} contacts.")


if __name__ == "__main__":
    process_to_vcf()
</syntaxhighlight>

Latest revision as of 09:14, 4 February 2026

1. The Problem

Students and applicants rarely follow file naming conventions. You likely have a folder that looks like this:

Resume.pdf

CV_Final_v2.docx

MyResume(1).pdf

john_doe.pdf

This makes sorting by date or qualification impossible without opening every single file.

The Goal: Automatically rename these files based on their content to a standard format:

YYMMDD Name Degree/Background.pdf
Example: 250101 Juan Dela Cruz BS Information Technology.pdf

2. Requirements Checklist

Please ensure you have the following ready before starting.

[ ] Ubuntu 24.04 System.

[ ] Python 3.12+ (Pre-installed on Ubuntu 24.04).

[ ] Ollama installed locally (The AI engine).

[ ] A Small Language Model pulled (e.g., granite3.3:2b or llama3.2).

  • Note: Small models are fast but can make mistakes. The script has logic to catch these, but a human review is always recommended.

[ ] Python Libraries: pdfplumber (for PDFs), python-docx (for Word), requests (to talk to Ollama).

[ ] No Images: The files must have embedded text. This script excludes OCR (Optical Character Recognition) to keep it fast and lightweight. Pure image scans will be skipped.
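
The checklist above can also be ticked off by a small script. The following is a minimal sketch, not part of the renamer itself: the function name check_requirements is made up for illustration, and it only checks that the three libraries can be imported and that an ollama executable is on the PATH.

```python
import importlib.util
import shutil

# pip package name -> import name (python-docx is imported as "docx").
REQUIRED_MODULES = {
    "pdfplumber": "pdfplumber",
    "python-docx": "docx",
    "requests": "requests",
}


def check_requirements():
    """Return a dict mapping each requirement to True (found) or False (missing)."""
    status = {}
    for package, module in REQUIRED_MODULES.items():
        # find_spec returns None when a top-level module is not installed.
        status[package] = importlib.util.find_spec(module) is not None
    # shutil.which returns None when the executable is not on PATH.
    status["ollama"] = shutil.which("ollama") is not None
    return status


for item, found in check_requirements().items():
    print(f"[{'x' if found else ' '}] {item}")
```

Run it inside the activated venv, otherwise the libraries will be reported as missing even if they are installed there.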

3. How the Script Works (The Logic)

This script acts as a "Project Manager" that hires two distinct specialists to process each file. It does not blindly ask the AI for everything, as small AIs make mistakes with math and dates.

File Discovery:

    • The script looks for .pdf and .docx files in the folder where the script is located.

Text Extraction:

    • It pulls raw text. If the text is less than 50 characters (likely an image scan), it skips the file.
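
The discovery and skip rules above can be sketched as follows. This is a minimal illustration, not the renamer's actual code: the helper names find_resume_files and looks_like_image_scan are hypothetical, and the 50-character threshold is taken from the rule stated above.

```python
from pathlib import Path

MIN_TEXT_LENGTH = 50  # threshold from the rule above; tune it for your files


def find_resume_files(folder):
    """Return the .pdf and .docx files directly inside `folder`, sorted by name."""
    return sorted(
        p for p in Path(folder).iterdir()
        if p.is_file() and p.suffix.lower() in {".pdf", ".docx"}
    )


def looks_like_image_scan(text):
    """True when extraction produced too little text to be a real resume."""
    return len(text.strip()) < MIN_TEXT_LENGTH
```

Comparing the lowercased suffix means files named Resume.PDF are picked up too.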

The Date Specialist (Python Regex):

    • Logic: It scans the text for explicit years (e.g., "2023", "2024").
    • Rule: It ignores the word "Present". Why? If a resume from 2022 says "2022 - Present", treating "Present" as "Today" (2026) would incorrectly date the old resume. We stick to the highest printed number.
    • Output: Sets the date to Jan 1st of the highest year found (e.g., 240101).
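
A sketch of the Date Specialist under stated assumptions: the function name guess_resume_date and the accepted year range (1950 to 2099) are illustrative choices, not taken from the original script. It shows why "Present" can never influence the result: only four-digit numbers are matched.

```python
import re


def guess_resume_date(text):
    """Return YYMMDD for Jan 1st of the highest explicit year, or None."""
    # Only four-digit years count. Words such as "Present" never match,
    # and the \b boundaries keep digits inside phone numbers from matching.
    years = [int(y) for y in re.findall(r"\b(19[5-9]\d|20\d{2})\b", text)]
    if not years:
        return None
    return f"{max(years) % 100:02d}0101"


print(guess_resume_date("Intern, 2021 - Present. Graduated 2019."))
```

One limitation of the "highest printed number" rule: a resume that mentions a future year (for example an expected graduation date) will be dated to that year.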

The Content Specialist (Ollama AI):

    • Logic: It sends the text to the local AI with strict instructions.
    • Rule 1 (Priority): It looks for a Degree (e.g., "BS IT") first. It is forbidden from using "Intern" or "Student" if a degree is found.
    • Rule 2 (Fallback): If the AI fails to find a name, the script grabs the first line of the document as a fallback.
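
The fallback in Rule 2 might look like the sketch below. The names fallback_name and choose_name are hypothetical, and the final "Unknown" default is an assumption for the case where the document has no text at all.

```python
def fallback_name(text):
    """Return the first non-empty line of the resume text, or None."""
    for line in text.splitlines():
        line = line.strip()
        if line:
            return line
    return None


def choose_name(ai_name, text):
    """Prefer the AI's answer; fall back to the first line when it is empty."""
    ai_name = (ai_name or "").strip()
    return ai_name or fallback_name(text) or "Unknown"
```

The first line of a resume is usually the applicant's name, but not always, which is one reason a human review of the results is still recommended.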

Sanitization & Renaming:

    • It fixes "Spaced Names" (e.g., J O H N -> John).
    • It ensures the filename isn't too long.
    • It renames the file only if the name doesn't already exist.
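
The three sanitization steps can be sketched as below. This is an illustration under assumptions: the helper names, the 100-character cap, and the set of characters replaced in filenames are not taken from the original script.

```python
import os
import re

MAX_NAME_LENGTH = 100  # assumed cap; the original limit is not stated


def collapse_spaced_name(text):
    """Turn 'J O H N' into 'John'; leave normally spaced text alone."""
    def join_letters(match):
        return match.group(0).replace(" ", "").capitalize()
    # Three or more single letters, each separated by one space
    return re.sub(r"\b(?:[A-Za-z] ){2,}[A-Za-z]\b", join_letters, text)


def build_filename(date, name, degree, ext):
    """Assemble 'YYMMDD Name Degree.ext' with unsafe characters removed."""
    base = f"{date} {collapse_spaced_name(name)} {degree}"
    base = re.sub(r'[\\/:*?"<>|]', " ", base)  # e.g. the slash in "BS/MS"
    base = " ".join(base.split())              # squeeze repeated whitespace
    return base[:MAX_NAME_LENGTH].rstrip() + ext


def rename_if_free(old_path, new_name):
    """Rename only when the target name is not already taken."""
    new_path = os.path.join(os.path.dirname(old_path), new_name)
    if os.path.exists(new_path):
        return False
    os.rename(old_path, new_path)
    return True
```

If a PDF spaces out a whole name with single spaces between every letter ("J O H N D O E"), this pattern cannot tell where one word ends, so the result would be a single word.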

4. Installation Guide (Ubuntu 24.04)

Open your terminal (Ctrl+Alt+T) and follow these steps exactly.

Step A: System Update

Ensure your system tools are fresh to avoid installation conflicts.

sudo apt update && sudo apt upgrade -y

Step B: Install Ollama & The Model

Install the Ollama Engine:

    curl -fsSL https://ollama.com/install.sh | sh

Download the Brain (The Model). We use granite3.3:2b because it is very fast:

    ollama pull granite3.3:2b
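
Before moving on, you may want to confirm that Ollama is running and the model was pulled. The sketch below is optional and not part of the renamer. It assumes Ollama is listening on its default local port (11434) and queries its /api/tags endpoint, which lists locally available models. The function names are hypothetical.

```python
import json
import urllib.request

OLLAMA_TAGS_URL = "http://localhost:11434/api/tags"  # default local address


def model_is_listed(tags_payload, model):
    """True when `model` appears in a parsed /api/tags response."""
    names = [m.get("name", "") for m in tags_payload.get("models", [])]
    return model in names


def check_ollama(model="granite3.3:2b"):
    """Ask the local Ollama server for its models; False if unreachable."""
    try:
        with urllib.request.urlopen(OLLAMA_TAGS_URL, timeout=5) as resp:
            return model_is_listed(json.load(resp), model)
    except (OSError, ValueError):
        # OSError covers connection failures; ValueError covers bad JSON
        return False
```

Call check_ollama() from a Python prompt. Models pulled without an explicit tag are typically listed with a ":latest" suffix, so pass the full name (for example "llama3.2:latest") when checking those.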
    

Step C: Setup Python Environment

Ubuntu 24.04 marks its system Python as externally managed, so pip refuses to install packages system-wide. Use a Virtual Environment (venv) for this script. If python3 -m venv reports that ensurepip is missing, install it first with sudo apt install python3-venv.

Create a Project Folder:

    mkdir ~/resume_renamer
    cd ~/resume_renamer

Create the Virtual Environment:

    python3 -m venv venv

Activate the Environment:

    source venv/bin/activate

    (You should see (venv) at the start of your command line now).

Install Required Libraries:

    pip install requests pdfplumber python-docx
    

Step D: Create the Script

Create the Python file:

    nano rename_resumes.py
    

Paste the Python code provided in the appendix below.

Save and exit: Press Ctrl+O, Enter, then Ctrl+X.

5. Running the Renamer

This script is portable. It works on the files sitting next to it.

Copy the Script: Move the rename_resumes.py file into your folder full of PDFs (e.g., ~/Documents/Student_CVs).

Open Terminal in that folder:

    cd ~/Documents/Student_CVs

Activate your Python Environment (point to where you created it in Step C, not the current folder):

    source ~/resume_renamer/venv/bin/activate

Run the script:

    python3 rename_resumes.py
    

6. Common Errors & Troubleshooting

Error / Behavior | Why it happens | The Fix (Included in Script)
"Intern" instead of "Degree" | The resume had "INTERN" in big bold letters. | The script's prompt explicitly forbids "Intern" if a Degree is found.
Wrong Date (e.g., 260101) | The resume said "2021-Present" and the script assumed "Present" = 2026. | We disabled "Present" logic. It now only trusts explicit numbers (e.g., 2021).
Spaced Names (J O H N) | PDF formatting added spaces between letters. | A Regex function detects single letters + spaces and collapses them.
Script Freezes | Ollama is overwhelmed. | We added a 60-second timeout and a 0.5s pause between files.
Skipped Files | The PDF is a scanned image (no text). | This is intended. The renamer has no OCR; convert these files first (see the OCR Converter Script in the appendix).

Appendix: The Python Script

Rename Resumes Script

Copy the code below into rename_resumes.py.

import pdfplumber  # required by get_smart_pdf_text

# --- IMPROVED FUNCTION: SMART PDF READER (Skips Forms & Signature Pages) ---
def get_smart_pdf_text(filepath):
    """
    Reads PDF pages but SKIPS pages that look like 'Application Forms'.
    Returns the text of the first 2 'valid' resume pages found.
    """
    valid_text = ""
    pages_read = 0
    
    # Phrases that indicate a page is a FORM, not a Resume
    skip_phrases = [
        "APPLICATION FOR EMPLOYMENT", 
        "OFFICIAL USE ONLY", 
        "DO NOT WRITE BELOW THIS LINE",
        "PERSONAL DATA SHEET",
        "APPLICANT'S SIGNATURE",   # Found on Page 2 of your file
        "FAMILY BACKGROUND"        # Found on Page 2 of your file
    ]

    try:
        with pdfplumber.open(filepath) as pdf:
            for page in pdf.pages:
                text = page.extract_text() or ""
                
                # CHECK: Is this page just a form?
                # We check if ANY of the skip phrases appear in the text
                is_form = any(phrase in text.upper() for phrase in skip_phrases)
                
                if is_form:
                    print("    [INFO] Skipped a 'Form' page (found key phrase)...")
                    continue  # Skip this page, check the next one
                
                # If not a form, it's likely the resume. Keep it.
                valid_text += text + "\n"
                pages_read += 1
                
                # Stop after finding 2 valid pages of resume content
                if pages_read >= 2:
                    break
                    
    except Exception as e:
        print(f"    [ERROR] PDF Read Error: {e}")
        return ""
        
    return valid_text
# --------------------------------------

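OCR Converter Script

Copy the code below into ocr_converter.py. The Renamer does not work with image-only PDFs, so you have to convert them first. The result is only as good as the OCR engine behind ocrmypdf, so check the converted text before trusting it.

Run it from the folder that holds your PDFs:

    python3 ocr_converter.py

The script:
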
import os
import subprocess
import pdfplumber

# Configuration
FOLDER_PATH = "."  # Current folder
MIN_TEXT_LENGTH = 50  # If text is less than this, we assume it's an image

def has_embedded_text(file_path):
    """Checks if a PDF already has text."""
    try:
        with pdfplumber.open(file_path) as pdf:
            full_text = ""
            for page in pdf.pages:
                text = page.extract_text()
                if text:
                    full_text += text
            
            # If we found enough text, return True
            if len(full_text.strip()) > MIN_TEXT_LENGTH:
                return True
    except Exception as e:
        print(f"Error reading {file_path}: {e}")
        return False
    return False

def ocr_file(file_path):
    """Runs OCRmyPDF on the file."""
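    # splitext handles any extension casing (.pdf, .PDF); str.replace(".pdf", ...)
    # would leave "X.PDF" unchanged and make the output path equal the input path
    base, _ext = os.path.splitext(file_path)
    output_path = f"{base}_OCR.pdf"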
    
    # Don't re-OCR if the output already exists
    if os.path.exists(output_path):
        print(f"Skipping {file_path} (OCR version already exists)")
        return

    print(f"🖼️  Image Detected: Converting {file_path}...")
    
    try:
        # Run the OCR command
        # --force-ocr: Process even if it thinks there is some text (often garbage in scans)
        # --deskew: Straighten crooked scans
        command = [
            "ocrmypdf", 
            "--force-ocr", 
            "--deskew", 
            file_path, 
            output_path
        ]
        
        result = subprocess.run(command, capture_output=True, text=True)
        
        if result.returncode == 0:
            print(f"✅ Success: Created {output_path}")
        else:
            print(f"❌ Failed to OCR {file_path}")
            print(result.stderr)
            
    except FileNotFoundError:
        print("❌ Error: 'ocrmypdf' is not installed. Run 'sudo apt install ocrmypdf' first.")

def main():
    print("🔍 Scanning for image-based PDFs...")
    files = [f for f in os.listdir(FOLDER_PATH) if f.lower().endswith(".pdf") and "_OCR" not in f]
    
    count = 0
    for filename in files:
        file_path = os.path.join(FOLDER_PATH, filename)
        
        if not has_embedded_text(file_path):
            ocr_file(file_path)
            count += 1
            
    if count == 0:
        print("🎉 No image-only PDFs found. All files already have text!")
    else:
        print(f"\n✨ Processed {count} files.")

if __name__ == "__main__":
    main()

PDF 2 VCF Script

Copy the code below into pdf2vcf.py. It creates one bulk VCF file that you can import into your contacts.

Place the script in the folder that holds your PDFs and run:

    python3 pdf2vcf.py

The script:

import os
import requests
import json
import pdfplumber
import re
from datetime import datetime
import time

# --- CONFIGURATION ---
FOLDER_PATH = os.path.dirname(os.path.abspath(__file__))
OLLAMA_MODEL = "granite3.3:2b" 
# ---------------------

def get_timestamp():
    """Returns current YYMMDD-HHMMSS"""
    return datetime.now().strftime('%y%m%d-%H%M%S')

def get_short_date():
    """Returns current YYMMDD"""
    return datetime.now().strftime('%y%m%d')

# --- SMART PDF READER ---
def get_smart_pdf_text(filepath):
    """
    Reads PDF pages but SKIPS pages that look like 'Application Forms'.
    Returns the text of the first 2 'valid' resume pages found.
    """
    valid_text = ""
    pages_read = 0
    skip_phrases = [
        "APPLICATION FOR EMPLOYMENT", "OFFICIAL USE ONLY", 
        "DO NOT WRITE BELOW THIS LINE", "PERSONAL DATA SHEET",
        "APPLICANT'S SIGNATURE", "FAMILY BACKGROUND"
    ]

    try:
        with pdfplumber.open(filepath) as pdf:
            for page in pdf.pages:
                text = page.extract_text() or ""
                # CHECK: Is this page just a form?
                if any(phrase in text.upper() for phrase in skip_phrases):
                    continue 
                
                valid_text += text + "\n"
                pages_read += 1
                if pages_read >= 2: break     
    except Exception as e:
        print(f"    [ERROR] PDF Read Error: {e}")
        return ""
    return valid_text

def clean_text_for_llm(text):
    clean = " ".join(text.split())
    return clean[:6000]

def parse_name_from_filename(filename):
    """
    Fallback: Tries to guess the name from a filename like '260101 Kim Ong Diploma.pdf'
    """
    # Remove extension
    base = os.path.splitext(filename)[0]
    
    # Regex: Look for 6 digits at start, then text
    match = re.search(r'^\d{6}\s+(.*?)\s+(?:Bachelor|Diploma|Certificate|General|Master|PhD|Associate|Engineer|Architect)', base, re.IGNORECASE)
    if match:
        return match.group(1).strip()
    
    # Weaker Regex: Just take the first 3 words after the date
    match_weak = re.search(r'^\d{6}\s+([A-Za-z-]+\s+[A-Za-z-]+\s?[A-Za-z-]*)', base)
    if match_weak:
        return match_weak.group(1).strip()

    return None

def ask_ollama_extraction(text, filename):
    """
    Asks LLM to extract specific fields, using the FILENAME as a hint.
    """
    system_instruction = (
        "You are a Data Extraction Expert. Extract details from the resume.\n"
        f"CONTEXT: The file is named '{filename}'. This filename likely contains the correct spelling of the Name and Degree.\n"
        "\nRULES:\n"
        "1. **Double Check the Name:** If the resume text has OCR errors (e.g., 'K1m 0ng'), use the spelling from the Filename ('Kim Ong').\n"
        "2. **Extract:** Full Name, Educational Degree (Short), Email, Phone, and Summary.\n"
        "3. **Summary:** Write a concise 3-sentence summary of their key skills.\n"
        "\nRETURN JSON ONLY:\n"
        "{\n"
        '  "name": "John Doe",\n'
        '  "degree": "BS IT",\n'
        '  "email": "john@email.com",\n'
        '  "phone": "09123456789",\n'
        '  "summary": "Experienced in..."\n'
        "}"
    )

    prompt = f"Resume Text:\n{text}\n\n{system_instruction}"

    url = "http://localhost:11434/api/generate"
    data = {
        "model": OLLAMA_MODEL,
        "prompt": prompt,
        "stream": False,
        "format": "json", 
        "options": {"temperature": 0.1, "num_ctx": 4096}
    }

    try:
        response = requests.post(url, json=data, timeout=60)
        response.raise_for_status()
        result_json = response.json()['response']
        return json.loads(result_json)
    except Exception as e:
        print(f"    [Warning] AI Extraction failed: {e}")
        return None

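def vcf_escape(value):
    """Escapes text for a vCard 3.0 value (backslash, newline, comma, semicolon)."""
    text = str(value or "")
    text = text.replace("\\", "\\\\").replace("\r\n", "\n").replace("\r", "\n")
    return text.replace("\n", "\\n").replace(",", "\\,").replace(";", "\\;")

def create_vcard_string(data, creation_date):
    """
    Formats the data into VCF 3.0 format.
    Format: Name Degree YYMMDD (All in First Name field for easy searching)
    """
    # "or" also covers fields the AI returned as null
    name = data.get("name") or "Unknown"
    degree = data.get("degree") or ""
    email = str(data.get("email") or "").strip()
    phone = str(data.get("phone") or "").strip()
    summary = data.get("summary") or ""

    # Sanitize inputs
    if name == "Unknown":
        name = "Unknown Candidate"

    # Escape free text so a comma, semicolon, or line break in the AI output
    # cannot split a field or break the card
    complex_name = vcf_escape(f"{name} {degree} {creation_date}".strip())
    note = vcf_escape(f"{summary} (Extracted via AI)")

    vcf = [
        "BEGIN:VCARD",
        "VERSION:3.0",
        f"N:;{complex_name};;;",
        f"FN:{complex_name}",
        f"TEL;TYPE=CELL:{phone}",
        f"EMAIL;TYPE=WORK:{email}",
        f"NOTE:{note}",
        f"REV:{datetime.now().isoformat()}",
        "END:VCARD"
    ]
    return "\n".join(vcf) + "\n"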

def process_to_vcf():
    output_filename = f"{get_timestamp()}_Bulk_Import.vcf"
    output_path = os.path.join(FOLDER_PATH, output_filename)
    creation_date = get_short_date() 

    print(f"--- Smart Resume to VCF Exporter ---")
    print(f"Target Output: {output_filename}")
    
    count = 0
    
    with open(output_path, "w", encoding="utf-8") as vcf_file:
        
        for filename in os.listdir(FOLDER_PATH):
            if not filename.lower().endswith(".pdf"):
                continue

            filepath = os.path.join(FOLDER_PATH, filename)
            print(f"Processing: {filename}...")

            # 1. Get Text
            text = get_smart_pdf_text(filepath)
            if len(text) < 50:
                print("    [SKIP] Text too short/unreadable.")
                continue

            # 2. Extract Data (Passing filename for context)
            time.sleep(0.5) 
            data = ask_ollama_extraction(clean_text_for_llm(text), filename)

            if data:
                # 3. Double Check Name (Python Logic Fallback)
                # If AI gave a bad name, or "Unknown", try to grab it from the filename manually
                ai_name = data.get("name", "")
                if not ai_name or "unknown" in ai_name.lower() or any(char.isdigit() for char in ai_name):
                    fallback_name = parse_name_from_filename(filename)
                    if fallback_name:
                        print(f"    [Correction] Replaced '{ai_name}' with filename name: '{fallback_name}'")
                        data['name'] = fallback_name

                # 4. Create VCard Block
                vcard_block = create_vcard_string(data, creation_date)
                vcf_file.write(vcard_block)
                print(f"    -> Added: {data.get('name')} ({data.get('degree')})")
                count += 1
            else:
                print("    -> Failed to extract data.")

    print(f"\nDone! Created {output_filename} with {count} contacts.")

if __name__ == "__main__":
    process_to_vcf()